Is AI ready for autonomy in finance?

Welcome to the 7 new deep divers who joined us since last week.

If you haven’t already, subscribe and join our community in receiving weekly AI insights, updates and interviews with industry experts straight to your feed.


DeepDive 

As excitement builds around the potential for LLMs to transform high-stakes industries, a new benchmark has delivered insights into just how far these models have to go – especially in finance. 

In a detailed report, Finance Agent Benchmark, researchers from Vals AI and Stanford University present the most rigorous evaluation yet of LLMs tasked with real-world financial research. 

They found that even the best-performing AI agent answered fewer than half the questions correctly; and no model exceeded 46.8% accuracy.

Importantly, these results came from real questions drawn from the day-to-day work of financial analysts, investment researchers, and deal teams – not from esoteric academic tests. 

The promise and the problem

Financial workflows are ripe for automation. Entry-level analysts in banks and hedge funds often spend up to 40% of their time collecting and extracting data rather than analysing it. And LLMs, especially when they’re equipped with retrieval tools and search capabilities, seem well-positioned to reduce that load.

But real-world finance is not a closed-book exam. Analysts pull from SEC filings, company reports, earnings calls, press releases, and other sources; all of which require multi-step reasoning, fact extraction, and contextual understanding. 

That’s the bar this benchmark sets for LLMs. 

What did the benchmark measure?

The team developed a dataset of 537 expert-authored questions spanning nine categories – from basic financial retrieval to complex modeling and market analysis. Each question was backed by a publicly verifiable answer using SEC filings and was put through rigorous peer review before being included in the benchmark. 
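The paper's exact data schema isn't reproduced here, but a minimal sketch of what a single benchmark record might look like helps make the setup concrete. The field names and example values below are our own assumptions for illustration, not the dataset's actual format:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one benchmark record; field names are assumptions,
# not the dataset's published schema.
@dataclass
class BenchmarkQuestion:
    question_id: str                      # unique identifier
    category: str                         # one of the nine task categories
    question: str                         # expert-authored question text
    reference_answer: str                 # publicly verifiable answer grounded in SEC filings
    source_filings: list[str] = field(default_factory=list)  # e.g. EDGAR accession numbers

example = BenchmarkQuestion(
    question_id="q-0001",
    category="basic retrieval",
    question="What total revenue did the company report in its most recent 10-K?",
    reference_answer="$X.XX billion",            # placeholder; real answers cite exact figures
    source_filings=["0001234567-24-000001"],     # illustrative accession-number format
)
```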

Then, to make the evaluation realistic, models were equipped with tools (instead of just being prompted). These included access to:

  • Google Search
  • The SEC’s EDGAR database
  • Custom HTML parsing and document retrieval tools

This allowed them to simulate the workflow of a financial analyst: search, retrieve, reason, and report.
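The benchmark's actual harness isn't published in this newsletter, but a hedged sketch of how a tool-equipped agent loop might be wired together looks something like the code below. The tool names, stubs, and the call/response plumbing are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of a tool-equipped research agent loop (assumed plumbing).

def google_search(query: str) -> list[str]:
    """Return a list of result URLs (stub)."""
    raise NotImplementedError

def edgar_lookup(ticker: str, form_type: str) -> str:
    """Return the URL of the latest filing of that type from SEC EDGAR (stub)."""
    raise NotImplementedError

def parse_html(url: str) -> str:
    """Fetch a document and return cleaned text (stub)."""
    raise NotImplementedError

TOOLS = {"google_search": google_search, "edgar_lookup": edgar_lookup, "parse_html": parse_html}

def run_agent(question: str, llm, max_turns: int = 10) -> str:
    """Search -> retrieve -> reason -> report, until the model emits a final answer."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        step = llm(history, tools=list(TOOLS))   # model decides: call a tool or answer
        if step["type"] == "final_answer":
            return step["content"]
        result = TOOLS[step["tool"]](**step["arguments"])
        history.append({"role": "tool", "name": step["tool"], "content": str(result)})
    return "No answer within the turn budget."
```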

The models were then scored using an LLM-as-judge rubric system. Rather than giving points for surface-level similarity, the evaluators broke down each answer into factual checks and contradiction tests.
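The exact rubric prompts aren't reproduced in the report excerpt we cover here, but the idea can be sketched simply: decompose the reference answer into atomic facts, then ask a judge model whether the candidate answer supports or contradicts each one. The prompt wording and scoring weights below are assumptions for illustration:

```python
# Hedged sketch of rubric-style LLM-as-judge scoring (not the paper's exact rubric).

def judge_answer(question: str, reference_facts: list[str], candidate: str, judge_llm) -> float:
    if not reference_facts:
        return 0.0
    supported, contradicted = 0, 0
    for fact in reference_facts:
        verdict = judge_llm(
            f"Question: {question}\n"
            f"Candidate answer: {candidate}\n"
            f"Fact to verify: {fact}\n"
            "Reply with exactly one of: SUPPORTED, CONTRADICTED, NOT_MENTIONED."
        ).strip()
        if verdict == "SUPPORTED":
            supported += 1
        elif verdict == "CONTRADICTED":
            contradicted += 1
    # Reward supported facts and penalise contradictions (weighting is an assumption).
    return max(0.0, (supported - contradicted) / len(reference_facts))
```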

The results are eye-opening 

  • OpenAI’s o3 model came out on top with 46.8% class-balanced accuracy – meaning that even the best model answered fewer than half of the questions correctly across question types.
  • Claude 3.7 Sonnet followed closely at 45.9%, with marginally faster execution and lower cost.
  • Lower-tier models scored in the single digits. One version of LLaMA 3.3 Instruct scored just 2.8%.

In cost terms, models like o3 ($3.79 per query) remain significantly cheaper than human analysts ($25.66 per task). But the stark performance gap underlines that today’s AI is a productivity tool, not a replacement. We still need humans in the loop. 
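A quick back-of-envelope calculation makes that trade-off concrete. Assuming, purely for illustration, that the quoted human cost always buys a correct answer:

```python
# Cost per *correct* answer, using the report's per-query figures.
model_cost_per_query = 3.79      # o3, USD per query
model_accuracy = 0.468           # class-balanced accuracy
human_cost_per_task = 25.66      # USD per task (assumed 100% accurate for illustration)

model_cost_per_correct = model_cost_per_query / model_accuracy
print(f"o3: ~${model_cost_per_correct:.2f} per correct answer")        # ~ $8.10
print(f"Human: ${human_cost_per_task:.2f} per correct answer")
```

Even after adjusting for accuracy, the model remains roughly three times cheaper per correct answer under these illustrative assumptions – but that only holds for tasks where a wrong answer is cheap to catch.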

What makes one model perform better than another?

The study also offers insight into why some models perform better than others, identifying patterns of behaviour that correlate with better results. 

  • More exploration helps: High-performing models made more tool calls and took more ‘turns’ before settling on a final answer.
  • Blind repetition hurts: GPT-4o Mini made the highest number of tool calls – but often repeated the same failing action (e.g., calling EDGAR after an error) without adapting its strategy.
  • Effective tool usage matters more than quantity: Models that balanced their use of search, retrieval, and parsing performed better than those that over-relied on a single method.

These findings suggest that AI’s future in finance requires smarter agents that know how and when to use tools effectively.
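One simple way to operationalise "stop repeating a failing call" is to remember tool calls that have already errored and force the agent to vary its approach. The sketch below is our own illustration of that idea, not the benchmark's code:

```python
# Illustrative guardrail against blind repetition: remember tool calls that
# have already failed and refuse to replay them verbatim, nudging the model
# to change its query or switch tools instead.
failed_calls: set[tuple[str, str]] = set()

def call_tool(name: str, arguments: dict, tools: dict) -> str:
    key = (name, repr(sorted(arguments.items())))
    if key in failed_calls:
        return "This exact call already failed; try different arguments or another tool."
    try:
        return str(tools[name](**arguments))
    except Exception as err:                 # e.g. an EDGAR request erroring out
        failed_calls.add(key)
        return f"Tool error: {err}. Do not repeat the same call unchanged."
```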

Where do we go from here?

For AI professionals building products for finance, this benchmark sets a new standard.

It exposes the current limitations of LLMs in autonomous decision-making – especially in complex domains where a small error can result in massive financial consequences.

It also reinforces the need for rigorous evaluation frameworks. Benchmarks like FinQA or TAT-QA, which rely on static datasets, miss the nuance of live information retrieval and real-time reasoning. Finance Agent Benchmark fills that gap.

Importantly, while the models struggled with complex modeling or analysis, they performed better on basic retrieval tasks. This means financial institutions could safely deploy AI to support analysts on routine data extraction while keeping humans in the loop for anything strategic or high-risk.

But this research doesn’t signal failure. Instead, it marks a new baseline for progress; and given how far LLMs have come over the last two years, that baseline is encouraging. As the report notes, even today's models are orders of magnitude faster and cheaper than humans for certain tasks.

The path to reliable autonomous AI in finance will require better training on domain-specific reasoning; advanced agent frameworks that are capable of adaptive strategy; and deeper integration with structured data sources (like spreadsheets and proprietary systems).

Until then, AI in finance should be seen as a force multiplier, not a fully autonomous analyst.
