When agents backtrack, AI starts to scale

If you haven’t already, subscribe and join our community to receive weekly AI insights, updates, and interviews with industry experts straight to your feed.


DeepDive 

Your weekly immersion in AI 

You’re in an escape room. You’ve cracked the cipher (or you think you have, anyway), half-solved the logic grid, and your answer to the riddle feels…plausible. 

Then the door doesn’t open. 

Was the riddle wrong – or did you make a tiny mistake ten minutes ago that’s now snowballing into failure? 

That is, pretty much, the headache of today’s AI agents. 

As the team at Asari AI put it in this article, even if each step is 95% reliable, long sequences can collapse to ‘virtually zero’ success overall.
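That collapse is just multiplication. A quick sketch of the arithmetic (assuming each step fails independently):

```python
# Per-step reliability compounds multiplicatively across a sequence of
# dependent steps: one weak link anywhere sinks the whole run.
def chain_success(per_step: float, steps: int) -> float:
    """Probability every step in the chain succeeds."""
    return per_step ** steps

print(f"10 steps: {chain_success(0.95, 10):.3f}")  # ~0.599
print(f"50 steps: {chain_success(0.95, 50):.3f}")  # ~0.077
```

At 95% per step, a 10-step workflow already fails two times in five, and by 50 steps success is down in the single digits – the ‘virtually zero’ regime the Asari team describes.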

Separating what an agent does from how it searches

A new framework called EnCompass argues that part of the problem is how we program agents today.

In the research paper (by Li, Solar-Lezama, Yue, Zheng), the authors note that agent builders often tangle two things together:

  • the workflow logic (the steps you want the agent to follow), and
  • the inference-time strategy (how you search over the many possible outcomes of LLM calls – tree search, beam search, backtracking, and so on).

Their proposed fix is a programming model called probabilistic angelic nondeterminism (PAN), implemented in Python as EnCompass. The promise looks like this: write the workflow cleanly, then swap search strategies without rewriting the workflow every time.

MIT’s write-up describes the developer experience in practical terms. Instead of hand-coding complicated loops and retry logic, you annotate where results might vary (a ‘branchpoint’) and where outcomes should be scored, and EnCompass handles the backtracking – even cloning program runtimes to explore attempts in parallel.
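As a toy illustration of that separation – with entirely made-up names, not the real EnCompass API – imagine the workflow reduced to its branchpoints (the possible outcomes at each step) plus a scorer, while an interchangeable strategy decides how to search them:

```python
# Illustrative sketch only: the workflow declares where outcomes branch and
# how to score them; the search strategy is a swappable function.
# None of these names correspond to the actual EnCompass API.

def score(path):
    # Stand-in for an outcome scorer (e.g. "do the tests pass?").
    # Pretend only one complete path actually succeeds.
    return 10 if path == ["a", "ddd"] else sum(len(p) for p in path)

# The workflow's branchpoints: candidate outcomes at each step.
stages = [["a", "bb"], ["c", "ddd"]]

def greedy(stages, score):
    """Commit to the best-looking option at each step (no backtracking)."""
    path = []
    for options in stages:
        path.append(max(options, key=lambda o: score(path + [o])))
    return path

def beam(stages, score, width=2):
    """Keep the top `width` partial paths alive, so an early 'mistake'
    can still be recovered later."""
    beams = [[]]
    for options in stages:
        expanded = [p + [o] for p in beams for o in options]
        beams = sorted(expanded, key=score, reverse=True)[:width]
    return beams[0]

print(greedy(stages, score))  # ['bb', 'ddd'] -- locks in the wrong first step
print(beam(stages, score))    # ['a', 'ddd']  -- backtracking finds the winner
```

The point of the sketch: `stages` and `score` never change, yet swapping `greedy` for `beam` changes the outcome – which is exactly the decoupling EnCompass promises, without hand-writing the bookkeeping each time.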

If that sounds like ‘choose-your-own-adventure’ storytelling for code, MIT’s write-up explicitly leans into that metaphor too.

Does it actually save time?

The most concrete story here is engineering effort.

MIT reports EnCompass reduced coding effort for implementing search by up to 80% across agents. In one Java-to-Python repository translation example, using EnCompass meant 348 fewer lines of code – about an 82% reduction compared to implementing search by hand.

Caltech’s coverage offers a crisp comparison on the same theme: one beam-search approach required 75 lines of code with EnCompass versus 423 without it, and overall search-related code can drop by about 80%.

Agent development isn’t just about making it work once – it’s about iterating quickly when the world changes or the workflow grows. And EnCompass could help that happen. 

Reliability isn’t free – but it might be easier to buy

There’s a trade-off here. EnCompass can help you search harder for a correct path – but search costs compute.

MIT reports that, across five repositories, a two-level beam search delivered an accuracy boost of 15%-40%, using a search budget of 16× the LLM calls compared to running without search. Caltech’s translation framing puts it as accuracy rising from 15% to 40% on several repositories.
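A back-of-envelope way to think about that price tag (assumption: taking Caltech’s figures – 15% success without search vs 40% with a 16× LLM-call budget – as representative) is cost per *successful* run:

```python
# Back-of-envelope sketch using the article's figures. What matters in
# production is not calls per attempt, but calls per successful run.
baseline_calls, baseline_acc = 1, 0.15   # single pass, no search
search_calls, search_acc = 16, 0.40      # two-level beam search budget

cost_per_success_baseline = baseline_calls / baseline_acc  # ~6.7 calls
cost_per_success_search = search_calls / search_acc        # ~40 calls

print(f"baseline: {cost_per_success_baseline:.1f} calls per success")
print(f"search:   {cost_per_success_search:.1f} calls per success")
```

On these raw numbers, blind retries would be cheaper per success than search – which suggests search earns its keep mainly when failures are expensive to detect and retry, or when you need a single dependable run rather than a cheap average.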

So the question changes from whether agents can do multi-step work, to what’s the best way to spend compute if you want to make those agents dependable – without making the codebase incredibly complex. 

What to watch next

Two things are worth tracking as this line of work matures: 

  1. Standard patterns for agent programming. If ‘workflow + annotations + plug-in search’ becomes a norm, it could make agent systems easier to inspect, test, and govern.
  2. Smarter search budgets. 16× is a clear, honest number – and a reminder that reliability often has a price tag. The interesting next step is learning when to spend that budget, and when not to.

If you’re building with agents, you’ve almost certainly felt the pain EnCompass is targeting: every retry loop you add makes the system harder to change later. This work suggests a more modular future – one where finding the best path becomes a configurable strategy rather than a rewrite.

What do you think?

Will agent development converge on these more structured programming models – or will end-to-end ‘LLM decides everything’ approaches dominate?

Open this newsletter on LinkedIn and share your perspective. We might reach out to feature your comments in a future edition of DeepDive.
