How to build trustworthy AI

Welcome to the 7 new deep divers who joined us since last Wednesday.

If you haven’t already, subscribe and join our community to receive weekly AI insights, updates and interviews with industry experts straight to your feed.


DeepDive

Imagine you’re handed a high‑performance sports car and asked, “Is this safe?” You wouldn’t just glance under the bonnet, apply a coat of paint, and leave it at that.

Yet that’s effectively what happens when organisations evaluate only the post‑mitigation version of an AI model. Research scientists Dillon Bowen et al., in their March 2025 position paper, argue that truly trustworthy evaluation requires two critical snapshots: one before safeguards, and one after. 

When we asked Roman Yampolskiy (AI author and Director of the Cybersecurity Lab at the University of Louisville) what one thing he wished he could change about the way organisations are working with AI, he said: 

“If I could change one aspect of current AI development, it would be to instill a stronger culture of 'safety-first' in the AI community.” 

We’ve pulled key points from the paper by Bowen et al. to consider whether implementing both pre- and post-mitigation checks could significantly improve AI safety. 

The blind spot: Assessing just the aftermath 

Most AI companies today report how well their mitigated models refuse harmful prompts. Yet Bowen and colleagues point out that this is only half the story. A model may perform admirably under test conditions – but if its pre‑mitigation self could already produce dangerous advice, that performance is a mirage.

Why two views matter 

The researchers frame four key scenarios that show why combining both perspectives is essential:

  1. No dangerous capabilities and strong refusal = solid case for safe deployment.
  2. No dangerous capabilities but weak refusal = mitigations may need reinforcing.
  3. Dangerous capabilities but strong refusal = may be safe for API release, but white‑box access remains a no‑go.
  4. Dangerous capabilities and weak refusal = clearly unsafe; mitigation must improve.

Only by seeing both before and after can company leaders and regulators make informed deployment choices.
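
For readers who like to see that logic written out, here is a minimal Python sketch of the decision table above. It is purely illustrative – the EvaluationResult type and deployment_recommendation function are our own naming for this newsletter, not anything defined in the paper.

```python
# Illustrative sketch only (not from the paper): the four scenarios above,
# written as a simple decision table over the two evaluation snapshots.

from dataclasses import dataclass


@dataclass
class EvaluationResult:
    dangerous_capabilities: bool  # pre-mitigation: could the raw model produce dangerous output?
    strong_refusal: bool          # post-mitigation: do safeguards hold under adversarial prompting?


def deployment_recommendation(result: EvaluationResult) -> str:
    """Map a pair of evaluation snapshots to one of the four scenarios."""
    if not result.dangerous_capabilities and result.strong_refusal:
        return "Solid case for safe deployment."
    if not result.dangerous_capabilities:
        return "Mitigations may need reinforcing."
    if result.strong_refusal:
        return "Possibly safe for API release; white-box access remains a no-go."
    return "Clearly unsafe; mitigation must improve."


print(deployment_recommendation(EvaluationResult(dangerous_capabilities=True, strong_refusal=True)))
```

The sketch makes one thing obvious: the recommendation depends on both answers. Drop either evaluation and half of the table becomes unreachable.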

What are the core recommendations? 

The researchers’ survey of major frontier AI labs uncovered worrying gaps – such as evaluations that discuss refusal rates but omit pre-mitigation benchmarks, and vague disclosures that aren’t clearly connected with specific processes or checks. 

The industry’s approach is, at best, inconsistent. A lack of standardisation leaves room for ineffective controls. 

So the paper laid out three recommendations: 

  1. Mandatory disclosure of pre‑mitigation ‘dangerous capability’ evaluations, so we know what a model could do, not just what it does after safeguards.
  2. Post‑mitigation refusal evaluations, to confirm that safety mechanisms hold even under adversarial prompting (a rough sketch of such a check follows this list).
  3. Standardised evaluation practices, including consistent methodology and transparency, so regulators and researchers can compare apples with apples.
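
To make the first two recommendations a little more concrete, here is a minimal Python sketch of a refusal-rate measurement run against both snapshots of a model. The generate and looks_like_refusal callables are hypothetical stand-ins for a real model call and a real refusal classifier – this is the general shape of such an evaluation, not the authors’ methodology.

```python
# Illustrative sketch only. `generate` and `looks_like_refusal` are hypothetical
# stand-ins for a real model call and a real refusal/harm classifier.

from typing import Callable, List


def refusal_rate(generate: Callable[[str], str],
                 harmful_prompts: List[str],
                 looks_like_refusal: Callable[[str], bool]) -> float:
    """Fraction of harmful prompts for which the model declines to answer."""
    refused = sum(looks_like_refusal(generate(p)) for p in harmful_prompts)
    return refused / len(harmful_prompts)


# Reporting both numbers side by side is the point: the pre-mitigation score
# shows what the model could do, the post-mitigation score shows whether the
# safeguards actually hold.
# pre_score  = refusal_rate(pre_mitigation_generate, prompts, looks_like_refusal)
# post_score = refusal_rate(post_mitigation_generate, prompts, looks_like_refusal)
```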

We want to know what you think 

The paper’s authors addressed counterarguments. For example, in response to the idea that a single evaluation stage is enough – either proving a model inherently safe or demonstrating post-mitigation robustness – they argued that neither perspective alone equips you to anticipate future risks, especially as a model’s capabilities grow. 

And we want to hear from you. Could this be a wake-up call – if we’re serious about trustworthy AI, do we need to systematically report both views: what a model can do unfiltered, and how effective our controls really are under pressure? 

Will this be an important step in the development of safer AI? 

We’ll see you back here in your inbox next week.
