
Welcome to the 7 new deep divers who joined us since last Wednesday.
If you haven’t already, subscribe and join our community to receive weekly AI insights, updates and interviews with industry experts straight to your feed.
Imagine you’re handed a high‑performance sports car and asked, “Is this safe?” You wouldn’t just glance under the bonnet, apply a coat of paint, and leave it at that.
Yet that’s effectively what happens when organisations evaluate only the post‑mitigation version of an AI model. Research scientists Dillon Bowen et al., in their March 2025 position paper, argue that truly trustworthy evaluation requires two critical snapshots: one before safeguards, and one after.
When we asked Roman Yampolskiy (AI author and Director of the Cybersecurity Lab at the University of Louisville) what one thing he wished he could change about the way organisations are working with AI, he said:
“If I could change one aspect of current AI development, it would be to instill a stronger culture of 'safety-first' in the AI community.”
We’ve pulled key points from the paper by Bowen et al. to consider whether implementing both pre- and post-mitigation checks could significantly improve AI safety.
Most AI companies today report how well their mitigated models refuse harmful prompts. Yet Bowen and colleagues point out that this is only half the story. A model may perform admirably under test conditions – but if its pre‑mitigation self could already produce dangerous advice, that performance is a mirage.
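To make that concrete, here is a minimal sketch of what reporting both snapshots could look like in practice. The prompt set, the model callables and the keyword-based refusal check are hypothetical stand-ins for illustration only, not the evaluation harness Bowen et al. describe:

```python
# A minimal sketch of dual reporting: the same harmful-prompt set is run
# against both the pre-mitigation and post-mitigation versions of a model,
# and both refusal rates are reported side by side.
# `pre_model`, `post_model` and the keyword check are hypothetical stand-ins,
# not the evaluation setup described by Bowen et al.
from typing import Callable, List

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response: str) -> bool:
    # Crude keyword check; real evaluations use trained classifiers or human graders.
    return response.strip().lower().startswith(REFUSAL_MARKERS)


def refusal_rate(model: Callable[[str], str], prompts: List[str]) -> float:
    # Fraction of harmful prompts the model declines to answer.
    return sum(is_refusal(model(p)) for p in prompts) / len(prompts)


def dual_report(pre_model: Callable[[str], str],
                post_model: Callable[[str], str],
                harmful_prompts: List[str]) -> None:
    # Snapshot 1: what the model can do unfiltered.
    pre = refusal_rate(pre_model, harmful_prompts)
    # Snapshot 2: how well the safeguards hold up.
    post = refusal_rate(post_model, harmful_prompts)
    print(f"Pre-mitigation refusal rate:  {pre:.0%}")
    print(f"Post-mitigation refusal rate: {post:.0%}")


# Toy usage with dummy models standing in for real API calls.
dual_report(
    pre_model=lambda p: "Sure, here is how you would...",
    post_model=lambda p: "I can't help with that request.",
    harmful_prompts=["hypothetical harmful prompt 1", "hypothetical harmful prompt 2"],
)
```

The point isn’t the toy keyword check, it’s that both numbers appear in the same report, so readers can see the gap between raw capability and applied safeguards.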
The researchers frame four key scenarios showing why combining both perspectives is essential.
Only by seeing both before and after can company leaders and regulators make informed deployment choices.
The researchers’ survey of major frontier AI labs uncovered worrying gaps – such as evaluations that discuss refusal rates but omit pre-mitigation benchmarks, and vague disclosures that aren’t clearly connected with specific processes or checks.
The industry’s approach is, at best, inconsistent. A lack of standardisation leaves room for ineffective controls.
In response, the paper laid out three recommendations for closing these gaps.
The paper’s authors addressed counterarguments. For example, in response to the idea that just one stage (to either prove a model inherently safe or show post-mitigation robustness) is enough, they argued that neither perspective alone equips you to anticipate future risks – especially as a model’s capabilities grow.
And we want to hear from you. Could this be a wake-up call – if we’re serious about trustworthy AI, do we need to systematically report both views: what a model can do unfiltered, and how effective our controls really are under pressure?
Will this be an important step in the development of safer AI?
We’ll see you back here in your inbox next week.