Could open models be safe enough to trust?

29 Oct 2025

Welcome to the 6 new deep divers who joined us since last Wednesday.

If you haven’t already, subscribe and join our community in receiving weekly AI insights, updates and interviews with industry experts straight to your feed.

DeepDive

Imagine a world where you could deploy an open-weight generative model, with every parameter available for inspection, and still pass tough security checks. No vendor lock-in. No black boxes. We’re talking full transparency, with safety resilient enough for regulated industries.

Until recently, that sounded optimistic. But new research on guardrails suggests the safety gap is narrowing – and, with the right architecture, it might keep narrowing over time.

Guardrails under stress (why static tuning isn’t enough)

Researchers at Tufts and Indiana University have shown that injecting Gaussian noise into model activations can degrade safety behaviours in open-weight LLMs. Harmful-output rates rose by up to 27% in their tests, and ‘deeper’ safety fine-tuning didn’t reliably help.

The key takeaway from this research is that alignment fine-tunes can be brittle under stress.

That doesn’t mean open models can’t be used safely. But it does mean safety needs to be multi-layered and runtime-aware, not just a one-off training pass.

Practical protection layers (what the new systems do)

LlamaFirewall (an open-source framework from Meta researchers) wraps models and agents with three complementary defences:

PromptGuard 2 (jailbreak detection)
AlignmentCheck (runtime reasoning auditor)
CodeShield (static analysis for generated code).

In evaluations on the AgentDojo benchmark, combining PromptGuard 2 and AlignmentCheck cut attack success rate from 17.6% to 1.75% (a >90% reduction) with a utility drop from 0.477 to 0.427 – a trade-off the authors characterise and discuss. This is still offline/benchmark testing, but it’s a concrete, transparent template for layered defences.

Another complementary direction is AdaptiveGuard: a runtime safety system that treats novel jailbreaks as out-of-distribution inputs and learns to defend post-deployment. Reported results include 96% OOD detection accuracy and adaptation ‘in just two update steps,’ while retaining over 85% F1 on in-distribution data after adaptation. That adaptability matters because attackers iterate – and static guardrails lag.

Coming back to the question of trust

It’s the big question across AI development and adoption right now: what will enable people to truly trust in the tech?

With this in mind, for enterprises, the open-vs-closed decision isn’t only about raw capability. It’s also very much about auditability and assurance. Closed models hide weights/training data; safety claims rely largely on vendor assertions. Open models enable independent inspection and benchmarking, including external review of guardrails like LlamaFirewall or AdaptiveGuard.

But for those of us working within the AI space, it’s important to remember that technical safety isn’t enough. When we interviewed previous DeepFest speaker Angela Kane (Vice President at the International Institute for Peace, Vienna; Former UN High Representative for Disarmament Affairs), she said:

“Right now the discussion about AI and its benefits – or drawbacks – takes place in highly-educated elite circles. Most people are afraid to enter into discussions for fear of showing lack of knowledge or awareness.”

As we design and implement better guardrails, we also need clearer reporting and inclusive dialogue so people who aren’t AI experts can see how safety really works.

Moving towards an open ecosystem

The open-model ecosystem keeps advancing. Llama 3/3.1/3.2 broadened the open options in 2024, and Mixtral established open-weight Mixture-of-Experts models; while Falcon pushed open access earlier. More capable open weights raise the stakes for independent safety and repeatable assurance.

Collaboration is critical to keep moving towards this open, trustworthy ecosystem. And at DeepFest, we’re committed to our own role in enabling that movement. As Cathy Hackl (CEO at Spatial Dynamics) put it:

“It’s more than getting inspiration; DeepFest is about creating new connections, seeking new collaborators and engaging with some of my peers in the AI and spatial computing industries.”

Open models thrive when ecosystems share tools, datasets, evaluations and guardrail patterns. That’s as much a community project as a technical one.

What to watch next

Adaptive guardrails: Work like AdaptiveGuard points to a shift from static moderation to learning safety that updates with new attacks.
Independent certification: Expect more third-party, open benchmarks (AgentDojo-style tests) and publishable safety claims for open weights.
Policy interplay: As assurance practices mature, we’ll learn whether open + guarded can meet compliance expectations in sensitive sectors – with transparent artefacts for auditors.

The open-vs-closed debate may soon feel less like a question of which is safer, and more like a question of which can demonstrate and maintain safety – quickly, transparently, and under pressure.

Join us at DeepFest 2026 to explore the latest development in AI governance and safety. Register now – we can’t wait to see you there.

PRE-REGISTER FOR DEEPFEST 2026

Share on: