AI safety starts before the model
Governance, risk allocation, and system design are doing more work than your benchmark score.
There is a particular kind of confidence that comes from a good safety benchmark score. Your model refused harmful prompts. It passed the red-team suite. The evaluation report is clean. You ship. [7]
Then something goes wrong — not because the model misbehaved in a way your tests anticipated, but because a deployment decision made months earlier gave it more autonomy than the task required. Or because no one defined what a safety incident actually was, so no one knew they were having one. Or because the training data contained a subtle contamination that no benchmark was designed to catch. [10, 11]
The benchmark told you the model was safe. It said nothing about the system. [11]
AI safety is fundamentally a governance, risk allocation, and system design problem. The model is one layer. It is not the whole stack. [11]
What 'safe AI' actually means — and why most teams pick the wrong definition
When practitioners talk about AI safety, they usually mean one thing: the model does not produce harmful outputs when prompted. This is a reasonable starting point. It is not a complete framework. [11]
There are at least four distinct things that 'safe AI' can mean, and they operate at different levels of the stack:
– The model does not produce harmful outputs — it refuses appropriately, avoids toxicity, resists jailbreaks. This is what benchmarks measure. [7, 8]
– The system is resilient to adversarial use — it holds up against indirect prompt injection, multi-turn manipulation, and attacks that exploit context rather than single prompts. This is what red-teaming measures, imperfectly. [4, 10]
– The organisation has accountability structures — someone owns safety decisions, external evaluators have independent publishing rights, and there are defined escalation paths when something goes wrong. This is what governance frameworks measure. [6, 11]
– The product design does not cause harm by design — the system does not surveil users, exploit vulnerabilities in the people it serves, or make consequential decisions without human checkpoints. This is what impact assessments measure. [1, 2, 11]
Most AI teams operate almost entirely within definition one. When they expand, they move to definition two. Definitions three and four — the ones that require organisational and design decisions rather than technical fixes — are where the most consequential failures actually live. [6, 11]
The reframe matters because it changes what you invest in. If safety is primarily a model behaviour problem, you invest in evaluations and guardrails. If it is also a governance and design problem, you invest in accountability structures, threat modelling, and deployment criteria. Both are necessary. The second category is underinvested in almost every AI team building today. [11]
What happens before the model is chosen
The most important safety decisions are made before a model is selected, fine-tuned, or deployed. They are made when a team decides what the product is for, who it serves, and what the system is permitted to do autonomously. [1, 3, 11]
Risk classification comes first. The EU AI Act defines four risk tiers: unacceptable, high, limited, and minimal risk. Since February 2025, practices in the unacceptable tier have been legally banned within the EU. Whether or not your jurisdiction requires compliance, the classification exercise forces a useful discipline: what is this system actually doing, and which failure modes matter? [1, 2]
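That discipline can be made concrete as a documented classification record rather than an informal judgement. A minimal sketch in Python, with invented field names; the tier labels mirror the Act's categories, but a real legal classification depends on the Act's annexes and on counsel, not on a lookup table:

```python
from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    """Tiers mirroring the EU AI Act's four categories."""
    UNACCEPTABLE = "unacceptable"   # prohibited outright since February 2025
    HIGH = "high"                   # heavy obligations: conformity, oversight, logging
    LIMITED = "limited"             # transparency obligations
    MINIMAL = "minimal"             # no specific obligations


@dataclass
class RiskClassification:
    """A classification that is written down: the tier plus the reasoning behind it."""
    use_case: str
    tier: RiskTier
    rationale: str

    def obligations(self) -> list[str]:
        # Illustrative obligations per tier; the authoritative list is in the Act itself.
        return {
            RiskTier.UNACCEPTABLE: ["do not deploy"],
            RiskTier.HIGH: ["conformity assessment", "human oversight", "logging"],
            RiskTier.LIMITED: ["disclose AI involvement to users"],
            RiskTier.MINIMAL: [],
        }[self.tier]


record = RiskClassification(
    use_case="internal support-ticket triage",
    tier=RiskTier.LIMITED,
    rationale="No consequential decisions about individuals; users see AI-drafted replies.",
)
```

The value is not the code; it is that the tier and the rationale exist as an artefact someone signed off on, rather than as a shared assumption.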
Threat modelling follows. Adapted from the STRIDE framework for AI systems, this means asking: what inputs can adversaries control? What can the model do autonomously? What outputs go where, and what happens downstream? This is not a one-time exercise — the threat surface expands every time the system gains a new capability. [3, 10]
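The three questions can live as a maintained artefact next to the system's configuration, so that every new capability forces the record, and the review, to update. A minimal sketch with hypothetical field names:

```python
from dataclasses import dataclass, field


@dataclass
class ThreatModel:
    """Answers to the three core questions, kept under version control with the system."""
    adversary_controlled_inputs: set[str] = field(default_factory=set)  # what attackers can write
    autonomous_capabilities: set[str] = field(default_factory=set)      # what the model can do unprompted
    downstream_outputs: set[str] = field(default_factory=set)           # where outputs flow

    def grant_capability(self, capability: str) -> None:
        """Every new capability expands the threat surface and must be recorded, not assumed."""
        self.autonomous_capabilities.add(capability)

    def surface_size(self) -> int:
        return (len(self.adversary_controlled_inputs)
                + len(self.autonomous_capabilities)
                + len(self.downstream_outputs))


tm = ThreatModel(
    adversary_controlled_inputs={"user messages", "retrieved web pages"},
    autonomous_capabilities={"search internal docs"},
    downstream_outputs={"reply shown to user"},
)
before = tm.surface_size()
tm.grant_capability("send email")  # surface grew: trigger a fresh threat review
```

A diff in this record is a prompt to re-run the review, which is the operational meaning of "not a one-time exercise".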
Training data governance is where the invisible debt accumulates. Web-scraped data contains CSAM, hate speech, personally identifiable information, and malicious code. Even a small contamination — research has shown as few as 250 malicious documents in a training set — can cause targeted harmful outputs. Dataset poisoning risk is not hypothetical. It is a known attack vector that is routinely underweighted in pre-training hygiene. [3, 5]
None of these decisions involve the model. All of them shape what the model can and cannot do safely. [3, 5, 11]
The organisational prerequisites that most assessments skip
There are two very different kinds of organisations in the AI world right now. The first kind builds the models — companies like OpenAI, Anthropic, or Google. They decide what the model is trained on, what it can do, and how safe it is at a fundamental level. The second kind builds products and services on top of those models, using commercial APIs. Most companies fall into this second category.
The rules are not the same for both. NIST's AI Risk Management Framework makes this explicit: it assigns different responsibilities to different roles in the AI chain, and notes that guidance written for AI developers "may not be relevant to AI deployers." When you are using a third-party model, you did not build it and you cannot fully see inside it — NIST acknowledges this directly, noting that risk tolerances "may not align" between the provider and the organisation deploying their technology. That gap is real. It does not mean the deploying organisation has no responsibilities; it means the responsibilities are different, and more specific. A company building on a commercial API is responsible for the decisions it makes about how to use the model: what the product does, who it serves, how much autonomy the system is given, and what happens when it gets something wrong. [11]
Those are deployment decisions, not model decisions. They belong entirely to the organisation doing the deploying. The model provider cannot make them for you, and a good benchmark score from the provider does not cover them. A model that performs well in evaluation can still be deployed irresponsibly — into the wrong use case, with too much autonomy, with no one owning the risk, and no process for catching problems early. That is where most real-world harm actually originates: not in the model, but in the decisions made around it.
The OECD AI Principles put it simply: AI actors should manage risk “based on their roles, the context, and their ability to act.” Your role is not to publish a model specification or evaluate existential risks at the frontier. Your role is to govern your own deployment well — to define the use case clearly, classify the risk it carries, decide how much autonomy the system should have, and make sure someone owns the problem when something goes wrong. That is the scope of accountability that belongs to an organisation building on a commercial API, and it is a meaningful scope. [11]
Four questions that separate organisations structurally capable of safe AI deployment from those that are not:
– Has the organisation defined which risk category its AI deployment falls into, and documented what obligations follow from that classification? Without this, risk management is informal and inconsistent. [1, 11]
– Are there clear deployment criteria — what conditions must be satisfied before an AI system goes live internally or externally, and who has authority to halt deployment if those conditions are not met? [11]
– Does someone in the organisational structure own AI-related risks as a defined accountability, not a secondary responsibility? Absence of clear ownership is how known risks go unaddressed. [11]
– Is there a working escalation path for AI-related concerns, and do the people closest to the system know about it? Internal visibility of problems is a prerequisite for addressing them. [11]
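The four questions reduce to a checkable precondition: if any answer is missing, the system is not waiting on a benchmark result, it is waiting on governance. A minimal sketch, with invented field names:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GovernanceRecord:
    """One answer per question; None means the organisation has no answer."""
    risk_classification: Optional[str]
    deployment_criteria: Optional[str]
    risk_owner: Optional[str]
    escalation_path: Optional[str]

    def missing(self) -> list[str]:
        # Any unanswered question is a structural gap, not a detail to fix later.
        return [name for name, value in vars(self).items() if not value]

    def ready_to_deploy(self) -> bool:
        return not self.missing()


record = GovernanceRecord(
    risk_classification="limited risk, documented March 2025",
    deployment_criteria="evaluation thresholds signed off by risk owner",
    risk_owner=None,  # nobody owns the risk, so the answer is: not ready
    escalation_path="on-call rota plus a named incident channel",
)
```

The point of making the record explicit is that "we have not decided yet" becomes visible instead of implicit.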
If governance, risk ownership, deployment criteria, and escalation paths are vague, no benchmark score will save the system. [11]
Model behaviour as one layer, not the whole stack
Benchmarks and guardrails do matter. The question is what they can and cannot tell you. [7, 8, 9]
MLCommons AILuminate — one of the more rigorous publicly available harm benchmarks as of 2025 — covers 12 hazard categories and uses graded evaluation rather than simple binary pass/fail reporting. It is a material step forward over many earlier public safety evaluations, but it still does not close the evaluation gap on its own. [7, 8]
The gap is structural: any public benchmark operates on the assumption that the prompts in its dataset represent the prompts your adversaries will use. They will not. Once a benchmark is public, it gets gamed — not always deliberately, but because fine-tuning against known eval data is the path of least resistance. A model scoring perfectly on a public benchmark may still be vulnerable to novel attacks not in that dataset. [7, 8, 10]
XSTest is worth mentioning for a different reason: it tests for over-refusal — cases where a model incorrectly refuses a benign request because it pattern-matches to something harmful. Related work such as OR-Bench extends this line of evaluation at larger scale. A model that refuses questions about medication in a clinical context is not safer; it is less useful and can be harmful in the opposite direction. Both failure modes matter, and a safety programme that only measures under-refusal is measuring only half the problem. [9, 12]
The point is not that evaluation is futile. It is that a benchmark score is a lower bound on actual safety, not a ceiling — and treating it as the latter is the most common mistake in how AI safety is practised today. [7, 8, 9, 11]
Deployment and architecture decisions that determine safety outcomes
The OWASP Top 10 for Agentic Applications — published in December 2025 and developed with input from over 100 security researchers — introduces a principle that should be foundational to any AI product with autonomous capabilities: the principle of least agency. [10]
The principle is simple: only grant agents the minimum autonomy required for safe, bounded tasks. If the task does not require web browsing, do not give the agent a browser. If it does not require file system access, do not grant it. Capability restrictions are not limitations on product quality — they are first-line safety controls. [10]
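In practice, least agency is an allowlist, not a denylist: a tool absent from the grant is a hard failure, not a warning. A minimal sketch with hypothetical tool names; real agent frameworks expose similar capability scoping:

```python
class LeastAgencyError(PermissionError):
    """Raised when an agent attempts a tool it was never granted."""


class ScopedAgent:
    """An agent that can only invoke tools explicitly granted for its task."""

    def __init__(self, task: str, allowed_tools: set[str]):
        self.task = task
        self.allowed_tools = allowed_tools  # the minimum set the task requires, nothing more

    def invoke(self, tool: str, payload: str) -> str:
        if tool not in self.allowed_tools:
            # Deny by default: absence from the allowlist is a hard stop.
            raise LeastAgencyError(f"{tool!r} not granted for task {self.task!r}")
        return f"ran {tool} on {payload!r}"


# A summarisation task needs to read the document. It does not need a browser.
agent = ScopedAgent(task="summarise ticket", allowed_tools={"read_document"})
result = agent.invoke("read_document", "ticket-4812")
```

The design choice worth noting is deny-by-default: new capabilities must be granted explicitly, which makes each grant a reviewable decision.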
Human-in-the-loop design deserves the same level of deliberate attention. For high-stakes decisions — medical, legal, financial, or any action with significant real-world consequences — the system should be designed so consequential actions require human confirmation rather than being executed autonomously. This is an architectural decision, not a feature you can add later. [11]
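Architecturally, the checkpoint is a gate between the model proposing an action and the system executing it. A minimal sketch; classifying actions by name here is purely for illustration, where a real system would classify by consequence:

```python
from dataclasses import dataclass
from typing import Callable

# Actions above this bar never execute without a human decision.
# Invented names: a real policy would classify by consequence, not by label.
CONSEQUENTIAL_ACTIONS = {"issue_refund", "send_prescription", "delete_account"}


@dataclass
class ProposedAction:
    name: str
    details: str


def execute(action: ProposedAction,
            human_approves: Callable[[ProposedAction], bool]) -> str:
    """Consequential actions route to a human; everything else runs autonomously."""
    if action.name in CONSEQUENTIAL_ACTIONS:
        if not human_approves(action):
            return f"blocked: {action.name} awaiting human review"
        return f"executed {action.name} with human confirmation"
    return f"executed {action.name} autonomously"


low_stakes = execute(ProposedAction("draft_reply", "to customer"),
                     human_approves=lambda a: False)
consequential = execute(ProposedAction("issue_refund", "£240"),
                        human_approves=lambda a: False)
```

Because the gate sits in the execution path rather than in the prompt, no amount of model persuasion can route around it.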
Output handling is underestimated as a failure surface. Model outputs are untrusted. Rendering model output directly — executing it, displaying it without sanitisation, passing it to downstream systems — creates risk that exists independently of how well the model behaved. Guardrail architecture that applies only to inputs, not outputs, is half a guardrail. [10, 11]
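Treating outputs as untrusted means applying the same escaping and validation you would apply to user input. A minimal sketch using the standard library's `html.escape` for display, plus an invented allowlist for downstream commands:

```python
import html


def render_for_browser(model_output: str) -> str:
    """Escape before display: a model that emits <script> must not become XSS."""
    return html.escape(model_output)


# Invented command names: outputs are proposals, not instructions.
ALLOWED_DOWNSTREAM_COMMANDS = {"create_ticket", "append_note"}


def forward_downstream(model_output: str) -> str:
    """Validate before acting on an output in another system."""
    command = model_output.strip().split()[0]
    if command not in ALLOWED_DOWNSTREAM_COMMANDS:
        raise ValueError(f"model proposed unapproved command: {command!r}")
    return model_output


safe = render_for_browser('<script>alert("hi")</script>')
# safe == '&lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;'
```

The same pattern applies symmetrically: whatever sanitisation the input path has, the output path needs its equivalent.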
Semantic rate limiting goes beyond token-rate limiting. Adversarial campaigns against model constraints often use varied phrasing to avoid detection — the same intent expressed twenty different ways across twenty turns. Systems that rate-limit only by volume miss this pattern entirely. Systems that track semantic intent across turns do not. [10]
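The idea can be sketched with token-set overlap standing in for the embedding similarity a production system would use; the threshold and window below are invented for illustration:

```python
def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercased tokens: a toy proxy for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


class SemanticRateLimiter:
    """Flags repeated attempts at the same intent, however it is rephrased."""

    def __init__(self, threshold: float = 0.5, max_similar: int = 3):
        self.threshold = threshold      # how alike two prompts must be to count as one intent
        self.max_similar = max_similar  # how many similar attempts are tolerated
        self.history: list[str] = []

    def check(self, prompt: str) -> bool:
        """Return True if the prompt is allowed, False if the intent is being hammered."""
        similar = sum(1 for past in self.history
                      if similarity(prompt, past) >= self.threshold)
        self.history.append(prompt)
        return similar < self.max_similar
```

A volume-based limiter would pass all twenty rephrasings; a limiter keyed on intent stops the campaign after a handful.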
What we still cannot measure — an honest accounting
Intellectual honesty about the limits of evaluation is itself a safety practice. Teams that believe their safety programme is more complete than it is take risks they do not know they are taking. [11]
Benchmark contamination is structural. Once a benchmark is published, labs — deliberately or incidentally — fine-tune against it. This is not unique to AI; it is Goodhart's Law applied to safety evaluation. The measure becomes the target, and the target stops measuring what it was designed to measure. [7, 9]
Distribution shift is the harder problem. The prompt space is effectively infinite. Current benchmarks rely on predefined prompts that target well-understood failure modes. Unknown failure modes — vulnerabilities that manifest in deployment but were not anticipated in evaluation — are, by definition, largely unexplored. [11]
The evaluation-deployment gap is where most real-world harm occurs. Controlled benchmark conditions do not capture the creativity and persistence of actual adversaries, the long-tail of real user behaviour, or the emergent risks of novel use cases the team did not design for. [10, 11]
Aggregation is an unsolved problem. There is no principled way to combine scores across dimensions — safety, truthfulness, fairness, privacy — into a single safety verdict. Any aggregate score involves subjective weighting decisions that should be made explicit but rarely are. [11]
This is not a counsel of despair; evaluation is necessary and worth doing carefully. It is a counsel of realism: safety assessment must be treated as a continuous, sceptical, multi-layered process rather than a one-time certification event. [11]
The decisions that actually determine safety outcomes
If safety is a governance, risk allocation, and system design problem, then the questions that determine safety outcomes are not 'what did the model score on AILuminate?' They are the decisions made in design reviews, deployment conversations, and incident post-mortems. [11]
Five questions worth making explicit in every AI deployment:
– At what benchmark failure rate do you delay deployment? This number should exist before evaluation begins. If it does not, the evaluation will be interpreted to fit the shipping date rather than the risk threshold. [11]
– What constitutes a safety incident? If this is not defined in advance, it will be defined in retrospect by whoever is most motivated to minimise the count. [11]
– Who owns the deployment decision when evaluation results are ambiguous? Ambiguity in evaluation is the norm, not the exception. The accountability structure for that decision needs to exist before the ambiguity arises. [6, 11]
– What does the system do in edge cases it was not designed for? Novel use cases are inevitable. The architecture should fail gracefully — restricting autonomy, routing to a human, or declining — rather than attempting a harmful completion. [10, 11]
– What would cause you to roll back? Retraining triggers and rollback criteria should be defined before deployment, not derived from the severity of the incident that prompted them. [11]
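The first and last questions share a property: the answer must be written down, as a number or a condition, before evaluation begins. A minimal sketch of such a record, with all thresholds invented for illustration:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: thresholds are fixed before evaluation, not negotiated after
class DeploymentGate:
    agreed_on: date
    max_unsafe_response_rate: float   # above this, deployment is delayed
    max_over_refusal_rate: float      # refusing benign requests is also a failure mode
    rollback_incident_count: int      # confirmed safety incidents that trigger rollback

    def allows_deploy(self, unsafe_rate: float, over_refusal_rate: float) -> bool:
        return (unsafe_rate <= self.max_unsafe_response_rate
                and over_refusal_rate <= self.max_over_refusal_rate)


gate = DeploymentGate(
    agreed_on=date(2025, 3, 1),
    max_unsafe_response_rate=0.005,
    max_over_refusal_rate=0.05,
    rollback_incident_count=2,
)
```

The `frozen` flag is the point: once evaluation results arrive, the thresholds cannot quietly be relaxed to fit the shipping date.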
These are not technical questions. They are governance questions. They require people with authority and accountability in the room, not only people with technical expertise. [11]
The question to ask before the next model decision
Before the next model is selected, fine-tuned, or deployed: are the conditions in place for this system to be safe, or are we assuming the model will compensate for what the organisation and architecture have not provided? [11]
A model with excellent benchmark scores deployed into a system with excessive agency, no incident definition, and no escalation path is not a safe product. A model with modest benchmark scores deployed into a system with clear risk ownership, tight capability scoping, and human checkpoints for consequential decisions is considerably safer than the scores suggest. [10, 11]
The benchmark score is real information. It is one input into a decision that is ultimately about governance, architecture, and accountability. Treating it as the decision is the mistake that most AI safety failures have in common. [11]
Safety starts before the model. If governance, risk ownership, deployment criteria, and escalation paths are vague, no benchmark score will save the system. [11]
References
[1] European Commission. "AI Act | Shaping Europe’s digital future." Application timeline states that prohibited AI practices entered into application on 2 February 2025.
[2] EU AI Act, Article 5. Prohibited practices include manipulative systems and systems that exploit vulnerabilities due to age, disability, or social/economic situation.
[3] Microsoft Learn. "Threat Modeling AI/ML Systems and Dependencies." Guidance on AI/ML threat modeling, poisoned training data, and adapting threat-modeling practice to AI systems.
[4] OWASP Cheat Sheet Series. "LLM Prompt Injection Prevention Cheat Sheet." Describes direct and indirect prompt injection risks in LLM applications.
[5] Anthropic, UK AISI, and The Alan Turing Institute. "A small number of samples can poison LLMs of any size." Reports that injecting 250 malicious documents can backdoor models in their experimental setup.
[6] Future of Life Institute. "AI Safety Index – Summer 2025." Evaluates seven leading AI companies across 33 indicators and six domains; reports no company above D in existential safety planning and only three firms with substantive dangerous-capability testing.
[7] MLCommons. "AILuminate Benchmark." Describes the Safety v1.0 benchmark scope, 12 hazard categories, public/private prompts, and broader benchmark family.
[8] MLCommons. "AILuminate official benchmark methodology and results." Describes the 5-point grading scale: Poor, Fair, Good, Very Good, Excellent.
[9] Cui et al. "OR-Bench: An Over-Refusal Benchmark for Large Language Models." Large-scale over-refusal benchmark and discussion of the safety-helpfulness trade-off.
[10] OWASP GenAI Security Project. "OWASP Top 10 for Agentic Applications for 2026." Published 9 December 2025; developed with more than 100 experts and used here for least-agency and agentic security guidance.
[11] NIST. "Artificial Intelligence Risk Management Framework (AI RMF 1.0)." Risk management, governance integration, risk prioritization, and the need to halt development/deployment where unacceptable risks are present.
[12] Röttger et al. "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models." Source on over-refusal and false-refusal evaluation.