Back to Blog
SimulationsJan 31, 2026 15 min read

Threat and Reliability Simulations for AI: The Two Pillars of Agent Evaluation

Robust simulation frameworks enable teams to scale AI responsibly. Without proper testing, teams resort to firefighting, addressing failures only after they occur, often introducing new issues while fixing old ones.

Agent simulations surface potential problems and unintended behaviors early, delivering exponential value as AI systems mature and expand. As AI agents become increasingly autonomous and capable, the traditional approach to evaluation focused primarily on whether agents can complete tasks is no longer sufficient. The real question isn't just whether an agent can do something, but whether it does so safely and reliably.

At the intersection of capability testing lies two critical simulation dimensions that every agent must address: threat simulation and reliability simulation. While traditional evaluations ask 'what can this agent do?', these two pillars ask fundamentally different questions: 'how vulnerable is this agent to attack?' and 'can we trust this agent to behave consistently?'

Through our internal research work and with design partners, we have learned how to design more rigorous and useful simulations for agents. Here's what's worked across a range of agent architectures and use cases in real-world deployment.

Why build Simulations?

In the early stages of agent development, teams can make significant progress by relying on hands-on testing, using the product themselves, and trusting their instincts. Rigorous simulation can feel like overhead that slows shipping, but once an agent is live and scaling, building without it starts to break down. Teams end up flying blind debugging reactively, unable to distinguish real regressions from noise or measure progress. Simulation forces teams to define what success looks like early on and holds the quality bar steady as the agent evolves. It also accelerates model adoption: teams with simulation can upgrade in days, while others face weeks of manual testing. Once in place, it provides baselines, regression tracking, and a clear set of metrics for product and research to align on. The compounding value is easy to underestimate: costs are upfront, but benefits accumulate over time.

How to run Simulations for AI agents?

When we started thinking about how to run simulations for AI agents, we kept coming back to two fundamental questions: Can this agent be broken? and Can this agent be trusted?

The first question matters because in an agentic world, a compromised agent isn't just a security flaw; it's an autonomous actor that can cause damage at scale, far beyond what a human adversary could do manually. The second question matters because AI is inherently non-deterministic. You can't just test an agent once and call it good; the same agent can behave differently each time it runs.

These two questions became the foundation of how we structured our approach to simulation and we built it around two core pillars: Threat Simulation, which stress-tests agents against adversarial attacks, and Reliability Simulation, which ensures agents behave consistently and predictably in production.

Pillar 1: Threat Simulation - Security Through Adversarial Testing

Threat simulation involves systematically testing agents against security vulnerabilities through red teaming and adversarial scenarios. Rather than measuring what an agent can accomplish, threat simulation evaluates what an agent can be forced to do by malicious actors.

Threat simulation systematically tests agents against security vulnerabilities through red teaming and adversarial scenarios. Rather than measuring what an agent can accomplish, it evaluates what an agent can be forced to do by malicious actors. Our simulations cover the most critical attack surfaces:

  • Goal & Instruction Hijacking, where attackers attempt to redirect agent behavior through prompt injection or context manipulation;
  • Tool Abuse & Unsafe Actions, which tests whether agents stay within their permitted boundaries or can be pushed into unauthorized actions;
  • Identity & Privilege Abuse, where we simulate delegation and impersonation attacks;
  • Unexpected Code Execution, testing whether agents maintain sandbox boundaries against remote code execution or malicious injection; and
  • Human Trust Exploitation, which evaluates whether agents can be tricked into deceiving or manipulating end users.

To measure how well an agent holds up, we produce a Risk Score that aggregates performance across all these categories tracking the success rate of attacks, severity of any exploits, time to detection, and the effectiveness of the agent's defenses. The goal isn't perfection, but demonstrating that the agent has robust, consistent resistance against known attack vectors.

Pillar 2: Reliability Simulation - Building Trust Through Consistency

While threat simulation focuses on preventing bad outcomes, reliability simulation focuses on ensuring good outcomes consistently. It evaluates whether agents behave predictably, maintain coherent goals, handle edge cases gracefully, and provide transparent, auditable decision-making. Our simulations cover the key dimensions that determine whether an agent can be trusted in production:

  • Goal Stability, which tests whether agents stay focused on their intended purpose or drift over long interactions;
  • Memory Integrity, evaluating whether agents correctly attribute information sources and maintain proper boundaries across sessions;
  • System Resilience, which ensures agents degrade gracefully under failures like API timeouts or malformed inputs rather than breaking down entirely;
  • Human Oversight, testing whether agents correctly flag actions that need human approval and respect overrides;
  • Observability, which measures whether an agent's decisions are transparent, loggable, and auditable; and
  • Governance, ensuring agents apply policies consistently and maintain clear accountability.

What makes reliability simulation particularly challenging is the inherent non-determinism of AI. Unlike threat simulation, where pass/fail is often binary, the same agent can produce different results each time it runs the same task. This is why we measure consistency rather than just capability. To capture this, we produce a Trust Score that tracks consistency across multiple trials, recovery time from failures, transparency of decision-making, and policy compliance. Unlike attack resistance scores, trust scores must be interpreted in context; a score that's acceptable for a research agent may be far too low for a financial trading agent where consistency is critical.

Simulation Depth

Both Threat and Reliability simulation come in three tiers, each designed for a different stage of an agent's journey.

The Baseline Probe is the quickest entry point taking roughly 15–30 minutes per agent and is designed for early testing and internal validation. It runs high-level checks to give an initial signal on whether the agent is obviously unsafe or behaving inconsistently, producing a simple pass or needs-hardening verdict.

The Adversarial Simulation steps up the intensity, taking 1–3 hours per agent, and is built for production-readiness. It runs deeper, multi-step scenarios from tool chaining abuse and memory poisoning on the threat side, to goal drift and failure recovery on the reliability side producing exploit success rates, blast radius estimates, consistency metrics, and actionable hardening insights.

The Regulatory-Grade Stress Test is the most comprehensive, running over 1–3 days in parallel. It's designed for auditors, regulators, and enterprise procurement teams, and covers exhaustive adversarial campaigns, cross-agent and cross-session attacks, cascading failure scenarios, and supply-chain risks. The output is evidence-grade findings with a regulator and auditor-ready report.

Why Both Pillars Matter: The Complementary Nature of Security and Trust

Security and reliability aren't separate concerns; they're deeply intertwined. An agent that's secure but unreliable will frustrate users and hurt adoption. An agent that's reliable but insecure will eventually be exploited. Security measures that are too aggressive can make agents unusable, while perfect behavior under normal conditions means nothing if that behavior can be hijacked by malicious actors. Both pillars need to be strong for an agent to succeed in production. This is why both threat and reliability simulation should be part of a continuous loop rather than a one-time check. Before deployment, we recommend running comprehensive simulations on every significant change. During deployment, we monitor production metrics that correlate with simulation performance. And after deployment, real-world incidents flow back into the simulation suite, a report of inconsistent behavior becomes a reliability test case, a newly discovered injection technique becomes a threat scenario. This feedback loop ensures our simulations stay relevant as both threats and requirements evolve.

Conclusion: Security and Trust as Foundations

Think of agent safety like the Swiss cheese model used in risk management: each layer of defense has holes, but when stacked together, they create a robust system where vulnerabilities rarely align. Threat simulation is your first layer: it identifies and patches security holes before adversaries find them. Reliability simulation is your second layer: it catches behavioral inconsistencies and edge cases that could lead to failures. Manual testing is your third layer: it validates real-world performance and surfaces issues that automated tests might miss.

No single layer is perfect. Threat evals won't catch every possible attack vector. Reliability simulations can't anticipate every edge case in production. Manual testing is limited by human time and imagination. But together, these three layers create overlapping defenses where the holes rarely align. An attack that slips through automated threat testing might be caught by a reliability check that detects unusual behavior. A subtle failure that passes reliability simulation might be spotted during manual testing. An edge case that humans miss might trigger an adversarial scenario.

CompFly AI’s Model of Agentic Trust
CompFly AI’s Model of Agentic Trust: Each has vulnerabilities (holes), but together they create overlapping defenses where holes rarely align. When one layer fails, others catch what slips through.

The path to production-ready AI agents requires more than demonstrating capability. Teams must systematically evaluate agents along two critical dimensions, then validate with human oversight. Threat simulation ensures agents resist adversarial attacks, producing an Attack Resistance Score that identifies and hardens specific vulnerabilities before they're exploited. Reliability simulation ensures agents behave consistently and transparently, producing a Trust Score that reveals which dimensions are strong and which need attention for a given use case. Manual testing provides the final validation layer, ensuring the agent performs as expected in real-world conditions with human judgment as the ultimate checkpoint.

As AI agents take on increasingly critical roles, these simulations become non-negotiable. The question isn't whether to implement them, but how quickly your team can build the infrastructure to deploy agents confidently. Every production incident should strengthen your simulations. Every new attack vector should expand your threat tests. Every report of inconsistent behavior should add to your reliability suite. The value compounds over time but only if you treat simulation as a core component, not an afterthought. The agents that succeed in production won't be the ones that demo the best. They'll be the ones tested most thoroughly against both adversarial attacks and real-world chaos.

AI agent evaluation is still an emerging, rapidly evolving discipline. As agents take on longer-horizon tasks, coordinate within multi-agent systems, and operate in more ambiguous or judgment-driven domains, today’s evaluation approaches will need to evolve. We’ll continue to refine our methods and share what we learn as the field matures.