White Paper (Zenodo)
Published on Zenodo
1 September 2025
AI Agent Failure Modes in Production Systems
Production AI agents fail in predictable ways — not because models are weak, but because orchestration, tools, and guardrails are under-designed. This white paper documents six failure modes observed across HRTech and HealthTech deployments, and proposes a three-layer resilience model: input guardrails, runtime circuit breakers, and post-hoc evaluation replay.
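As an illustration of the middle layer, a runtime circuit breaker around a tool call might look like the sketch below. This is a minimal example under stated assumptions, not the implementation from the white paper: the names `CircuitBreaker`, `CircuitOpenError`, `max_failures`, and `reset_after_s` are hypothetical, and the thresholds are placeholders.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: stops calling a tool after repeated failures."""

    def __init__(
        self,
        max_failures: int = 3,          # assumed threshold, tune per tool
        reset_after_s: float = 30.0,    # assumed cool-down in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, tool: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of letting timeouts cascade downstream.
                raise CircuitOpenError("tool disabled; use fallback or escalate")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = tool(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        return result
```

Catching `CircuitOpenError` in the orchestrator gives a natural place to route the request to a fallback or a human reviewer rather than retrying a tool that is already failing.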
Key takeaways
- Six recurring failure modes: tool timeout cascades, silent hallucination, context bleed, wrong-entity merges, prompt injection, and drift without detection.
- A three-layer model (guardrails, runtime checks, eval replay) reduced incidents by 40% in a live ATS agent deployment.
- Eval harnesses belong in CI/CD, not slide decks; golden traces beat synthetic-only tests.
- Human escalation paths must be first-class, not bolted on after an incident.
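The golden-trace idea in the takeaways can be sketched as a replay check that runs in CI. This is a minimal sketch with assumed details: the function `replay_golden_traces`, the trace fields `id`, `input`, and `expected`, and the exact-match comparison are all hypothetical, and a real harness would likely use a tolerant comparison or a rubric instead of string equality.

```python
from typing import Callable, Dict, List


def replay_golden_traces(
    agent: Callable[[str], str],
    traces: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Return the traces whose recorded output the agent no longer reproduces."""
    failures = []
    for trace in traces:
        actual = agent(trace["input"])
        # Exact match is an assumption made to keep the sketch short.
        if actual != trace["expected"]:
            failures.append(
                {"id": trace["id"], "expected": trace["expected"], "actual": actual}
            )
    return failures


if __name__ == "__main__":
    # Stand-in agent and hand-written traces, for illustration only.
    golden = [
        {"id": "t1", "input": "hello", "expected": "HELLO"},
        {"id": "t2", "input": "world", "expected": "WORLD"},
    ]
    regressions = replay_golden_traces(str.upper, golden)
    # A CI job would fail the build when this list is non-empty.
    assert regressions == []
```

Returning the list of mismatches, rather than a boolean, leaves the CI job free to print each regression and to hand ambiguous cases to a human reviewer.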
Abstract
A practitioner's framework for building resilient AI agent systems, covering six documented failure modes and a three-layer resilience model for production deployments. It is based on real-world observations across HRTech and HealthTech platforms.