40% ↓ incidents
Production AI agent eval harness: 40% fewer production incidents
Chief Technology Officer • Noble House Consulting • Jun 2025 – Mar 2026 • Published 12 September 2025
Designed a three-layer resilience model and automated eval suite for AI agents handling recruiter workflows, cutting live incidents by 40% and rollbacks from weekly to monthly.
Technology Stack
Python, pytest, LangChain, Grafana, PostgreSQL, Azure
Key Outcomes
- 40% reduction in production agent incidents within two quarters
- Weekly red-team and drift tests integrated into CI/CD
- Documented 6 failure modes with playbooks (published as Zenodo white paper)
Outcome: 40% fewer production incidents after eval-gated releases.
Context
Noble House ATS agents handled scheduling, screening nudges, and funnel updates. As volume grew, edge cases surfaced: hallucinated dates, wrong candidate merges, and silent tool failures.
Problem
- No systematic evals — fixes were reactive.
- Agents shared tools without circuit breakers.
- On-call fatigue from rollback-heavy Fridays.
Three-layer resilience model
- Input guardrails — schema validation, PII scrubbing, intent classification.
- Runtime checks — tool timeouts, retry budgets, human escalation triggers.
- Post-hoc evals — nightly replay of production traces against golden sets.
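To make the runtime-checks layer concrete, here is a minimal sketch of a tool wrapper that enforces a timeout, draws from a shared retry budget, and escalates to a human once the budget runs out. The names `RetryBudget`, `EscalateToHuman`, and `call_tool` are hypothetical and chosen for illustration; the production implementation may well differ.

```python
import concurrent.futures
from dataclasses import dataclass


class EscalateToHuman(Exception):
    """Hypothetical signal that a human should take over the task."""


@dataclass
class RetryBudget:
    remaining: int  # retries shared by every tool call in one agent run

    def spend(self) -> bool:
        """Consume one retry; return False when the budget is exhausted."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def call_tool(tool, args, budget, timeout_s=2.0):
    """Run tool(**args) with a timeout; retry while the budget allows."""
    while True:
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(tool, **args)
        try:
            return future.result(timeout=timeout_s)
        except Exception as exc:  # tool errors and timeouts alike
            if not budget.spend():
                raise EscalateToHuman(
                    f"{tool.__name__} failed: {exc!r}"
                ) from exc
        finally:
            # Do not block on a hung tool; the worker thread is abandoned.
            pool.shutdown(wait=False)
```

Sharing one budget across a run, rather than giving each tool its own retry count, keeps a single misbehaving tool from multiplying load on everything downstream. That is one plausible reading of "retry budgets" here, not a confirmed detail of the project.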
What we built
- pytest + LangChain eval runners in CI.
- Drift dashboard (accuracy, latency, cost/run).
- Playbooks for six documented failure modes (see research page).
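As a rough illustration of an eval-gated release, the sketch below replays a small golden set against an agent and fails the test if the pass rate drops below a threshold. Everything in it is assumed for the example: the `GoldenCase` format, the sample inputs, the `stub_agent` stand-in (a real runner would call the LangChain agent instead), and the 0.95 threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    trace_input: str      # recorded input (hypothetical format)
    expected_action: str  # action a reviewer marked as correct


# Made-up sample cases, not data from the project.
GOLDEN_SET = [
    GoldenCase("Move the interview to Tuesday 10:00", "reschedule"),
    GoldenCase("Remind the candidate to finish the screening form", "nudge"),
    GoldenCase("Candidate accepted the offer", "update_funnel"),
]

PASS_THRESHOLD = 0.95  # assumed release gate


def stub_agent(text: str) -> str:
    """Stand-in for the real agent; keyword routing for illustration only."""
    lowered = text.lower()
    if "interview" in lowered:
        return "reschedule"
    if "remind" in lowered:
        return "nudge"
    return "update_funnel"


def pass_rate(agent, cases) -> float:
    """Fraction of golden cases where the agent picks the expected action."""
    hits = sum(agent(c.trace_input) == c.expected_action for c in cases)
    return hits / len(cases)


def test_golden_set_gate():
    # pytest collects this by name; a failure blocks the release in CI.
    assert pass_rate(stub_agent, GOLDEN_SET) >= PASS_THRESHOLD
```

Running `pytest` on a file containing this code would collect `test_golden_set_gate`; the same `pass_rate` value could also be logged per run to feed an accuracy panel on a drift dashboard.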
Results
Production incidents fell 40% within two quarters, and rollbacks dropped from weekly to monthly. The six failure modes and their playbooks were published as a Zenodo white paper for the broader community.
CTA
Need production-grade agent evals? Book a 30-min call.