Production AI agent eval harness: 40% fewer production incidents

Chief Technology Officer • Noble House Consulting • Jun 2025 to Mar 2026 • Published 12 September 2025

Designed a three-layer resilience model and an automated eval suite for AI agents handling recruiter workflows, cutting both live incidents and rollback frequency.

Technology Stack

Python, pytest, LangChain, Grafana, PostgreSQL, Azure

Key Outcomes

• 40% reduction in production agent incidents within two quarters
• Weekly red-team and drift tests integrated into CI/CD
• Six failure modes documented with playbooks (published as a Zenodo white paper)

Outcome: 40% fewer production incidents after eval-gated releases.

Context

Noble House's ATS agents handled interview scheduling, screening nudges, and funnel updates. As volume grew, edge cases surfaced: hallucinated dates, wrong candidate merges, and silent tool failures.

Problem

• No systematic evals, so fixes were reactive.
• Agents shared tools without circuit breakers, so one failing tool could affect every agent that used it.
• On-call fatigue from rollback-heavy Fridays.

Three-layer resilience model

• Input guardrails: schema validation, PII scrubbing, intent classification.
• Runtime checks: tool timeouts, retry budgets, human escalation triggers.
• Post-hoc evals: nightly replay of production traces against golden sets.
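
To make the first two layers concrete, here is a minimal sketch, not the production code. Every name in it (validate_request, scrub_pii, RetryBudget, call_tool_with_budget, EscalateToHuman) is hypothetical, the PII scrubber covers only email addresses, and tool timeouts and intent classification are left out for brevity.

```python
import re
import time
from dataclasses import dataclass

# Assumption: email masking stands in for a fuller PII scrubber.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub_pii(text: str) -> str:
    """Input guardrail: mask email addresses before text reaches the agent."""
    return EMAIL_RE.sub("[EMAIL]", text)


def validate_request(payload: dict) -> dict:
    """Input guardrail: reject payloads with missing or mistyped fields."""
    required = {"candidate_id": str, "message": str}  # hypothetical schema
    for key, expected_type in required.items():
        if not isinstance(payload.get(key), expected_type):
            raise ValueError(f"invalid or missing field: {key}")
    return {**payload, "message": scrub_pii(payload["message"])}


class EscalateToHuman(Exception):
    """Runtime check: raised when the retry budget is exhausted."""


@dataclass
class RetryBudget:
    max_attempts: int = 3
    backoff_seconds: float = 0.5


def call_tool_with_budget(tool, args: dict, budget: RetryBudget):
    """Call a tool, retrying on failure until the budget runs out."""
    last_error = None
    for attempt in range(budget.max_attempts):
        try:
            return tool(**args)
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
            time.sleep(budget.backoff_seconds * (attempt + 1))
    # Budget spent: surface the failure instead of failing silently.
    raise EscalateToHuman(
        f"tool failed after {budget.max_attempts} attempts"
    ) from last_error
```

Raising a dedicated exception once the budget is spent is one way to turn a silent tool failure into an explicit handoff to a human.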

What we built

• pytest and LangChain eval runners that run in CI.
• Drift dashboard tracking accuracy, latency, and cost per run.
• Playbooks for six documented failure modes (see research page).
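
An eval-gated release of the kind described above could look roughly like the sketch below. It is illustrative only: GoldenCase, pass_rate, stub_agent, the two golden cases, and the 0.9 threshold are all assumptions, and a stub function stands in for the LangChain agent so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    trace_input: str  # input replayed from a recorded trace
    expected: str     # the output a reviewer marked as correct


def pass_rate(agent: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Replay every golden case through the agent and return the pass fraction."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(1 for c in cases if agent(c.trace_input).strip() == c.expected)
    return passed / len(cases)


# Hypothetical golden set; real cases would be loaded from stored traces.
GOLDEN_SET = [
    GoldenCase("schedule interview for c1 on 2025-07-01", "scheduled:c1:2025-07-01"),
    GoldenCase("move c2 to onsite", "stage:c2:onsite"),
]


def stub_agent(trace_input: str) -> str:
    """Stand-in for the real agent so the sketch is self-contained."""
    canned = {
        "schedule interview for c1 on 2025-07-01": "scheduled:c1:2025-07-01",
        "move c2 to onsite": "stage:c2:onsite",
    }
    return canned.get(trace_input, "unknown")


PASS_THRESHOLD = 0.9  # assumed gate; tune per workflow


def test_release_gate():
    """pytest collects this test; a failing assert fails the CI job."""
    assert pass_rate(stub_agent, GOLDEN_SET) >= PASS_THRESHOLD
```

Exact string matching is the simplest possible scorer. Free-text agent outputs would likely need a more tolerant comparison.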

Results

Production agent incidents fell 40% within two quarters, and rollbacks dropped from weekly to monthly. The failure-mode research is published on Zenodo for the broader community.

Need production-grade agent evals? Book a 30-min call.