Diwesh Saxena

Production AI agent eval harness: 40% fewer production incidents

Chief Technology Officer · Noble House Consulting · Jun 2025 – Mar 2026 · Published 12 September 2025

Designed a three-layer resilience model and an automated eval suite for AI agents that handle recruiter workflows, reducing live incidents and rollback frequency.

Technology Stack

Python · pytest · LangChain · Grafana · PostgreSQL · Azure

Key Outcomes

  • 40% reduction in production agent incidents within two quarters
  • Weekly red-team and drift tests integrated into CI/CD
  • Documented 6 failure modes with playbooks (published as Zenodo white paper)
Outcome: 40% fewer production incidents after eval-gated releases.

Context

Noble House's applicant tracking system (ATS) agents handled scheduling, screening nudges, and funnel updates. As volume grew, edge cases surfaced: hallucinated dates, incorrectly merged candidate records, and silent tool failures.

Problem

  • No systematic evals; fixes were reactive.
  • Agents shared tools without circuit breakers.
  • On-call fatigue from rollback-heavy Fridays.
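
For readers unfamiliar with the term, a circuit breaker stops calling a shared tool after repeated failures, so one broken dependency does not drag down every agent that uses it. The sketch below is a generic illustration in Python, not the Noble House implementation; the class name, thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitOpen(Exception):
    """Raised when the breaker is open and the tool call is skipped."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock          # injectable so tests can fake time
        self._failures = 0
        self._opened_at = None       # None means the breaker is closed

    def call(self, tool, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpen("tool temporarily disabled")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure now reopens the breaker immediately.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = tool(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0           # any success resets the count
        return result
```

Sharing one breaker instance per tool across agents would mean that a failure seen by one agent protects the others, which is the gap the second bullet describes.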

Three-layer resilience model

  1. Input guardrails — schema validation, PII scrubbing, intent classification.
  2. Runtime checks — tool timeouts, retry budgets, human escalation triggers.
  3. Post-hoc evals — nightly replay of production traces against golden sets.
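
The first two layers can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production code: the schema fields (`candidate_id`, `intent`) and the names `validate_request`, `RetryBudget`, `call_with_budget`, and `EscalateToHuman` are hypothetical. The deadline is checked between attempts only, so it does not interrupt a tool call that hangs; a real per-call timeout would need the tool client's own timeout setting.

```python
import time
from dataclasses import dataclass

# Assumed request schema for illustration only.
REQUIRED_FIELDS = {"candidate_id": str, "intent": str}


def validate_request(payload):
    """Layer 1: reject payloads with missing or mistyped fields."""
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(name), expected_type):
            raise ValueError(f"invalid or missing field: {name}")
    return payload


class EscalateToHuman(Exception):
    """Layer 2: raised when the retry budget is exhausted."""


@dataclass(frozen=True)
class RetryBudget:
    max_attempts: int = 3
    deadline_s: float = 5.0  # total wall-clock budget across attempts


def call_with_budget(tool, payload, budget=RetryBudget()):
    """Call tool(payload) until it succeeds or the budget runs out."""
    start = time.monotonic()
    last_error = None
    for _ in range(budget.max_attempts):
        # Checked before each attempt; does not preempt a running call.
        if time.monotonic() - start > budget.deadline_s:
            break
        try:
            return tool(validate_request(payload))
        except ValueError:
            raise  # bad input will not improve on retry
        except Exception as exc:
            last_error = exc
    raise EscalateToHuman(f"tool failed within budget: {last_error!r}")
```

Treating `ValueError` as non-retryable is a design choice for the sketch: malformed input fails fast, while transient tool errors consume the retry budget before a human is pulled in.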

What we built

  • pytest + LangChain eval runners in CI.
  • Drift dashboard (accuracy, latency, cost/run).
  • Playbooks for six documented failure modes (see research page).
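
An eval-gated release can be illustrated with a small golden-set replay. The sketch below is an assumption-laden example rather than the actual runner: `GoldenCase`, `pass_rate`, `release_gate`, the stub agent, and the 0.95 threshold are all hypothetical, and it uses plain functions with no LangChain calls. The `test_` function follows the naming convention pytest collects, so it could run as a CI step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    trace_id: str
    agent_input: dict
    expected: dict  # fields the agent output must match exactly


def case_passes(agent, case):
    """Run the agent once and compare only the expected fields."""
    output = agent(case.agent_input)
    return all(output.get(k) == v for k, v in case.expected.items())


def pass_rate(agent, cases):
    """Fraction of golden cases the agent reproduces."""
    if not cases:
        raise ValueError("golden set is empty")
    return sum(case_passes(agent, c) for c in cases) / len(cases)


def release_gate(agent, cases, threshold=0.95):
    """True if the agent is allowed to ship (threshold is an assumption)."""
    return pass_rate(agent, cases) >= threshold


# Hypothetical golden set built from replayed traces.
GOLDEN_SET = [
    GoldenCase("t1", {"intent": "schedule", "date": "2025-07-01"},
               {"action": "schedule", "date": "2025-07-01"}),
    GoldenCase("t2", {"intent": "nudge", "date": "2025-07-02"},
               {"action": "nudge"}),
]


def stub_agent(agent_input):
    """Stand-in for the real agent; echoes intent and date."""
    return {"action": agent_input["intent"], "date": agent_input["date"]}


def test_release_gate():
    assert release_gate(stub_agent, GOLDEN_SET)
```

Matching only the fields listed in `expected` keeps golden cases robust to harmless additions in agent output, at the cost of missing regressions in fields the case does not name.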

Results

Production agent incidents fell 40% within two quarters, and rollbacks dropped from weekly to monthly. The six failure modes and their playbooks are published as a white paper on Zenodo for the broader community.

Need production-grade agent evals? Book a 30-min call.