AI Evals, Without the Jargon: How We’re Keeping Our Agents Honest at Our Agency

When we started rolling out voice and chat agents at Hillflare, it felt like opening a thousand tabs at once. Every new client, every script, every accent, every “quick tweak” to a prompt multiplied the ways things could go right… or sideways. Reading random transcripts and saying “looks good” wasn’t going to scale. We needed a way to prove our agents behave the way we expect—today, tomorrow, and after the next model/prompt/tooling change.
That’s where evals come in. Here’s the simple version and exactly how we’re using them at Hillflare.
What’s an “eval,” in plain language?
An eval is a repeatable check that measures whether an AI system behaves the way your team expects. Think of it like a regression test + report card for your agent:
We encode expectations (what “good” looks like) into datasets, prompts, rubrics, or automated tests.
We rerun those checks whenever anything changes—model version, prompt, RAG settings, tools, guardrails.
Instead of “I skimmed a few calls and vibes are positive,” we get: “Pass rate is 92% on 250 representative cases; tone/compliance stayed flat; accuracy improved +7% week-over-week.”
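To make “repeatable check” concrete, here’s a minimal harness sketch. Everything in it is hypothetical: run_agent stands in for whatever produces your agent’s reply, and the two cases are toys, not our golden set.

```python
# Minimal eval harness sketch (all names and cases are hypothetical).
# Each case pairs an input with a check that encodes what "good" looks like.

def run_agent(user_message: str) -> str:
    # Stand-in for the real agent call (model + prompt + tools).
    return "A consultation costs $1,299 + IVA. Results may vary by patient."

CASES = [
    {"id": "pricing-basic",
     "input": "How much is a consultation?",
     "check": lambda reply: "$1,299" in reply},
    {"id": "no-overpromising",
     "input": "Can you guarantee results?",
     "check": lambda reply: "may vary" in reply.lower()},
]

def run_eval(cases):
    results = []
    for case in cases:
        reply = run_agent(case["input"])
        results.append({"id": case["id"], "passed": case["check"](reply)})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

if __name__ == "__main__":
    rate, results = run_eval(CASES)
    print(f"Pass rate: {rate:.0%} on {len(results)} cases")
    for r in results:
        print("  PASS" if r["passed"] else "  FAIL", r["id"])
```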
Why this matters (right now)
We ship fast. Prompts and configs change weekly (sometimes daily). Evals catch regressions before customers do.
Leaders want proof, not vibes. Scored examples and trend charts move conversations with PMs/CFOs from subjective to objective.
Safety & compliance expectations are rising. For regulated or sensitive use cases, a well-instrumented eval story is table stakes.
What a useful eval looks like (The Hillflare Way)
A representative dataset. Ours comes from real life: Spanish/English calls, WhatsApp threads, poor connections, regional slang, edge cases like “I’m allergic to X” or “Can I pay in installments?” If your dataset doesn’t look like production, your scores are fantasy.
Clear success criteria. We define “good” across dimensions like Accuracy, Tone, Compliance/Policy, Task Completion, and Escalation Hygiene (did the agent hand off smoothly?). Ambiguity in the rubric = noisy results.
Scoring mechanics that mirror reality
Deterministic checks for things like required disclaimers, price formats, and phone number validation (a quick sketch follows this list).
Human reviewers to anchor what “good” actually is.
LLM judges (grading prompts) only after we calibrate them to agree with trusted human scores.
Feedback loops. We log every run, push to a simple dashboard, and review low-scoring traces in weekly “tape review.” When customer behavior shifts (new promos, new scripts), we refresh the dataset and update the rubric.
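For the deterministic layer above, plain string and regex checks carry a lot of weight. A sketch, with an invented disclaimer and formats standing in for your real policy:

```python
import re

# Hypothetical policy values; swap in your real disclaimer and formats.
REQUIRED_DISCLAIMER = "Los precios pueden variar según la evaluación médica."
PRICE_PATTERN = re.compile(r"\$\d{1,3}(,\d{3})*(\.\d{2})?")   # e.g. $1,299 or $1,299.00
PHONE_PATTERN = re.compile(r"\+52\s?\d{10}$")                 # e.g. +52 5512345678

def deterministic_checks(reply: str, quoted_phone: str) -> dict:
    return {
        "has_disclaimer": REQUIRED_DISCLAIMER in reply,
        "price_format_ok": bool(PRICE_PATTERN.search(reply)),
        "phone_normalized": bool(PHONE_PATTERN.match(quoted_phone)),
    }

checks = deterministic_checks(
    reply="La consulta cuesta $1,299. Los precios pueden variar según la evaluación médica.",
    quoted_phone="+52 5512345678",
)
print(checks)  # {'has_disclaimer': True, 'price_format_ok': True, 'phone_normalized': True}
```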
How evals show up at each stage
1) Research & Design
Early on, we sample 10–20 transcripts a week. This surfaces brittle prompts (“agent panics when user code-switches”), missing context (“no pricing for the new package”), and unexpected questions. These learnings go straight into our product notes and onboarding for reviewers.
2) Development & Tuning
As prompts/models/toolchains evolve, we run evals on a golden set. Passing the suite becomes a gate to merge or deploy. We A/B candidate prompts against the same set to pick winners on data, not gut.
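A deploy gate on the golden set can be as boring as comparing two pass rates. A sketch with toy results; in practice each entry comes from replaying one golden case through the candidate configuration:

```python
# Deploy-gate sketch: a candidate prompt must not regress on the golden set.
# The pass/fail lists below are toy data, not real eval output.

def pass_rate(results: list[bool]) -> float:
    return sum(results) / len(results)

def gate(baseline: list[bool], candidate: list[bool], max_regression: float = 0.0) -> bool:
    b, c = pass_rate(baseline), pass_rate(candidate)
    print(f"baseline {b:.1%} -> candidate {c:.1%}")
    return c >= b - max_regression

baseline_results  = [True] * 230 + [False] * 20   # 92.0% on 250 golden cases
candidate_results = [True] * 238 + [False] * 12   # 95.2%

if not gate(baseline_results, candidate_results):
    raise SystemExit("Candidate prompt regressed on the golden set; blocking deploy.")
```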
3) Launch & Guardrails
Before a release, we define thresholds (e.g., ≥95% accuracy, ≤X% improper refusals, no regressions vs baseline). After launch, online scoring and user feedback stream back into the dashboard. If a metric dips, alerts fire and we investigate.
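Here’s roughly what that looks like as code. The metric names, numbers, and send_alert are illustrative stand-ins, not our production setup:

```python
# Release-threshold sketch with invented limits.
THRESHOLDS = {
    "accuracy": 0.95,           # minimum acceptable
    "improper_refusals": 0.02,  # maximum acceptable
}

def check_release(metrics: dict) -> list[str]:
    breaches = []
    if metrics["accuracy"] < THRESHOLDS["accuracy"]:
        breaches.append(f"accuracy {metrics['accuracy']:.1%} is below {THRESHOLDS['accuracy']:.0%}")
    if metrics["improper_refusals"] > THRESHOLDS["improper_refusals"]:
        breaches.append(f"improper refusals {metrics['improper_refusals']:.1%} exceed the limit")
    return breaches

def send_alert(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for Slack/PagerDuty/etc.

for breach in check_release({"accuracy": 0.93, "improper_refusals": 0.01}):
    send_alert(breach)
```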
4) Continuous Improvement
We version datasets, rubrics, and scoring scripts in Git. Low-scoring production traces feed an annotation queue to keep the golden set fresh. Over time, this flywheel keeps agents aligned with the business.
Build your first eval (our quick-start checklist)
Sample real conversations (10–50). Mark where the agent nailed it, missed context, or broke policy.
Draft a lightweight rubric. Two columns: Success criteria and What “fail” looks like, plus a few examples.
Do a manual scoring pass. Have your PM/SME rate each item; capture notes next to scores.
Automate the obvious. Regex/structured checks for disclaimers, formats, tool usage, escalation phrases.
Add an LLM judge. Encode the rubric in a grading prompt and tweak until the model agrees with humans on a small calibration set (see the agreement check after this checklist).
Version everything. Put the dataset, rubric, and scoring scripts in source control. Rerun on every meaningful change.
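For the LLM-judge step, “agrees with humans” is something you can measure. A sketch with made-up scores on a ten-item calibration set:

```python
# Judge-calibration sketch: compare LLM-judge scores to trusted human scores.
# Both score lists are invented for illustration; scores are 0-3 per the rubric.

human_scores = [3, 2, 3, 1, 0, 2, 3, 2, 1, 3]
judge_scores = [3, 2, 2, 1, 0, 2, 3, 3, 1, 3]

exact_agreement = sum(h == j for h, j in zip(human_scores, judge_scores)) / len(human_scores)
within_one      = sum(abs(h - j) <= 1 for h, j in zip(human_scores, judge_scores)) / len(human_scores)

print(f"Exact agreement: {exact_agreement:.0%}")   # 80%
print(f"Within one point: {within_one:.0%}")       # 100%

# Only trust the judge once exact agreement clears whatever bar you set per
# dimension; below that, rewrite the grading prompt and recalibrate.
```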
Minimal rubric (copy/paste this table into Notion)
| dimension | weight | pass_criteria | fail_signals | examples_pass | examples_fail |
| --- | --- | --- | --- | --- | --- |
| Accuracy | 0.35 | All facts/steps/prices correct and complete | Wrong price, missing steps, contradicts policy | Quotes $1,299 + IVA; confirms 3 steps | Says $999; skips a required step |
| Tone | 0.15 | Warm, concise, on-brand; code-switching handled well | Overpromising, robotic, ignores user tone | “Con gusto te apoyo. Puedo agendarte hoy a las 4.” | Overly salesy or curt |
| Compliance/Policy | 0.25 | Required disclaimers; no restricted claims | Missing disclaimer; restricted statements | Reads exact disclaimer when required | Suggests outcomes not allowed |
| Task completion | 0.15 | User achieves goal (book/qualify/route/answer) | Dead-end, partial, or wrong outcome | Captures all mandatory fields and confirms | Fails to confirm key details |
| Escalation hygiene | 0.10 | Hands off at right time with proper summary/context | Late/early escalation; missing context | “Transfiero con (Agente), resumen: …” | Escalates without context |
Scoring: 0–3 per dimension. Weight by business risk, and make sure the weights sum to 1.0.
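Rolling the per-dimension scores into one number is simple arithmetic. A sketch using the weights from the table above; the scores themselves are invented:

```python
# Weighted rubric score: per-dimension 0-3 scores, weights from the table (sum to 1.0).

WEIGHTS = {
    "accuracy": 0.35,
    "tone": 0.15,
    "compliance": 0.25,
    "task_completion": 0.15,
    "escalation_hygiene": 0.10,
}

scores = {  # 0-3 per dimension, e.g. from human or judge grading
    "accuracy": 3,
    "tone": 2,
    "compliance": 3,
    "task_completion": 2,
    "escalation_hygiene": 3,
}

weighted = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)  # out of 3.0
print(f"Weighted score: {weighted:.2f} / 3.00 ({weighted / 3:.0%})")  # 2.70 / 3.00 (90%)
```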
Judge prompt (skeleton)
You are grading an agent response for {dimension}.
Rubric: {rubric_text}
Conversation: {transcript}
Agent reply: {reply}
Return strict JSON:
{"score": 0-3, "justification": "...", "fail_signals": ["..."]}
A tiny, concrete example (from our world)
Use case: Voice agent qualifying a lead for a clinic.
Dataset: 60 call snippets covering accents, background noise, and code-switching.
Success criteria:
Gets all mandatory fields (symptom, contraindications, preferred time).
Uses brand tone and reads the exact disclaimer.
Escalates to human when the user mentions a red-flag condition.
Checks:
Deterministic: disclaimer exact-match; date/time format; phone normalization (a small sketch follows this example).
Human: “Would you trust this agent with your own booking?” (Likert)
LLM judge: rubric-encoded grading with examples for pass/fail.
Outcome → business impact: After improving Task completion from 2.2 → 2.8/3, booking rate rose 6% and AHT fell 11% the following week.
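For the curious, the deterministic layer in this example is nothing exotic. A sketch with an invented red-flag list and a toy phone normalizer; the real lists come from the clinic’s protocol:

```python
import re

# Invented values for illustration; the real red-flag terms come from the
# clinic's triage protocol, and phone rules depend on the market.
RED_FLAG_TERMS = ["chest pain", "dolor en el pecho", "bleeding", "sangrado"]

def normalize_phone(raw: str, country_code: str = "+52") -> str | None:
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:                         # local 10-digit number
        return f"{country_code}{digits}"
    if len(digits) == 12 and digits.startswith("52"):
        return f"+{digits}"
    return None                                   # can't normalize -> fail the check

def needs_escalation(transcript: str) -> bool:
    lowered = transcript.lower()
    return any(term in lowered for term in RED_FLAG_TERMS)

print(normalize_phone("55 1234 5678"))                         # +525512345678
print(needs_escalation("Tengo dolor en el pecho desde ayer"))  # True
```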
Tie evals to business metrics
Booking/Conversion rate (for sales/appointment flows)
Average handle time (AHT) and deflection rate (for support)
CSAT/NPS vs human baseline
Cost-to-serve (model runtime + human review time)
When a KPI dips, eval traces tell you whether model behavior regressed or external context shifted (new pricing, new script, seasonal demand).
Common pitfalls (we’ve hit these)
One-metric myopia. “Accuracy 95%” hides tone/compliance issues. Score multiple dimensions.
Stale golden sets. Old examples ignore new workflows and policies. Refresh often.
Judge drift. LLM judges wander. Recalibrate against human reviews periodically.
Treating evals as a project. This is a capability, not a one-off checklist.
The road ahead
As AI moves deeper into real customer journeys, evals become the operating system for responsible shipping. They give engineers confidence to move faster, help PMs defend roadmaps, and give executives assurance that innovation isn’t outrunning safety.
Start small. Keep it grounded in your user traces. Let the discipline compound. Your future releases (and your customers) will thank you.