AI Evals, Without the Jargon: How We’re Keeping Our Agents Honest at Our Agency

When we started rolling out voice and chat agents at Hillflare, it felt like opening a thousand tabs at once. Every new client, every script, every accent, every “quick tweak” to a prompt multiplied the ways things could go right… or sideways. Reading random transcripts and saying “looks good” wasn’t going to scale. We needed a way to prove our agents behave the way we expect—today, tomorrow, and after the next model/prompt/tooling change.
That’s where evals come in. Here’s the simple version and exactly how we’re using them at Hillflare.
What’s an “eval,” in plain language?
An eval is a repeatable check that measures whether an AI system behaves the way your team expects. Think of it like a regression test + report card for your agent:
We encode expectations (what “good” looks like) into datasets, prompts, rubrics, or automated tests.
We rerun those checks whenever anything changes—model version, prompt, RAG settings, tools, guardrails.
Instead of “I skimmed a few calls and vibes are positive,” we get: “Pass rate is 92% on 250 representative cases; tone/compliance stayed flat; accuracy improved +7% week-over-week.”
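To make “repeatable check” concrete, here’s a minimal harness sketch. Everything in it is hypothetical: run_agent stands in for whatever produces your agent’s reply, and the two cases are toys, not our golden set.

```python
# Minimal eval harness sketch (all names and cases are hypothetical).
# Each case pairs an input with a check that encodes what "good" looks like.

def run_agent(user_message: str) -> str:
    # Stand-in for the real agent call (model + prompt + tools).
    return "A consultation costs $1,299 + IVA. Results may vary by patient."

CASES = [
    {"id": "pricing-basic",
     "input": "How much is a consultation?",
     "check": lambda reply: "$1,299" in reply},
    {"id": "no-overpromising",
     "input": "Can you guarantee results?",
     "check": lambda reply: "may vary" in reply.lower()},
]

def run_eval(cases):
    results = []
    for case in cases:
        reply = run_agent(case["input"])
        results.append({"id": case["id"], "passed": case["check"](reply)})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

if __name__ == "__main__":
    rate, results = run_eval(CASES)
    print(f"Pass rate: {rate:.0%} on {len(results)} cases")
    for r in results:
        print("  PASS" if r["passed"] else "  FAIL", r["id"])
```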
Why this matters (right now)
We ship fast. Prompts and configs change weekly (sometimes daily). Evals catch regressions before customers do.
Leaders want proof, not vibes. Scored examples and trend charts move conversations with PMs/CFOs from subjective to objective.
Safety & compliance expectations are rising. For regulated or sensitive use cases, a well-instrumented eval story is table stakes.
What a useful eval looks like (The Hillflare Way)
A representative dataset. Ours comes from real life: Spanish/English calls, WhatsApp threads, poor connections, regional slang, edge cases like “I’m allergic to X” or “Can I pay in installments?” If your dataset doesn’t look like production, your scores are fantasy.
Clear success criteria. We define “good” across dimensions like Accuracy, Tone, Compliance/Policy, Task Completion, and Escalation Hygiene (did the agent hand off smoothly?). Ambiguity in the rubric = noisy results.
Scoring mechanics that mirror reality
Deterministic checks for things like required disclaimers, price formats, and phone number validation (a quick sketch follows this list).
Human reviewers to anchor what “good” actually is.
LLM judges (grading prompts) only after we calibrate them to agree with trusted human scores.
Feedback loops. We log every run, push to a simple dashboard, and review low-scoring traces in weekly “tape review.” When customer behavior shifts (new promos, new scripts), we refresh the dataset and update the rubric.
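For the deterministic layer above, plain string and regex checks carry a lot of weight. A sketch, with an invented disclaimer and formats standing in for your real policy:

```python
import re

# Hypothetical policy values; swap in your real disclaimer and formats.
REQUIRED_DISCLAIMER = "Los precios pueden variar según la evaluación médica."
PRICE_PATTERN = re.compile(r"\$\d{1,3}(,\d{3})*(\.\d{2})?")   # e.g. $1,299 or $1,299.00
PHONE_PATTERN = re.compile(r"\+52\s?\d{10}$")                 # e.g. +52 5512345678

def deterministic_checks(reply: str, quoted_phone: str) -> dict:
    return {
        "has_disclaimer": REQUIRED_DISCLAIMER in reply,
        "price_format_ok": bool(PRICE_PATTERN.search(reply)),
        "phone_normalized": bool(PHONE_PATTERN.match(quoted_phone)),
    }

checks = deterministic_checks(
    reply="La consulta cuesta $1,299. Los precios pueden variar según la evaluación médica.",
    quoted_phone="+52 5512345678",
)
print(checks)  # {'has_disclaimer': True, 'price_format_ok': True, 'phone_normalized': True}
```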
How evals show up at each stage
1) Research & Design
Early on, we sample 10–20 transcripts a week. This surfaces brittle prompts (“agent panics when user code-switches”), missing context (“no pricing for the new package”), and unexpected questions. These learnings go straight into our product notes and onboarding for reviewers.
2) Development & Tuning
As prompts/models/toolchains evolve, we run evals on a golden set. Passing the suite becomes a gate to merge or deploy. We A/B candidate prompts against the same set to pick winners on data, not gut.
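A deploy gate on the golden set can be as boring as comparing two pass rates. A sketch with toy results; in practice each entry comes from replaying one golden case through the candidate configuration:

```python
# Deploy-gate sketch: a candidate prompt must not regress on the golden set.
# The pass/fail lists below are toy data, not real eval output.

def pass_rate(results: list[bool]) -> float:
    return sum(results) / len(results)

def gate(baseline: list[bool], candidate: list[bool], max_regression: float = 0.0) -> bool:
    b, c = pass_rate(baseline), pass_rate(candidate)
    print(f"baseline {b:.1%} -> candidate {c:.1%}")
    return c >= b - max_regression

baseline_results  = [True] * 230 + [False] * 20   # 92.0% on 250 golden cases
candidate_results = [True] * 238 + [False] * 12   # 95.2%

if not gate(baseline_results, candidate_results):
    raise SystemExit("Candidate prompt regressed on the golden set; blocking deploy.")
```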
3) Launch & Guardrails
Before a release, we define thresholds (e.g., ≥95% accuracy, ≤X% improper refusals, no regressions vs baseline). After launch, online scoring and user feedback stream back into the dashboard. If a metric dips, alerts fire and we investigate.
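Here’s roughly what that looks like as code. The metric names, numbers, and send_alert are illustrative stand-ins, not our production setup:

```python
# Release-threshold sketch with invented limits.
THRESHOLDS = {
    "accuracy": 0.95,           # minimum acceptable
    "improper_refusals": 0.02,  # maximum acceptable
}

def check_release(metrics: dict) -> list[str]:
    breaches = []
    if metrics["accuracy"] < THRESHOLDS["accuracy"]:
        breaches.append(f"accuracy {metrics['accuracy']:.1%} is below {THRESHOLDS['accuracy']:.0%}")
    if metrics["improper_refusals"] > THRESHOLDS["improper_refusals"]:
        breaches.append(f"improper refusals {metrics['improper_refusals']:.1%} exceed the limit")
    return breaches

def send_alert(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for Slack/PagerDuty/etc.

for breach in check_release({"accuracy": 0.93, "improper_refusals": 0.01}):
    send_alert(breach)
```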
4) Continuous Improvement
We version datasets, rubrics, and scoring scripts in Git. Low-scoring production traces feed an annotation queue to keep the golden set fresh. Over time, this flywheel keeps agents aligned with the business.
Build your first eval (our quick-start checklist)
Sample real conversations (10–50). Mark where the agent nailed it, missed context, or broke policy.
Draft a lightweight rubric. Two columns: Success criteria and What “fail” looks like, plus a few examples.
Do a manual scoring pass. Have your PM/SME rate each item; capture notes next to scores.
Automate the obvious. Regex/structured checks for disclaimers, formats, tool usage, escalation phrases.
Add an LLM judge. Encode the rubric in a grading prompt and tweak until the model agrees with humans on a small calibration set (see the agreement check after this checklist).
Version everything. Put the dataset, rubric, and scoring scripts in source control. Rerun on every meaningful change.
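For the LLM-judge step, “agrees with humans” is something you can measure. A sketch with made-up scores on a ten-item calibration set:

```python
# Judge-calibration sketch: compare LLM-judge scores to trusted human scores.
# Both score lists are invented for illustration; scores are 0-3 per the rubric.

human_scores = [3, 2, 3, 1, 0, 2, 3, 2, 1, 3]
judge_scores = [3, 2, 2, 1, 0, 2, 3, 3, 1, 3]

exact_agreement = sum(h == j for h, j in zip(human_scores, judge_scores)) / len(human_scores)
within_one      = sum(abs(h - j) <= 1 for h, j in zip(human_scores, judge_scores)) / len(human_scores)

print(f"Exact agreement: {exact_agreement:.0%}")   # 80%
print(f"Within one point: {within_one:.0%}")       # 100%

# Only trust the judge once exact agreement clears whatever bar you set per
# dimension; below that, rewrite the grading prompt and recalibrate.
```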
Minimal rubric (copy/paste this table into Notion)
| dimension | weight | pass_criteria | fail_signals | examples_pass | examples_fail |
| --- | --- | --- | --- | --- | --- |
| Accuracy | 0.35 | All facts/steps/prices correct and complete | Wrong price, missing steps, contradicts policy | Quotes $1,299 + IVA; confirms 3 steps | Says $999; skips a required step |
| Tone | 0.15 | Warm, concise, on-brand; code-switching handled well | Overpromising, robotic, ignores user tone | “Con gusto te apoyo. Puedo agendarte hoy a las 4.” | Overly salesy or curt |
| Compliance/Policy | 0.25 | Required disclaimers; no restricted claims | Missing disclaimer; restricted statements | Reads exact disclaimer when required | Suggests outcomes not allowed |
| Task completion | 0.15 | User achieves goal (book/qualify/route/answer) | Dead-end, partial, or wrong outcome | Captures all mandatory fields and confirms | Fails to confirm key details |
| Escalation hygiene | 0.10 | Hands off at right time with proper summary/context | Late/early escalation; missing context | “Transfiero con (Agente), resumen: …” | Escalates without context |
Scoring: 0–3 per dimension. Weight by business risk, and make sure the weights sum to 1.0.
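Rolling the per-dimension scores into one number is simple arithmetic. A sketch using the weights from the table above; the scores themselves are invented:

```python
# Weighted rubric score: per-dimension 0-3 scores, weights from the table (sum to 1.0).

WEIGHTS = {
    "accuracy": 0.35,
    "tone": 0.15,
    "compliance": 0.25,
    "task_completion": 0.15,
    "escalation_hygiene": 0.10,
}

scores = {  # 0-3 per dimension, e.g. from human or judge grading
    "accuracy": 3,
    "tone": 2,
    "compliance": 3,
    "task_completion": 2,
    "escalation_hygiene": 3,
}

weighted = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)  # out of 3.0
print(f"Weighted score: {weighted:.2f} / 3.00 ({weighted / 3:.0%})")  # 2.70 / 3.00 (90%)
```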
Judge prompt (skeleton)
You are grading an agent response for {dimension}.
Rubric: {rubric_text}
Conversation: {transcript}
Agent reply: {reply}
Return strict JSON:
{"score": 0-3, "justification": "...", "fail_signals": ["..."]}
A tiny, concrete example (from our world)
Use case: Voice agent qualifying a lead for a clinic.
Dataset: 60 call snippets covering accents, background noise, and code-switching.
Success criteria:
Gets all mandatory fields (symptom, contraindications, preferred time).
Uses brand tone and reads the exact disclaimer.
Escalates to human when the user mentions a red-flag condition.
Checks:
Deterministic: disclaimer exact-match; date/time format; phone normalization (a small sketch follows this example).
Human: “Would you trust this agent with your own booking?” (Likert)
LLM judge: rubric-encoded grading with examples for pass/fail.
Outcome → business impact: After improving Task completion from 2.2 → 2.8/3, booking rate rose 6% and AHT fell 11% the following week.
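For the curious, the deterministic layer in this example is nothing exotic. A sketch with an invented red-flag list and a toy phone normalizer; the real lists come from the clinic’s protocol:

```python
import re

# Invented values for illustration; the real red-flag terms come from the
# clinic's triage protocol, and phone rules depend on the market.
RED_FLAG_TERMS = ["chest pain", "dolor en el pecho", "bleeding", "sangrado"]

def normalize_phone(raw: str, country_code: str = "+52") -> str | None:
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:                         # local 10-digit number
        return f"{country_code}{digits}"
    if len(digits) == 12 and digits.startswith("52"):
        return f"+{digits}"
    return None                                   # can't normalize -> fail the check

def needs_escalation(transcript: str) -> bool:
    lowered = transcript.lower()
    return any(term in lowered for term in RED_FLAG_TERMS)

print(normalize_phone("55 1234 5678"))                         # +525512345678
print(needs_escalation("Tengo dolor en el pecho desde ayer"))  # True
```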
Tie evals to business metrics
Booking/Conversion rate (for sales/appointment flows)
Average handle time (AHT) and deflection rate (for support)
CSAT/NPS vs human baseline
Cost-to-serve (model runtime + human review time)
When a KPI dips, eval traces tell you whether model behavior regressed or external context shifted (new pricing, new script, seasonal demand).
Common pitfalls (we’ve hit these)
One-metric myopia. “Accuracy 95%” hides tone/compliance issues. Score multiple dimensions.
Stale golden sets. Old examples ignore new workflows and policies. Refresh often.
Judge drift. LLM judges wander. Recalibrate against human reviews periodically.
Treating evals as a project. This is a capability, not a one-off checklist.
The road ahead
As AI moves deeper into real customer journeys, evals become the operating system for responsible shipping. They give engineers confidence to move faster, help PMs defend roadmaps, and give executives assurance that innovation isn’t outrunning safety.
Start small. Keep it grounded in your user traces. Let the discipline compound. Your future releases (and your customers) will thank you.