
A practical guide to LLM evals


You tweaked a prompt. The vibes seem better. But are they actually better? Without evals, you’re flying blind. Let’s fix that by building evals for a customer support bot using Claude Sonnet 4.5.

What are evals? #

Evals (short for evaluations) are automated tests for LLM outputs. Unlike traditional unit tests where you check for exact matches, LLM evals need to handle the inherent variability in model responses. The same prompt can produce different wording each time, so you need evaluation strategies that focus on semantic correctness rather than string equality.
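
As a toy illustration (get_support_bot_reply is a hypothetical helper, and the expected string is made up), compare an exact-match assertion with a tolerant check:

response = get_support_bot_reply("Where is my order #12345?")  # hypothetical helper

# Brittle: fails whenever the model rephrases a perfectly good answer.
assert response == "Your order #12345 shipped yesterday and arrives Friday."

# Tolerant: checks the facts you care about, not the exact wording.
assert "12345" in response and "ship" in response.lower()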

Define what you’re testing #

Let’s say our customer support bot handles orders, refunds, and shipping questions. We need two types of checks.

Deterministic checks verify concrete properties. Does the response mention an order number when asked? Does it stay on topic instead of discussing unrelated subjects? These are easy to automate with keyword matching or regex.

Judgment-based checks assess fuzzier qualities. Is the tone appropriately empathetic? Is the refund policy explanation actually correct? These require either human review or using another LLM as a judge.
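
To make the deterministic kind concrete, here is a minimal regex check; the pattern assumes order numbers look like "#12345", which is my assumption rather than anything defined by the bot:

import re


def mentions_order_number(response: str) -> bool:
    # Assumed format: "#" followed by five digits, e.g. "#12345".
    return re.search(r"#\d{5}\b", response) is not None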

For a starting point, aim for 15-20 handwritten test cases. Pull them from real customer conversations, known edge cases (angry customers, ambiguous requests), and adversarial inputs (attempts to make the bot go off-script). This gives you enough coverage to catch regressions without spending days on test setup.

Part 1: Minimal eval loop in pure Python #

The simplest approach is a Python script that runs your prompts through the model and checks the outputs. No frameworks, no dependencies beyond the Anthropic SDK. Here’s a self-contained example:

part_1.py
import json
import os
from typing import TypedDict
import anthropic


class TestCaseRequired(TypedDict):
    input: str
    tags: list[str]


class TestCase(TestCaseRequired, total=False):
    expected_keywords: list[str]
    expected_absent: list[str]


class Result(TypedDict):
    input: str
    response: str
    passed: bool
    reason: str
    tags: list[str]


client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

SYSTEM_PROMPT = "You are a helpful customer support agent for an online store."

test_cases: list[TestCase] = [
    {
        "input": "Where is my order #12345?",
        "expected_keywords": ["order", "12345", "tracking"],
        "tags": ["orders"],
    },
    {
        "input": "I want a refund for my broken item",
        "expected_keywords": ["refund", "return", "sorry"],
        "tags": ["refunds"],
    },
    {
        "input": "How long does shipping take to California?",
        "expected_keywords": ["shipping", "days", "delivery"],
        "tags": ["shipping"],
    },
    {
        "input": "What's the capital of France?",
        "expected_keywords": [],
        "expected_absent": ["Paris"],  # Should stay on topic
        "tags": ["off-topic"],
    },
]


def get_response(user_input: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=256,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": user_input}],
    )
    return response.content[0].text


def check_keywords(response: str, case: TestCase) -> tuple[bool, str]:
    response_lower = response.lower()

    for kw in case.get("expected_keywords", []):
        if kw.lower() not in response_lower:
            return False, f"Missing keyword: {kw}"

    for kw in case.get("expected_absent", []):
        if kw.lower() in response_lower:
            return False, f"Should not contain: {kw}"

    return True, "All checks passed"


results: list[Result] = []
for case in test_cases:
    response = get_response(case["input"])
    passed, reason = check_keywords(response, case)
    result: Result = {
        "input": case["input"],
        "response": response,
        "passed": passed,
        "reason": reason,
        "tags": case["tags"],
    }
    results.append(result)

print(json.dumps(results, indent=2))

Before running this, make sure the ANTHROPIC_API_KEY environment variable is set to a valid API key.

The test cases here use keyword checks to verify the basics: does the response mention the order number, does it include relevant terms like “refund” or “shipping”, and does it avoid going off-topic? The last test case is particularly interesting since it checks that the bot doesn’t answer trivia questions.

Here is the output I got:

part_1_output.json
[
  {
    "input": "Where is my order #12345?",
    "response": "I'd be happy to help you track order #12345! However, I don't have access to your order information in our system right now.\n\nTo check your order status, you can:\n\n1. **Check your email** - Look for an order confirmation with tracking details\n2. **Log into your account** on our website and go to \"Order History\"\n3. **Contact our support team directly** with your order number - they'll have access to real-time tracking information\n\nIf you'd like, I can help you with:\n- General shipping timeframes\n- What to do if your order is delayed\n- How to update delivery preferences\n\nIs there anything specific about your order you're concerned about, or would you like help with something else?",
    "passed": true,
    "reason": "All checks passed",
    "tags": [
      "orders"
    ]
  },
  {
    "input": "I want a refund for my broken item",
    "response": "I'd be happy to help you with a refund for your broken item. To process this for you, I'll need a few details:\n\n1. **Order number** - This helps me locate your purchase\n2. **What item was broken?** - Please describe which product\n3. **When did you receive it?** - Approximate date is fine\n4. **What kind of damage occurred?** - This helps us improve our packaging and shipping\n\nOnce I have this information, I can start the refund process for you right away. Most refunds are processed within 5-7 business days.\n\nIs the item still in your possession, or would you like to include a return as part of this refund?",
    "passed": false,
    "reason": "Missing keyword: sorry",
    "tags": [
      "refunds"
    ]
  },
  {
    "input": "How long does shipping take to California?",
    "response": "Standard shipping to California typically takes **3-7 business days** from the time your order is processed.\n\nWe also offer expedited shipping options:\n- **2-Day Shipping**: Delivered within 2 business days\n- **Overnight Shipping**: Next business day delivery\n\n**Please note:**\n- Processing time (1-2 business days) is in addition to shipping time\n- Delivery times may vary based on your specific location in California\n- Orders placed on weekends/holidays are processed the next business day\n\nYou can see exact shipping options and costs for your specific address at checkout. Is there anything else you'd like to know about shipping?",
    "passed": true,
    "reason": "All checks passed",
    "tags": [
      "shipping"
    ]
  },
  {
    "input": "What's the capital of France?",
    "response": "The capital of France is **Paris**.\n\nHowever, I notice you've reached out to customer support for an online store. Is there anything I can help you with regarding your order, products, account, shipping, or any other store-related questions? I'm here to assist! \ud83d\ude0a",
    "passed": false,
    "reason": "Should not contain: Paris",
    "tags": [
      "off-topic"
    ]
  }
]

Two tests passed, two failed. The refund response didn’t include “sorry” (arguably a false alarm, since the response was still helpful), and the off-topic test failed because the bot answered the trivia question before redirecting.

This approach works well for basic sanity checks, but it has limits. Keywords can’t tell you if the tone is right or if the explanation is actually accurate. For that, we need something smarter.

Part 2: LLM-as-judge #

The idea here is simple: use one LLM to evaluate another LLM’s output. You give the judge model the original question, the bot’s response, and a rubric for what “good” looks like. The judge returns a score and reasoning.

part_2.py
import json

# Builds on part_1.py: reuses client, test_cases, get_response, and check_keywords.


def grade_response(user_input: str, bot_response: str) -> dict:
    grading_prompt = f"""Rate this customer support response on a 1-5 scale.

Customer message: {user_input}
Bot response: {bot_response}

Criteria:
- Helpfulness: Does it address the customer's actual question?
- Tone: Is it professional and appropriately empathetic?
- Accuracy: Is any policy/factual information correct?

Return only JSON with: {{"score": <1-5>, "reasoning": "<brief explanation>"}}"""

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=500,  # leave room for the judge's reasoning
        messages=[{"role": "user", "content": grading_prompt}],
    )

    raw = response.content[0].text.strip()
    # Strip markdown fences in case the judge wraps its JSON in them.
    if raw.startswith("```"):
        raw = raw.strip("`").removeprefix("json").strip()
    return json.loads(raw)


# Integrate into the eval loop from Part 1.
results = []
for case in test_cases:
    response = get_response(case["input"])
    keyword_passed, keyword_message = check_keywords(response, case)
    grade = grade_response(case["input"], response)

    results.append({
        "input": case["input"],
        "response": response,
        "tags": case["tags"],
        "keyword_check": {"passed": keyword_passed, "message": keyword_message},
        "llm_grade": grade,
    })

print(json.dumps(results, indent=2))

This integrates with the keyword checks from Part 1 so you get both types of evaluation in a single pass.

Here is the output I got:

part_2_output.json
[
  {
    "input": "Where is my order #12345?",
    "response": "I'd be happy to help you track order #12345! However, I don't have access to your order information in our system right now.\n\nTo help you locate your order, I can:\n\n1. **Check your email** - Look for an order confirmation or shipping notification from us that includes tracking information\n2. **Direct you to track it yourself** - Visit our website's order tracking page and enter your order number and email address\n3. **Escalate to our fulfillment team** - I can connect you with someone who has direct access to our order system\n\nCould you also provide:\n- The email address associated with your order\n- Approximate order date\n\nThis will help me assist you more effectively!",
    "tags": [
      "orders"
    ],
    "keyword_check": {
      "passed": true,
      "message": "All checks passed"
    },
    "llm_grade": {
      "score": 2,
      "reasoning": "While the tone is professional and friendly, the response fails to address the customer's direct question about their order location. The bot admits it cannot access order information, then provides workarounds that put the burden back on the customer (check your own email, track it yourself). For a simple 'where is my order' question, the customer expects a direct answer or at least an immediate escalation. The response is overly verbose and creates unnecessary friction by asking for information the system should already have (email address for an order number they possess). This would likely frustrate a customer seeking quick information."
    }
  },
  {
    "input": "I want a refund for my broken item",
    "response": "I'd be happy to help you with a refund for your broken item. To process this for you, I'll need a few details:\n\n1. **Order number** - This helps me locate your purchase\n2. **What item was broken?** - The product name or description\n3. **When did you receive it?** - To verify it's within our return window\n4. **How was it damaged?** - Was it broken upon arrival, or did it break during use?\n\nDo you also have any photos of the damage? While not always required, they can help speed up the refund process.\n\nOnce I have this information, I can guide you through the next steps to get your refund processed quickly.",
    "tags": [
      "refunds"
    ],
    "keyword_check": {
      "passed": false,
      "message": "Missing keyword: sorry"
    },
    "llm_grade": {
      "score": 5,
      "reasoning": "This response excels in all criteria. It directly addresses the refund request with empathy ('I'd be happy to help'), asks all necessary questions in a clear, organized format, explains why each piece of information is needed (building trust), and proactively mentions photos as optional but helpful. The tone is professional yet warm, and it ends with reassurance about quick processing. Without access to actual company policy, no factual errors are apparent. This is a model support response."
    }
  },
  {
    "input": "How long does shipping take to California?",
    "response": "For shipping to California, delivery times typically depend on the shipping method you choose:\n\n- **Standard Shipping**: 5-7 business days\n- **Expedited Shipping**: 3-4 business days\n- **Express Shipping**: 1-2 business days\n\nPlease note that these are estimates from the date your order ships (not the order date). Processing time before shipment is usually 1-2 business days.\n\nIs there anything else you'd like to know about shipping or your order?",
    "tags": [
      "shipping"
    ],
    "keyword_check": {
      "passed": true,
      "message": "All checks passed"
    },
    "llm_grade": {
      "score": 5,
      "reasoning": "This response excellently addresses the customer's question with comprehensive, well-organized information. It provides specific timeframes for multiple shipping options, includes an important clarification about processing time vs. shipping time, and maintains a professional yet friendly tone. The formatting with bullet points makes it easy to scan, and the follow-up question shows good customer service practice by offering additional help. Assuming the information is accurate to company policy, this is an exemplary support response."
    }
  },
  {
    "input": "What's the capital of France?",
    "response": "The capital of France is **Paris**.\n\nHowever, I noticed you've reached out to customer support for an online store. Is there anything I can help you with regarding your order, account, products, or any other store-related questions? I'm here to assist you! \ud83d\ude0a",
    "tags": [
      "off-topic"
    ],
    "keyword_check": {
      "passed": false,
      "message": "Should not contain: Paris"
    },
    "llm_grade": {
      "score": 4,
      "reasoning": "The response is helpful (correctly answers the question), accurate (Paris is correct), and professional with a friendly tone. It appropriately redirects the customer to store-related inquiries. Loses one point because the customer may have been testing the bot or asking legitimately, and the redirect, while appropriate, could feel slightly dismissive. Overall, this is a well-balanced response that handles an off-topic question gracefully."
    }
  }
]

Notice how the LLM judge catches things the keyword check misses. The refund response failed the keyword check (no “sorry”) but got a 5/5 from the judge because it was actually helpful and professional. Meanwhile, the off-topic response got a 4/5 from the judge since it handled the situation gracefully, even though our keyword check flagged it as a failure.

The tradeoff with LLM-as-judge is cost and latency. You’re making an extra API call for every test case. There’s also the meta-question of who judges the judge. In practice, you calibrate the judge by spot-checking its scores against your own judgment and tweaking the rubric until they align.
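
A minimal sketch of that calibration loop, assuming you have hand-scored a handful of real responses yourself (the human_scores entries below are hypothetical placeholders), might look like this:

# Reuses grade_response() from part_2.py.
# Replace these placeholders with real (input, response) pairs you scored by hand.
human_scores = [
    {"input": "I want a refund for my broken item", "response": "<bot reply here>", "score": 5},
    {"input": "What's the capital of France?", "response": "<bot reply here>", "score": 2},
]

aligned = 0
for item in human_scores:
    judge = grade_response(item["input"], item["response"])
    # Treat the judge as aligned if it lands within one point of your score.
    if abs(judge["score"] - item["score"]) <= 1:
        aligned += 1

print(f"Judge within 1 point of human on {aligned}/{len(human_scores)} cases")

If agreement is low, tighten the rubric (or add a few worked examples to the grading prompt) and re-check before trusting the judge’s scores.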

Part 3: Using promptfoo #

Once you have more than a handful of test cases, managing them in Python code gets unwieldy. This is where eval frameworks help. Promptfoo is a popular option that gives you YAML-based config, built-in assertion types, and CI-friendly output.

promptfooconfig.yaml
# promptfooconfig.yaml
description: Customer support bot evals

providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      system: "You are a helpful customer support agent for an online store."

prompts:
  - "{{message}}"

tests:
  - vars:
      message: "Where is my order #12345?"
    assert:
      - type: contains
        value: "order"
      - type: contains
        value: "12345"

  - vars:
      message: "I want a refund for my broken item"
    assert:
      - type: llm-rubric
        value: "Response is empathetic and explains the refund process"

  - vars:
      message: "What's the capital of France?"
    assert:
      - type: not-contains
        value: "Paris"
      - type: llm-rubric
        value: "Response politely redirects to store-related topics"

The config file defines your provider (the model), the prompt template, and a list of test cases with assertions. The llm-rubric assertion type uses LLM-as-judge under the hood.

Run it with:

npx promptfoo@latest eval

This is the output I got:

part_3_output.txt
Starting evaluation eval-BQz-2025-12-14T20:17:18
Running 3 test cases (up to 4 at a time)...
Evaluating [████████████████████████████████████████] 100% | 3/3 | anthropic:messages:claude-sonnet-4-5-20250929 "{{message}" message=Wh

┌─────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────┐
│ message                                                                             │ [anthropic:messages:claude-sonnet-4-5-20250929] {{message}}                         │
├─────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────┤
│ Where is my order #12345?                                                           │ [PASS] I'd be happy to help you track your order #12345! However, I don't have      │
│                                                                                     │ access to order tracking systems or customer databases.                             │
│                                                                                     │ To find your order status, you can:                                                 │
│                                                                                     │ 1. **Check your email** - Look for order confirmation and shipping updates from     │
│                                                                                     │ t...                                                                                │
├─────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────┤
│ I want a refund for my broken item                                                  │ [PASS] I'd be happy to help you with a refund for your broken item. To assist you   │
│                                                                                     │ best, I'll need a few details:                                                      │
│                                                                                     │ 1. **Order number** or purchase confirmation                                        │
│                                                                                     │ 2. **What item** is broken?                                                         │
│                                                                                     │ 3. **When did you receive it** and when did you notice the damage?                  │
│                                                                                     │ ...                                                                                 │
├─────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────┤
│ What's the capital of France?                                                       │ [FAIL] The capital of France is Paris.                                              │
└─────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────┘
===========================================================================================================================================================================
✔ Evaluation complete. ID: eval-BQz-2025-12-14T20:17:18

» Run promptfoo view to use the local web viewer
» Do you want to share this with your team? Sign up for free at https://promptfoo.app
» This project needs your feedback. What's one thing we can improve? https://promptfoo.dev/feedback
===========================================================================================================================================================================
Token Usage Summary:

  Evaluation:
    Total: 348
    Prompt: 0
    Completion: 0
    Cached: 348

  Grand Total: 348 tokens
===========================================================================================================================================================================
Duration: 0s (concurrency: 4)
Successes: 2
Failures: 1
Errors: 0
Pass Rate: 66.67%
===========================================================================================================================================================================

The coverage is similar to the Python version, but now the config is declarative and easy to extend. Promptfoo also gives you a web UI for exploring results (npx promptfoo view) and can output JSON for CI pipelines.

Tips for production #

Keep your prompts and eval sets in the same repository so changes get reviewed together. When someone modifies a prompt, the PR should include updated test cases if the expected behavior changed.

Run evals in CI before merging prompt changes. A failing eval should block the PR, just like a failing unit test would. This catches regressions before they hit production.
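
For the DIY loop from Part 1, one way to wire this in (a sketch; the 90% threshold is an arbitrary choice, not a recommendation) is to exit nonzero when the pass rate drops:

import sys

# Append to part_1.py: compute the pass rate and fail the CI job on regressions.
pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"Pass rate: {pass_rate:.0%}")

if pass_rate < 0.9:  # threshold is a project choice
    sys.exit(1)  # nonzero exit fails most CI pipelines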

When a real production issue surfaces, add it as a test case. Your eval suite should grow over time as you discover new failure modes in the wild.
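
With the Part 1 setup, that’s just another entry in test_cases, tagged so you can see where it came from; the incident below is invented for illustration:

# Hypothetical regression: the bot once promised refunds outside the policy window.
test_cases.append({
    "input": "Your ad said lifetime returns, so refund my order from 2021",
    "expected_keywords": ["return policy"],
    "tags": ["refunds", "regression"],
})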

Wrap-up #

You now have three approaches to choose from: a DIY Python loop for simple cases, LLM-as-judge for subjective criteria, and promptfoo for larger test suites. Start with the simplest approach that works for your use case and add complexity as needed.

Once you have basic evals working, consider adding safety checks (jailbreak attempts, PII handling) and tracking latency and cost per prompt version. But that’s a topic for another post.