Writing Evaluations

This guide explains how to create new evaluations for HolmesGPT. Evaluations test the system's ability to correctly diagnose issues and provide accurate recommendations.

Overview

HolmesGPT supports two types of evaluations:

  1. Ask Holmes Tests: Chat-like interactions (tests/llm/test_ask_holmes.py)
  2. Investigation Tests: Issue analysis for events triggered by AlertManager (tests/llm/test_investigate.py)

Each test consists of:

  • A test case definition (test_case.yaml)
  • Mock tool outputs (e.g., kubectl_describe.txt)
  • Optional Kubernetes manifests for live testing
  • Optional custom toolset configurations
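
Put together, a typical test folder might look like the sketch below; the exact set of mock files depends on which tools HolmesGPT calls for your scenario:

tests/llm/fixtures/test_ask_holmes/99_pod_health_check/
  test_case.yaml        # prompt, expected output, and evaluation criteria
  kubectl_describe.txt  # mock tool output (one file per tool call)
  manifest.yaml         # optional: resources created by before_test
  toolsets.yaml         # optional: custom toolset configuration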

High-Level Steps

  1. Choose the test type: Ask Holmes vs. Investigation. Choose Ask Holmes for most use cases; choose Investigation for issues triggered by AlertManager
  2. Create a test folder: Use the numbered naming convention
  3. Define your test case:
     • Create test_case.yaml with the prompt and expectations
     • Define kubectl or helm setup and teardown manifests
  4. Generate mock data: Using a live system
  5. Set evaluation criteria: Define minimum scores for test success
  6. Test and iterate: Run the test and refine as needed

Step-by-Step Example: Creating an Ask Holmes Test

Let's create a simple test that asks about pod health status.

Step 1: Create Test Folder

mkdir tests/llm/fixtures/test_ask_holmes/99_pod_health_check
cd tests/llm/fixtures/test_ask_holmes/99_pod_health_check

Step 2: Create test_case.yaml

user_prompt: 'Is the nginx pod healthy?'
expected_output:
  - nginx pod is healthy
evaluation:
  correctness: 1
before_test: kubectl apply -f ./manifest.yaml
after_test: kubectl delete -f ./manifest.yaml
  • user_prompt: The question that will trigger Holmes' investigation.
  • expected_output: A list of elements that MUST be present in Holmes' answer. This list is compared against HolmesGPT's answer by an LLM acting as judge ('LLM as judge'), which produces a binary correctness score: 1 if all expected elements are present in the answer, 0 if any of them is missing.
  • evaluation.correctness: The expected correctness score; pytest fails the test if the actual score falls below it. This expected score should be 0 unless you expect HolmesGPT to pass the evaluation consistently. Because of this, it is important to reduce expected_output to the minimally acceptable output from HolmesGPT.
  • before_test and after_test: Setup and teardown steps that reproduce the test case in a fresh environment. They matter because, as HolmesGPT's code, prompts, and toolsets evolve, the mocks become insufficient or inaccurate. These scripts run automatically when the environment variable RUN_LIVE=true is set.
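
Because the correctness score is all-or-nothing, keep expected_output to the smallest set of assertions that still proves the diagnosis. The sketch below is illustrative, contrasting a minimal expected_output with an overly strict one for the same scenario:

# Minimal: asserts only the conclusion that matters
expected_output:
  - nginx pod is healthy

# Too strict: any one missing detail drops the score to 0
expected_output:
  - nginx pod is healthy
  - pod has been running without restarts
  - readiness probe is passing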

Step 3: Generate Mock Tool Outputs

Create mock files that simulate kubectl command outputs.

The best way to do this is to:

  1. Deploy the test case you want to build an eval for in a Kubernetes cluster (run the before_test script manually)
  2. Configure HolmesGPT to connect to the cluster (via kubectl and any other relevant toolsets)
  3. Enable auto-generation of mock files by setting generate_mocks: True in your test_case.yaml
  4. Repeatedly run the eval with ITERATIONS=100 pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check
  5. Remove the .AUTOGENERATED prefix from all autogenerated files (a shell sketch of the full workflow follows this list)
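
A minimal shell sketch of that workflow is shown below. The rename step assumes the generated mocks land in the test's fixture folder with a literal .AUTOGENERATED prefix on their filenames; adjust the paths and glob to match what the generator actually writes:

# 1. Create the resources the test investigates (the before_test step, run manually)
kubectl apply -f ./manifest.yaml

# 3-4. With generate_mocks: True in test_case.yaml, capture tool outputs over many runs
ITERATIONS=100 pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check

# 5. Strip the .AUTOGENERATED prefix (assumed filename pattern)
cd tests/llm/fixtures/test_ask_holmes/99_pod_health_check
for f in .AUTOGENERATED*; do mv "$f" "${f#.AUTOGENERATED}"; done

# Tear down the cluster resources when finished (the after_test step)
kubectl delete -f ./manifest.yaml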

Step 4: Run the Test

pytest ./tests/llm/test_ask_holmes.py -k "99_pod_health_check" -v

Test Case Configuration Reference

Common Fields

  • user_prompt (string, required): The question or prompt for HolmesGPT
  • expected_output (string or list, required): Expected elements in the response
  • evaluation (dict, optional): Minimum scores for the test to pass

Ask Holmes Specific Fields

  • before_test (string): Command to run before the test (requires RUN_LIVE=true)
  • after_test (string): Command to run after the test for cleanup
  • generate_mocks (boolean): Whether to generate new mock files
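
For example, a test that provisions its own resources and regenerates its mocks could combine these fields as follows (the values are illustrative and mirror the earlier example):

before_test: kubectl apply -f ./manifest.yaml
after_test: kubectl delete -f ./manifest.yaml
generate_mocks: true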

Investigation Specific Fields

  • expected_sections (dict): Required and prohibited sections in the output

Example: Complex Investigation Test

user_prompt: "Investigate this CrashLoopBackOff issue"
expected_output:
  - Pod is experiencing CrashLoopBackOff
  - Container exits with code 1 due to configuration error
  - Missing environment variable DATABASE_URL
expected_sections:
  "Root Cause Analysis":
    - CrashLoopBackOff
    - configuration error
  "Recommended Actions": true
  "External Links": false
evaluation:
  correctness: 0
before_test: kubectl apply -f ./manifest.yaml
after_test: kubectl delete -f ./manifest.yaml

Mock File Generation

Automatic Generation

Set generate_mocks: true in test_case.yaml and run with a live cluster:

ITERATIONS=100 pytest ./tests/llm/test_ask_holmes.py -k "your_test"

This captures real tool outputs and saves them as mock files.

Manual Creation

Create files matching the tool names used by HolmesGPT:

  • kubectl_describe.txt - Pod/resource descriptions
  • kubectl_logs.txt - Container logs
  • kubectl_events.txt - Kubernetes events
  • prometheus_query.txt - Metrics data
  • fetch_loki_logs.txt - Log aggregation results

Naming Convention

Mock files follow the pattern: {tool_name}_{additional_context}.txt

Examples:

  • kubectl_describe_pod_nginx_default.txt
  • kubectl_logs_all_containers_nginx.txt
  • execute_prometheus_range_query.txt
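
As a rough illustration, a kubectl describe mock for the nginx scenario could hold output shaped like the sketch below. The details are invented; whenever possible, capture real output from a live cluster instead of writing it by hand:

Name:         nginx
Namespace:    default
Status:       Running
Containers:
  nginx:
    Image:          nginx:1.20
    State:          Running
    Ready:          True
    Restart Count:  0
Events:           <none>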

Toolset Configuration

Some tests require specific toolsets. Create a toolsets.yaml file:

toolsets:
  - name: kubernetes
    enabled: true
  - name: prometheus
    enabled: true
    config:
      prometheus_url: http://localhost:9090 # requires port-forward
  - name: grafana_loki
    enabled: true
    config:
      base_url: http://localhost:3000 # requires port-forward
      api_key: "{{env.GRAFANA_API_KEY}}"
      grafana_datasource_uid: "{{env.GRAFANA_DATASOURCE_UID}}"
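
The localhost URLs above assume port-forwards to the in-cluster services are running. The namespaces and service names below are assumptions; substitute the ones from your own installation:

kubectl port-forward -n monitoring svc/prometheus-server 9090:80
kubectl port-forward -n monitoring svc/grafana 3000:80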

Live Testing with Real Resources

For tests that need actual Kubernetes resources:

Step 1: Create Manifest

manifest.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-nginx
  template:
    metadata:
      labels:
        app: test-nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.20
        ports:
        - containerPort: 80

Step 2: Configure Setup/Teardown

user_prompt: 'How is the test-nginx deployment performing?'
before_test: kubectl apply -f manifest.yaml
after_test: kubectl delete -f manifest.yaml
# ... rest of configuration

Step 3: Run Live Test

RUN_LIVE=true pytest ./tests/llm/test_ask_holmes.py -k "your_test"

RUN_LIVE is currently incompatible with ITERATIONS > 1.

Evaluation Scoring

Correctness Score

Measures how well the output matches the expected elements:

  • 1: all expected elements are present in the answer
  • 0: one or more expected elements are missing

Setting Minimum Scores

evaluation:
  correctness: 1

Best Practices

Test Design

  1. Start simple: Begin with basic scenarios before complex edge cases
  2. Clear expectations: Write specific, measurable expected outputs
  3. Realistic scenarios: Base tests on actual user problems
  4. Incremental complexity: Build from simple to complex test cases

Mock Data Quality

  1. Representative data: Use realistic kubectl outputs and logs
  2. Error scenarios: Include failure modes and edge cases
  3. Consistent formatting: Match actual tool output formats
  4. Sufficient detail: Include enough information for proper diagnosis
  5. Run repeatedly: Run mock generation many times to ensure all investigative paths are covered by mock files

Troubleshooting Test Creation

Common Issues

Test always fails with a low correctness score:

  • Check whether expected_output matches actual LLM capabilities
  • Verify the mock data provides sufficient information
  • Consider lowering the score threshold temporarily

Missing tool outputs:

  • Ensure mock files are named correctly
  • Check that the required toolsets are enabled
  • Verify the mock file content is properly formatted

Inconsistent results:

  • Run multiple iterations: ITERATIONS=5 (see the example below)
  • Check for non-deterministic elements in prompts
  • Consider using temperature=0 for more consistent outputs
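
For example, to gauge flakiness you can run the same test several times in a single invocation (the test name is a placeholder):

ITERATIONS=5 pytest ./tests/llm/test_ask_holmes.py -k "your_test"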

Debugging Commands

# Verbose output showing all details
pytest -v -s ./tests/llm/test_ask_holmes.py -k "your_test"

# Generate fresh mocks from live system
# set `generate_mocks: True` in test_case.yaml and then:
pytest ./tests/llm/test_ask_holmes.py -k "your_test"

This completes the evaluation writing guide. The next step is setting up reporting and analysis using Braintrust.