Writing Evaluations¶
This guide explains how to create new evaluations for HolmesGPT. Evaluations test the system's ability to correctly diagnose issues and provide accurate recommendations.
- Evaluations Overview - Introduction to HolmesGPT's evaluation system
- Reporting with Braintrust - Analyze results and debug failures using Braintrust
Overview¶
HolmesGPT supports two types of evaluations:
- Ask Holmes Tests: Chat-like interactions (`tests/llm/test_ask_holmes.py`)
- Investigation Tests: Issue analysis for events triggered by AlertManager (`tests/llm/test_investigate.py`)
Each test consists of:
- A test case definition (`test_case.yaml`)
- Mock tool outputs (e.g., `kubectl_describe.txt`)
- Optional Kubernetes manifests for live testing
- Optional custom toolset configurations
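For example, a test folder combining these components might be laid out as follows (a sketch only; filenames other than `test_case.yaml` are illustrative):

```
tests/llm/fixtures/test_ask_holmes/99_pod_health_check/
├── test_case.yaml          # prompt, expected output, evaluation criteria
├── kubectl_describe.txt    # mock tool output
├── manifest.yaml           # optional: resources for live testing
└── toolsets.yaml           # optional: custom toolset configuration
```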
High-Level Steps¶
- Choose a test type: Ask Holmes vs Investigation. Choose Ask Holmes for most use cases; choose Investigation for issues triggered by AlertManager
- Create a test folder: Use the numbered naming convention
- Define your test case:
  - Create `test_case.yaml` with the prompt and expectations
  - Define kubectl or helm setup and teardown manifests
- Generate mock data: Capture tool outputs from a live system
- Set evaluation criteria: Define minimum scores for test success
- Test and iterate: Run the test and refine as needed
Step-by-Step Example: Creating an Ask Holmes Test¶
Let's create a simple test that asks about pod health status.
Step 1: Create Test Folder¶
mkdir tests/llm/fixtures/test_ask_holmes/99_pod_health_check
cd tests/llm/fixtures/test_ask_holmes/99_pod_health_check
Step 2: Create test_case.yaml¶
user_prompt: 'Is the nginx pod healthy?'
expected_output:
- nginx pod is healthy
evaluation:
correctness: 1
before_test: kubectl apply -f ./manifest.yaml
after_test: kubectl delete -f ./manifest.yaml
- `user_prompt`: The question that triggers Holmes' investigation.
- `expected_output`: A list of expected elements that MUST be found in Holmes' answer. The combination of these elements leads to a `correctness` score based on HolmesGPT's output. The `expected_output` is compared against HolmesGPT's answer and evaluated by an LLM ("LLM as judge"). The resulting score, called `correctness`, is binary with a value of either 0 or 1: the answer scores 0 if any expected element is missing, and 1 if all expected elements are present.
- `evaluation.correctness`: The expected correctness score, used by pytest to fail the test. This expected `correctness` score should be 0 unless you expect HolmesGPT to systematically pass the evaluation. Because of this, it is important to reduce `expected_output` to the minimally acceptable output from HolmesGPT.
- `before_test` and `after_test`: Setup and teardown steps that reproduce the test in a fresh environment. These are important because, as HolmesGPT's code, prompts, and toolsets evolve, the mocks become insufficient or inaccurate. These scripts are run automatically when the env var `RUN_LIVE=true` is set.
Step 3: Generate Mock Tool Outputs¶
Create mock files that simulate kubectl command outputs.
The best way to do this is to:
- Deploy the test case you want to build an eval for in a Kubernetes cluster (run the `before_test` script manually)
- Configure HolmesGPT to connect to the cluster (via kubectl and any other relevant toolsets)
- Enable automatic generation of mock files with the `--generate-mocks` CLI flag
- Run the eval repeatedly, e.g. `ITERATIONS=100 pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check --generate-mocks`
Step 4: Run the Test¶
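Run the new eval with pytest, selecting it by its folder name. A minimal sketch, assuming you run from the repository root (see the Debugging Commands section later in this guide for more variations):

```bash
# Run the eval against the mock files
pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check

# Or run it live against a real cluster (executes before_test/after_test)
RUN_LIVE=true pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check
```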
Test Case Configuration Reference¶
Common Fields¶
| Field | Type | Required | Description |
|---|---|---|---|
| `user_prompt` | string | Yes | The question or prompt for HolmesGPT |
| `expected_output` | string or list | Yes | Expected elements in the response |
| `evaluation` | dict | No | Minimum scores for the test to pass |
Ask Holmes Specific Fields¶
| Field | Type | Description |
|---|---|---|
| `before_test` | string | Command to run before the test (requires `RUN_LIVE=true`) |
| `after_test` | string | Command to run after the test for cleanup |
Investigation Specific Fields¶
| Field | Type | Description |
|---|---|---|
| `expected_sections` | dict | Required/prohibited sections in the output |
Example: Complex Investigation Test¶
user_prompt: "Investigate this CrashLoopBackOff issue"
expected_output:
- Pod is experiencing CrashLoopBackOff
- Container exits with code 1 due to configuration error
- Missing environment variable DATABASE_URL
expected_sections:
"Root Cause Analysis":
- CrashLoopBackOff
- configuration error
"Recommended Actions": true
"External Links": false
evaluation:
correctness: 0
before_test: kubectl apply -f ./manifest.yaml
after_test: kubectl delete -f ./manifest.yaml
Mock File Generation¶
Automatic Generation¶
Use the `--generate-mocks` CLI flag and run with a live cluster:
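```bash
# Generate fresh mocks from the live system (same command as in Debugging Commands below)
pytest ./tests/llm/test_ask_holmes.py -k "your_test" --generate-mocks
```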
Or to regenerate all existing mocks for consistency:
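```bash
# Regenerate ALL mock files for the selected tests
pytest ./tests/llm/test_ask_holmes.py -k "your_test" --regenerate-all-mocks
```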
This captures real tool outputs and saves them as mock files.
Manual Creation¶
Create files matching the tool names used by HolmesGPT:
- `kubectl_describe.txt` - Pod/resource descriptions
- `kubectl_logs.txt` - Container logs
- `kubectl_events.txt` - Kubernetes events
- `prometheus_query.txt` - Metrics data
- `fetch_loki_logs.txt` - Log aggregation results
Naming Convention¶
Mock files follow the pattern: {tool_name}_{additional_context}.txt
Examples:
- kubectl_describe_pod_nginx_default.txt
- kubectl_logs_all_containers_nginx.txt
- execute_prometheus_range_query.txt
Advanced Test Configuration¶
Toolset Configuration¶
Control which toolsets are available for a specific test by creating a `toolsets.yaml` file in the test directory:
toolsets:
kubernetes/core:
enabled: true
prometheus/metrics:
enabled: true
config:
prometheus_url: "http://custom-prometheus:9090"
prometheus_username: "admin"
prometheus_password: "secretpass"
grafana/dashboards:
enabled: false # Explicitly disable toolsets
# Enable non-default toolsets
rabbitmq/core:
enabled: true
config:
clusters:
- id: rabbitmq-test
username: guest
password: guest
management_url: http://rabbitmq:15672
Use cases:

- Test with limited toolsets available
- Provide custom configuration (URLs, credentials)
- Simulate environments where certain tools are unavailable
- Test error handling when expected tools are disabled
Mock Policy Control¶
Control mock behavior on a per-test basis by adding `mock_policy` to `test_case.yaml`:
user_prompt: "Check cluster health"
mock_policy: "always_mock" # Options: always_mock, never_mock, inherit
expected_output:
- Cluster is healthy
Options:

- `inherit` (default): Use global settings from environment/CLI flags
  - Recommended for most tests
  - Allows flexibility to run with or without mocks based on environment
- `never_mock`: Force live execution
  - Test is automatically skipped if `RUN_LIVE` is not set
  - Ensures the test always runs against real tools
  - Verifies actual tool behavior and integration
  - Preferred when you want to guarantee realistic testing
- `always_mock`: Always use mock data, even with `RUN_LIVE=true`
  - Ensures deterministic behavior
  - Use only when live testing is impractical (e.g., complex cluster setups, specific error conditions)
  - Note: Prefer `inherit` or `never_mock` when possible, as they test the agent more realistically and are less fragile
Custom Runbooks¶
Override the default runbook catalog for specific tests by adding a `runbooks` field to `test_case.yaml`:
user_prompt: "I'm experiencing DNS resolution issues in my kubernetes cluster"
expected_output:
- DNS troubleshooting runbook
- fetch_runbook
# Custom runbook catalog
runbooks:
catalog:
- update_date: "2025-07-26"
description: "Runbook for troubleshooting DNS issues in Kubernetes"
link: "k8s-dns-troubleshooting.md"
- update_date: "2025-07-26"
description: "Runbook for debugging pod crashes"
link: "pod-crash-debug.md"
The runbook markdown files (e.g., `k8s-dns-troubleshooting.md`) should be placed in the same directory as `test_case.yaml`.
Options:

- No `runbooks` field: Use the default system runbooks
- `runbooks: {}`: Empty catalog (no runbooks available)
- `runbooks: {catalog: [...]}`: Custom runbook catalog
This is useful for:

- Testing runbook selection logic
- Verifying behavior when no runbooks are available
- Testing custom troubleshooting procedures
- Ensuring runbooks are followed properly
Example Tests¶
The repository includes example tests demonstrating each feature:
- `_EXAMPLE_01_toolsets_config/` - Toolset configuration
- `_EXAMPLE_02_mock_policy_always/` - Always use mocks
- `_EXAMPLE_03_mock_policy_never/` - Force live execution
- `_EXAMPLE_04_custom_runbooks/` - Custom runbook configuration
- `_EXAMPLE_05_mock_generation/` - Mock generation workflow
- `_EXAMPLE_06_combined_features/` - Combining multiple features
Run examples:
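```bash
# One way to run the example tests: select them by their shared name prefix
pytest tests/llm/test_ask_holmes.py -k "_EXAMPLE"
```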
Live Testing with Real Resources¶
For tests that need actual Kubernetes resources:
Step 1: Create Manifest¶
manifest.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-nginx
spec:
replicas: 1
selector:
matchLabels:
app: test-nginx
template:
metadata:
labels:
app: test-nginx
spec:
containers:
- name: nginx
image: nginx:1.20
ports:
- containerPort: 80
Step 2: Configure Setup/Teardown¶
user_prompt: 'How is the test-nginx deployment performing?'
before_test: kubectl apply -f manifest.yaml
after_test: kubectl delete -f manifest.yaml
# ... rest of configuration
Step 3: Run Live Test¶
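A typical live run looks like this (a minimal sketch, reusing the placeholder test name from the Debugging Commands section):

```bash
# RUN_LIVE executes before_test/after_test and uses real tool output instead of mocks
RUN_LIVE=true pytest ./tests/llm/test_ask_holmes.py -k "your_test"
```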
Note: `RUN_LIVE` is currently incompatible with `ITERATIONS` > 1.
Evaluation Scoring¶
Correctness Score¶
Measures how well the output matches the expected elements:

- 1: Match
- 0: Mismatch
Setting Minimum Scores¶
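The minimum score is set via the `evaluation` block in `test_case.yaml`. For example, to require a correctness score of 1 for the test to pass:

```yaml
evaluation:
  correctness: 1   # test fails unless all expected elements are present in the answer
```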
Tagging¶
Evals are tagged for organisation and reporting purposes. The valid tags are defined in the test constants file in the repository.
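As a purely hypothetical sketch (the `tags` field name and values below are assumptions, not confirmed by this guide; check the test constants file for the authoritative list), a tagged test case might look like:

```yaml
# Hypothetical example only - verify the field name and valid tags against the test constants file
user_prompt: 'Is the nginx pod healthy?'
tags:
  - kubernetes
```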
Best Practices¶
Test Design¶
- Start simple: Begin with basic scenarios before complex edge cases
- Clear expectations: Write specific, measurable expected outputs
- Realistic scenarios: Base tests on actual user problems
- Incremental complexity: Build from simple to complex test cases
Mock Data Quality¶
- Representative data: Use realistic kubectl outputs and logs
- Error scenarios: Include failure modes and edge cases
- Consistent formatting: Match actual tool output formats
- Sufficient detail: Include enough information for proper diagnosis
- Run repeatedly: Run mock generation many times to ensure all investigative paths are covered by mock files
Troubleshooting Test Creation¶
Common Issues¶
Test always fails with low correctness score:

- Check if expected_output matches actual LLM capabilities
- Verify mock data provides sufficient information
- Consider lowering the score threshold temporarily

Missing tool outputs:

- Ensure mock files are named correctly
- Check that required toolsets are enabled
- Verify mock file content is properly formatted
Inconsistent results:
- Run multiple iterations: ITERATIONS=5
- Check for non-deterministic elements in prompts
- Consider using temperature=0 for more consistent outputs
Debugging Commands¶
# Verbose output showing all details
pytest -v -s ./tests/llm/test_ask_holmes.py -k "your_test"
# Generate fresh mocks from live system
pytest ./tests/llm/test_ask_holmes.py -k "your_test" --generate-mocks
# Or regenerate ALL mocks to ensure consistency
pytest ./tests/llm/test_ask_holmes.py -k "your_test" --regenerate-all-mocks
# Skip setup/cleanup for faster debugging
pytest ./tests/llm/test_ask_holmes.py -k "your_test" --skip-setup --skip-cleanup
# Run with specific number of iterations
ITERATIONS=10 pytest ./tests/llm/test_ask_holmes.py -k "your_test"
CLI Flags Reference¶
Custom HolmesGPT Flags:
- `--generate-mocks` - Generate mock files during test execution
- `--regenerate-all-mocks` - Regenerate all mock files (implies `--generate-mocks`)
- `--skip-setup` - Skip `before_test` commands
- `--skip-cleanup` - Skip `after_test` commands
Common Pytest Flags:
- `-n <number>` - Run tests in parallel
- `-k <pattern>` - Run tests matching a pattern
- `-m <marker>` - Run tests with a specific marker
- `-v`/`-vv` - Verbose output
- `-s` - Show print statements
- `--no-cov` - Disable coverage
- `--collect-only` - List tests without running them
This completes the evaluation writing guide. The next step is setting up reporting and analysis using Braintrust.