Writing Evaluations¶
This guide explains how to create new evaluations for HolmesGPT. Evaluations test the system's ability to correctly diagnose issues and provide accurate recommendations.
- Evaluations Overview - Introduction to HolmesGPT's evaluation system
- Reporting with Braintrust - Analyze results and debug failures using Braintrust
Overview¶
HolmesGPT supports two types of evaluations:
- Ask Holmes Tests: Chat-like interactions (tests/llm/test_ask_holmes.py)
- Investigation Tests: Issue analysis for events triggered by AlertManager (tests/llm/test_investigate.py)
Each test consists of:
- A test case definition (test_case.yaml)
- Mock tool outputs (e.g., kubectl_describe.txt)
- Optional Kubernetes manifests for live testing
- Optional custom toolset configurations
High-Level Steps¶
- Choose a test type: Ask Holmes or Investigation. Use Ask Holmes for most cases; use Investigation for issues triggered by AlertManager
- Create a test folder: Use the numbered naming convention
- Define your test case:
  - Create test_case.yaml with the prompt and expectations
  - Define kubectl or helm setup and teardown manifests
- Generate mock data: Capture tool outputs from a live system
- Set evaluation criteria: Define minimum scores for test success
- Test and iterate: Run the test and refine as needed
Step-by-Step Example: Creating an Ask Holmes Test¶
Let's create a simple test that asks about pod health status.
Step 1: Create Test Folder¶
mkdir tests/llm/fixtures/test_ask_holmes/99_pod_health_check
cd tests/llm/fixtures/test_ask_holmes/99_pod_health_check
Step 2: Create test_case.yaml¶
user_prompt: 'Is the nginx pod healthy?'
expected_output:
- nginx pod is healthy
evaluation:
correctness: 1
before_test: kubectl apply -f ./manifest.yaml
after_test: kubectl delete -f ./manifest.yaml
- user_prompt: The question that triggers Holmes' investigation.
- expected_output: A list of elements that MUST be found in Holmes' answer. The expected_output is compared against HolmesGPT's answer by an LLM ('LLM as judge'), which produces the correctness score. This score is binary: the answer is scored 0 if any expected element is missing, and 1 if all expected elements are present.
- evaluation.correctness: The expected correctness score, used by pytest to decide whether the test fails. It should be 0 unless you expect HolmesGPT to systematically pass the evaluation. Because of this, it is important to reduce expected_output to the minimally acceptable output from HolmesGPT.
- before_test and after_test: Setup and teardown steps that reproduce the test on a fresh environment. They are important because, as HolmesGPT's code, prompts, and toolsets evolve, the mocks become insufficient or inaccurate. These scripts run automatically when the env var RUN_LIVE=true is set.
Step 3: Generate Mock Tool Outputs¶
Create mock files that simulate kubectl command outputs.
The best way to do this is to:
- Deploy the test case you want to build an eval for in a Kubernetes cluster (run the before_test script manually)
- Configure HolmesGPT to connect to the cluster (via kubectl and any other relevant toolsets)
- Enable automatic mock generation by setting generate_mocks: True in your test_case.yaml
- Repeatedly run the eval with ITERATIONS=100 pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check
- Remove the .AUTOGENERATED prefix from all autogenerated files (a shell sketch follows this list)
Step 4: Run the Test¶
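Using the test folder created above, run the eval against the recorded mocks, or live against the cluster (a live run executes before_test and after_test):
# Run using the mock tool outputs
pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check
# Run live against the cluster (executes before_test/after_test)
RUN_LIVE=true pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check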
Test Case Configuration Reference¶
Common Fields¶
Field | Type | Required | Description |
---|---|---|---|
user_prompt | string | Yes | The question or prompt for HolmesGPT |
expected_output | string or list | Yes | Expected elements in the response |
evaluation | dict | No | Minimum scores for the test to pass |
Ask Holmes Specific Fields¶
Field | Type | Description |
---|---|---|
before_test | string | Command to run before the test (requires RUN_LIVE=true) |
after_test | string | Command to run after the test for cleanup |
generate_mocks | boolean | Whether to generate new mock files |
Investigation Specific Fields¶
Field | Type | Description |
---|---|---|
expected_sections | dict | Required/prohibited sections in the output |
Example: Complex Investigation Test¶
user_prompt: "Investigate this CrashLoopBackOff issue"
expected_output:
- Pod is experiencing CrashLoopBackOff
- Container exits with code 1 due to configuration error
- Missing environment variable DATABASE_URL
expected_sections:
"Root Cause Analysis":
- CrashLoopBackOff
- configuration error
"Recommended Actions": true
"External Links": false
evaluation:
correctness: 0
before_test: kubectl apply -f ./manifest.yaml
after_test: kubectl delete -f ./manifest.yaml
Mock File Generation¶
Automatic Generation¶
Set generate_mocks: true in test_case.yaml and run with a live cluster.
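For example (test name taken from the walkthrough above):
pytest tests/llm/test_ask_holmes.py -k 99_pod_health_check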
This captures real tool outputs and saves them as mock files.
Manual Creation¶
Create files matching the tool names used by HolmesGPT:
- kubectl_describe.txt - Pod/resource descriptions
- kubectl_logs.txt - Container logs
- kubectl_events.txt - Kubernetes events
- prometheus_query.txt - Metrics data
- fetch_loki_logs.txt - Log aggregation results
Naming Convention¶
Mock files follow the pattern: {tool_name}_{additional_context}.txt
Examples:
- kubectl_describe_pod_nginx_default.txt
- kubectl_logs_all_containers_nginx.txt
- execute_prometheus_range_query.txt
Toolset Configuration¶
Some tests require specific toolsets. Create a toolsets.yaml
file:
toolsets:
- name: kubernetes
enabled: true
- name: prometheus
enabled: true
config:
prometheus_url: http://localhost:9090 # requires port-forward
- name: grafana_loki
enabled: true
config:
base_url: http://localhost:3000 # requires port-forward
api_key: "{{env.GRAFANA_API_KEY}}"
grafana_datasource_uid: "{{env.GRAFANA_DATASOURCE_UID}}"
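The port-forward comments above assume commands along these lines; the service names and namespaces are placeholders and will differ per cluster:
# assumed service names/namespaces; adjust to your cluster
kubectl port-forward -n monitoring svc/prometheus-server 9090:9090
kubectl port-forward -n monitoring svc/grafana 3000:3000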
Live Testing with Real Resources¶
For tests that need actual Kubernetes resources:
Step 1: Create Manifest¶
manifest.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-nginx
spec:
replicas: 1
selector:
matchLabels:
app: test-nginx
template:
metadata:
labels:
app: test-nginx
spec:
containers:
- name: nginx
image: nginx:1.20
ports:
- containerPort: 80
Step 2: Configure Setup/Teardown¶
user_prompt: 'How is the test-nginx deployment performing?'
before_test: kubectl apply -f manifest.yaml
after_test: kubectl delete -f manifest.yaml
# ... rest of configuration
Step 3: Run Live Test¶
Note: RUN_LIVE is currently incompatible with ITERATIONS > 1.
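A live run connects to the current cluster, executes before_test, runs the eval, and then runs after_test; for example (test name assumed):
RUN_LIVE=true pytest tests/llm/test_ask_holmes.py -k "your_test"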
Evaluation Scoring¶
Correctness Score¶
Measures whether the output contains the expected elements:
- 1: All expected elements are present
- 0: One or more expected elements are missing
Setting Minimum Scores¶
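Minimum scores are declared under the evaluation key in test_case.yaml, as in the examples above; the test fails if HolmesGPT scores below them. For example, requiring a perfect correctness score:
evaluation:
  correctness: 1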
Best Practices¶
Test Design¶
- Start simple: Begin with basic scenarios before complex edge cases
- Clear expectations: Write specific, measurable expected outputs
- Realistic scenarios: Base tests on actual user problems
- Incremental complexity: Build from simple to complex test cases
Mock Data Quality¶
- Representative data: Use realistic kubectl outputs and logs
- Error scenarios: Include failure modes and edge cases
- Consistent formatting: Match actual tool output formats
- Sufficient detail: Include enough information for proper diagnosis
- Run repeatedly: Run mock generation many times to ensure all investigative paths are covered by mock files
Troubleshooting Test Creation¶
Common Issues¶
Test always fails with low correctness score:
- Check if expected_output matches actual LLM capabilities
- Verify mock data provides sufficient information
- Consider lowering the score threshold temporarily
Missing tool outputs:
- Ensure mock files are named correctly
- Check that required toolsets are enabled
- Verify mock file content is properly formatted
Inconsistent results:
- Run multiple iterations: ITERATIONS=5 (see the example after this list)
- Check for non-deterministic elements in prompts
- Consider using temperature=0 for more consistent outputs
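For example, combining a test filter with multiple iterations (test name assumed):
ITERATIONS=5 pytest tests/llm/test_ask_holmes.py -k "your_test"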
Debugging Commands¶
# Verbose output showing all details
pytest -v -s ./tests/llm/test_ask_holmes.py -k "your_test"
# Generate fresh mocks from live system
# set `generate_mocks: True` in `test_case.yaml` and then:
pytest ./tests/llm/test_ask_holmes.py -k "your_test"
This completes the evaluation writing guide. The next step is setting up reporting and analysis using Braintrust.