# Evaluations
HolmesGPT uses automated evaluations (evals) to ensure consistent performance across different LLM models and to catch regressions during development. These evaluations test the system's ability to correctly diagnose Kubernetes issues.
- Writing Evaluations - Learn how to create new test cases and evaluations
- Reporting with Braintrust - Analyze results and debug failures using Braintrust
## Overview
The eval system comprises two main test suites:
- Ask Holmes: Tests single-question interactions with HolmesGPT
- Investigate: Tests HolmesGPT's ability to investigate specific issues reported by AlertManager
Evals use fixtures that simulate real Kubernetes environments and tool outputs, allowing comprehensive testing without requiring live clusters.
While results are tracked and analyzed using Braintrust, Braintrust is not required for writing, running, or debugging evals.
## Example
Below is an example of a report added to pull requests to catch regressions:
Legend:

- the test was successful
- the test failed but is known to be flaky or known to fail
- the test failed and should be fixed before merging the PR
## Why Evaluations Matter
Evaluations serve several critical purposes:
- Quality Assurance: Ensure HolmesGPT provides accurate diagnostics and recommendations
- Model Comparison: Compare performance across different LLM models (GPT-4, Claude, Gemini, etc.)
- Regression Testing: Catch performance degradations when updating code or dependencies
- Toolset Validation: Verify that new toolsets and integrations work correctly
- Continuous Improvement: Identify areas where HolmesGPT needs enhancement
## How to Run Evaluations

### Basic Usage
Run all evaluations:
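A minimal sketch, assuming a poetry-managed checkout with the eval suites under `tests/llm/` (the path may differ in your copy of the repository):

```bash
# Run every eval (Ask Holmes and Investigate); mock files are used where available
poetry run pytest tests/llm/
```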
By default, the tests load and present mock files to the LLM whenever it asks for them. If a mock file is not present for a tool call, the call is passed through to the live tool itself. In many cases this causes the eval to fail unless the live environment (a Kubernetes cluster) matches what the LLM expects.
Run a specific test suite:
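For example, assuming each suite lives in its own pytest file:

```bash
# Run only the Ask Holmes suite (file name is an assumption; check your checkout)
poetry run pytest tests/llm/test_ask_holmes.py
```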
Run a specific test case:
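One way to do this is with pytest's `-k` filter; the test id below is purely illustrative:

```bash
# Select a single test case by name
poetry run pytest tests/llm/test_ask_holmes.py -k "01_how_many_pods"
```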
You can investigate and debug a failing eval from the output provided in the console. The output includes the correctness score, the reasoning behind the score, information about which tools were called, the expected answer, and the LLM's answer.
### Environment Variables
Configure evaluations using these environment variables:
| Variable | Example | Description |
|---|---|---|
| `MODEL` | `MODEL=anthropic/claude-3.5` | Specify which LLM model to use |
| `CLASSIFIER_MODEL` | `CLASSIFIER_MODEL=gpt-4o` | The LLM model used to score the answer (LLM as judge). Defaults to `MODEL` |
| `ITERATIONS` | `ITERATIONS=3` | Run each test multiple times for consistency checking |
| `RUN_LIVE` | `RUN_LIVE=true` | Execute before-test and after-test commands and ignore mock files |
| `BRAINTRUST_API_KEY` | `BRAINTRUST_API_KEY=sk-1dh1...swdO02` | API key for Braintrust integration |
| `UPLOAD_DATASET` | `UPLOAD_DATASET=true` | Sync the dataset to Braintrust (safe; separated by branch) |
| `PUSH_EVALS_TO_BRAINTRUST` | `PUSH_EVALS_TO_BRAINTRUST=true` | Upload evaluation results to Braintrust |
| `EXPERIMENT_ID` | `EXPERIMENT_ID=my_baseline` | Custom experiment name for result tracking |
### Simple Example
Run a comprehensive evaluation:
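A hedged sketch combining several of the environment variables above (model names, iteration count, and test paths are illustrative):

```bash
# Choose the model under test and the judge model, and repeat each test three times
MODEL=anthropic/claude-3.5 CLASSIFIER_MODEL=gpt-4o ITERATIONS=3 \
  poetry run pytest tests/llm/
```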
### Live Testing
For tests that require actual Kubernetes resources:
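For example (requires working access to a Kubernetes cluster; the test path is an assumption):

```bash
# Ignore mock files and run the before-test/after-test commands against a real cluster
RUN_LIVE=true poetry run pytest tests/llm/test_ask_holmes.py
```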
Live testing requires a Kubernetes cluster and will execute `before-test` and `after-test` commands to set up and tear down resources. Not all tests support live testing; some tests require manual setup.
## Model Comparison Workflow
1. Create Baseline: Run evaluations with a reference model
2. Test New Model: Run evaluations with the model you want to compare (see the sketch below)
3. Compare Results: Use the Braintrust dashboard to analyze performance differences
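A hedged sketch of steps 1 and 2, assuming `BRAINTRUST_API_KEY` is set and using illustrative model names, experiment names, and test paths:

```bash
# Step 1: baseline run, uploaded to Braintrust under its own experiment name
EXPERIMENT_ID=baseline_gpt_4o MODEL=gpt-4o PUSH_EVALS_TO_BRAINTRUST=true \
  poetry run pytest tests/llm/

# Step 2: candidate run with the model you want to compare
EXPERIMENT_ID=candidate_claude MODEL=anthropic/claude-3.5 PUSH_EVALS_TO_BRAINTRUST=true \
  poetry run pytest tests/llm/
```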
## Troubleshooting

### Common Issues
- Missing `BRAINTRUST_API_KEY`: Some tests are skipped without this key
- Live test failures: Ensure Kubernetes cluster access and proper permissions
- Mock file mismatches: Regenerate mocks with `generate_mocks: true`
- Timeout errors: Increase the test timeout or check network connectivity
### Debug Mode
Enable verbose output:
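One way to do this is through pytest's own verbosity and output-capture flags (the test path is an assumption):

```bash
# -v increases verbosity; -s disables output capture so eval details print to the console
poetry run pytest tests/llm/test_ask_holmes.py -v -s
```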
This shows detailed output including:

- Expected vs actual results
- Tool calls made by the LLM
- Evaluation scores and rationales
- Debugging information