Evaluations

HolmesGPT uses automated evaluations (evals) to ensure consistent performance across different LLM models and to catch regressions during development. These evaluations test the system's ability to correctly diagnose Kubernetes issues.

Overview

The eval system comprises two main test suites:

  • Ask Holmes: Tests single-question interactions with HolmesGPT
  • Investigate: Tests HolmesGPT's ability to investigate specific issues reported by AlertManager

Evals use fixtures that simulate real Kubernetes environments and tool outputs, allowing comprehensive testing without requiring live clusters.
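
For illustration, a fixture for a single test case might look roughly like this (the directory structure, file names, and fields shown are assumptions; check tests/llm/ in the repository for the actual layout):

# Hypothetical fixture layout for the 01_how_many_pods test case
ls tests/llm/fixtures/test_ask_holmes/01_how_many_pods/
# test_case.yaml        <- the question, the expected answer, optional before/after-test commands
# kubectl_get_pods.txt  <- mocked tool output returned to the LLM instead of calling a live cluster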

While results are tracked and analyzed using Braintrust, Braintrust is not required for writing, running, or debugging evals.

Example

Below is an example of a report added to pull requests to catch regressions:

Test suite Test case Status
ask_holmes 01_how_many_pods ⚠
ask_holmes 02_what_is_wrong_with_pod ✅
ask_holmes 02_what_is_wrong_with_pod_LOKI ✅
ask_holmes 03_what_is_the_command_to_port_forward ✅
ask_holmes 04_related_k8s_events ✅
ask_holmes 05_image_version ✅
ask_holmes 06_explain_issue ✅
ask_holmes 07_high_latency ✅
ask_holmes 07_high_latency_LOKI ✅
ask_holmes 08_sock_shop_frontend ✅
ask_holmes 09_crashpod ✅
ask_holmes 10_image_pull_backoff ✅
ask_holmes 11_init_containers ✅
ask_holmes 12_job_crashing ✅
ask_holmes 12_job_crashing_CORALOGIX ✅
ask_holmes 12_job_crashing_LOKI ⚠
ask_holmes 13_pending_node_selector ✅
ask_holmes 14_pending_resources ✅
ask_holmes 15_failed_readiness_probe ✅
ask_holmes 16_failed_no_toolset_found ✅
ask_holmes 17_oom_kill ✅
ask_holmes 18_crash_looping_v2 ✅
ask_holmes 19_detect_missing_app_details ✅
ask_holmes 20_long_log_file_search_LOKI ✅
ask_holmes 21_job_fail_curl_no_svc_account ⚠
ask_holmes 22_high_latency_dbi_down ✅
ask_holmes 23_app_error_in_current_logs ✅
ask_holmes 23_app_error_in_current_logs_LOKI ✅
ask_holmes 24_misconfigured_pvc ✅
ask_holmes 25_misconfigured_ingress_class ⚠
ask_holmes 26_multi_container_logs ⚠
ask_holmes 27_permissions_error_no_helm_tools ✅
ask_holmes 28_permissions_error_helm_tools_enabled ✅
ask_holmes 29_events_from_alert_manager ✅
ask_holmes 30_basic_promql_graph_cluster_memory ✅
ask_holmes 31_basic_promql_graph_pod_memory ✅
ask_holmes 32_basic_promql_graph_pod_cpu ✅
ask_holmes 33_http_latency_graph ✅
ask_holmes 34_memory_graph ✅
ask_holmes 35_tempo ✅
ask_holmes 36_argocd_find_resource ✅
ask_holmes 37_argocd_wrong_namespace ⚠
ask_holmes 38_rabbitmq_split_head ✅
ask_holmes 39_failed_toolset ✅
ask_holmes 40_disabled_toolset ✅
ask_holmes 41_setup_argo ✅
investigate 01_oom_kill ✅
investigate 02_crashloop_backoff ✅
investigate 03_cpu_throttling ✅
investigate 04_image_pull_backoff ✅
investigate 05_crashpod ✅
investigate 05_crashpod_LOKI ✅
investigate 06_job_failure ✅
investigate 07_job_syntax_error ✅
investigate 08_memory_pressure ✅
investigate 09_high_latency ✅
investigate 10_KubeDeploymentReplicasMismatch ✅
investigate 11_KubePodCrashLooping ✅
investigate 12_KubePodNotReady ✅
investigate 13_Watchdog ✅
investigate 14_tempo ✅

Legend

  • ✅ the test was successful
  • ⚠ the test failed but is known to be flaky or is expected to fail
  • ❌ the test failed and should be fixed before merging the PR

Why Evaluations Matter

Evaluations serve several critical purposes:

  1. Quality Assurance: Ensure HolmesGPT provides accurate diagnostics and recommendations
  2. Model Comparison: Compare performance across different LLM models (GPT-4, Claude, Gemini, etc.)
  3. Regression Testing: Catch performance degradations when updating code or dependencies
  4. Toolset Validation: Verify that new toolsets and integrations work correctly
  5. Continuous Improvement: Identify areas where HolmesGPT needs enhancement

How to Run Evaluations

Basic Usage

Run all evaluations:

pytest ./tests/llm/test_*.py

By default the tests load and present mock files to the LLM whenever it asks for them. If no mock file is present for a tool call, the call is passed through to the live tool itself. In many cases this causes the eval to fail unless the live environment (i.e., the Kubernetes cluster) matches what the LLM expects.
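
For example, the same test case can be run against mocks (the default) or fully live by setting RUN_LIVE, which is described under Environment Variables and Live Testing below:

# Default: mock files are presented to the LLM; missing mocks fall through to the live tools
pytest ./tests/llm/test_ask_holmes.py -k "01_how_many_pods"

# Live: ignore mock files and run the before-test/after-test commands against a real cluster
RUN_LIVE=true pytest ./tests/llm/test_ask_holmes.py -k "01_how_many_pods"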

Run specific test suite:

pytest ./tests/llm/test_ask_holmes.py
pytest ./tests/llm/test_investigate.py

Run a specific test case:

pytest ./tests/llm/test_ask_holmes.py -k "01_how_many_pods"

You can investigate and debug why an eval fails using the output printed to the console. It includes the correctness score, the reasoning behind the score, the tools that were called, the expected answer, and the LLM's actual answer.

Environment Variables

Configure evaluations using these environment variables:

  • MODEL (e.g. MODEL=anthropic/claude-3.5): Specify which LLM model to use.
  • CLASSIFIER_MODEL (e.g. CLASSIFIER_MODEL=gpt-4o): The LLM model used to score the answer (LLM as judge). Defaults to MODEL.
  • ITERATIONS (e.g. ITERATIONS=3): Run each test multiple times for consistency checking.
  • RUN_LIVE (e.g. RUN_LIVE=true): Execute before-test and after-test commands and ignore mock files.
  • BRAINTRUST_API_KEY (e.g. BRAINTRUST_API_KEY=sk-1dh1...swdO02): API key for Braintrust integration.
  • UPLOAD_DATASET (e.g. UPLOAD_DATASET=true): Sync the dataset to Braintrust (safe; datasets are separated by branch).
  • PUSH_EVALS_TO_BRAINTRUST (e.g. PUSH_EVALS_TO_BRAINTRUST=true): Upload evaluation results to Braintrust.
  • EXPERIMENT_ID (e.g. EXPERIMENT_ID=my_baseline): Custom experiment name for result tracking.
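
As a quick sketch, several of these variables can be combined in a single invocation (the model names here are just examples):

MODEL=anthropic/claude-3.5 \
CLASSIFIER_MODEL=gpt-4o \
ITERATIONS=3 \
pytest -n 10 ./tests/llm/test_ask_holmes.py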

Simple Example

Run a comprehensive evaluation:

export MODEL=gpt-4o

# Run with parallel execution for speed
pytest -n 10 ./tests/llm/test_*.py

Live Testing

For tests that require actual Kubernetes resources:

export RUN_LIVE=true

pytest ./tests/llm/test_ask_holmes.py -k "specific_test"

Live testing requires a Kubernetes cluster and executes each test's before-test and after-test commands to set up and tear down resources. Not all tests support live testing; some require manual setup.

Model Comparison Workflow

  1. Create Baseline: Run evaluations with a reference model

    EXPERIMENT_ID=baseline_gpt4o MODEL=gpt-4o pytest -n 10 ./tests/llm/test_*
    

  2. Test New Model: Run evaluations with the model you want to compare

    EXPERIMENT_ID=test_claude35 MODEL=anthropic/claude-3.5 pytest -n 10 ./tests/llm/test_*
    

  3. Compare Results: Use Braintrust dashboard to analyze performance differences
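
For the comparison in step 3 to work, the results from steps 1 and 2 must be pushed to Braintrust. A minimal sketch, assuming the variables behave as described in the Environment Variables section:

export BRAINTRUST_API_KEY=sk-...          # your Braintrust API key
export PUSH_EVALS_TO_BRAINTRUST=true      # upload evaluation results to Braintrust

EXPERIMENT_ID=baseline_gpt4o MODEL=gpt-4o pytest -n 10 ./tests/llm/test_*
EXPERIMENT_ID=test_claude35 MODEL=anthropic/claude-3.5 pytest -n 10 ./tests/llm/test_*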

Troubleshooting

Common Issues

  1. Missing BRAINTRUST_API_KEY: Some tests are skipped without this key
  2. Live test failures: Ensure Kubernetes cluster access and proper permissions
  3. Mock file mismatches: Regenerate mocks with generate_mocks: true (see the sketch after this list)
  4. Timeout errors: Increase test timeout or check network connectivity
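
Regarding mock file mismatches, the generate_mocks: true setting is typically placed in the test case definition rather than passed on the command line. A minimal sketch under that assumption (the file path and exact placement are assumptions):

# Hypothetical: enable mock regeneration for a single test case by adding
#   generate_mocks: true
# to tests/llm/fixtures/test_ask_holmes/01_how_many_pods/test_case.yaml, then re-run it:
pytest ./tests/llm/test_ask_holmes.py -k "01_how_many_pods"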

Debug Mode

Enable verbose output:

pytest -v -s ./tests/llm/test_ask_holmes.py -k "specific_test"

This shows detailed output including:

  • Expected vs actual results
  • Tool calls made by the LLM
  • Evaluation scores and rationales
  • Debugging information