ROB-2584: Eval for KRR tool

Open • nherment opened this issue 1 month ago • 1 comment

Nov 19 '25 08:11 nherment

Results of HolmesGPT evals

ask_holmes: 29/36 test cases were successful, 6 regressions, 1 setup failure

Test suite	Test case	Status
ask	01_how_many_pods	:white_check_mark:
ask	02_what_is_wrong_with_pod	:white_check_mark:
ask	04_related_k8s_events	:white_check_mark:
ask	05_image_version	:white_check_mark:
ask	09_crashpod	:white_check_mark:
ask	10_image_pull_backoff	:white_check_mark:
ask	110_k8s_events_image_pull	:white_check_mark:
ask	11_init_containers	:x:
ask	13a_pending_node_selector_basic	:white_check_mark:
ask	14_pending_resources	:white_check_mark:
ask	15_failed_readiness_probe	:white_check_mark:
ask	17_oom_kill	:white_check_mark:
ask	18_oom_kill_from_issues_history	:white_check_mark:
ask	19_detect_missing_app_details	:white_check_mark:
ask	20_long_log_file_search	:x:
ask	24_misconfigured_pvc	:white_check_mark:
ask	24a_misconfigured_pvc_basic	:white_check_mark:
ask	28_permissions_error	:construction:
ask	39_failed_toolset	:white_check_mark:
ask	41_setup_argo	:white_check_mark:
ask	42_dns_issues_steps_new_tools	:white_check_mark:
ask	43_current_datetime_from_prompt	:white_check_mark:
ask	45_fetch_deployment_logs_simple	:white_check_mark:
ask	51_logs_summarize_errors	:white_check_mark:
ask	53_logs_find_term	:white_check_mark:
ask	54_not_truncated_when_getting_pods	:white_check_mark:
ask	59_label_based_counting	:white_check_mark:
ask	60_count_less_than	:white_check_mark:
ask	61_exact_match_counting	:white_check_mark:
ask	63_fetch_error_logs_no_errors	:white_check_mark:
ask	79_configmap_mount_issue	:white_check_mark:
ask	83_secret_not_found	:white_check_mark:
ask	86_configmap_like_but_secret	:x:
ask	93_calling_datadog[0]	:x:
ask	93_calling_datadog[1]	:x:
ask	93_calling_datadog[2]	:x:

Legend

:white_check_mark: the test was successful
:minus: the test was skipped
:warning: the test failed but is known to be flaky or known to fail
:construction: the test had a setup failure (not a code regression)
:wrench: the test failed due to mock data issues (not a code regression)
:no_entry_sign: the test was throttled by API rate limits/overload
:x: the test failed and should be fixed before merging the PR
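
For anyone who wants to re-derive the summary line from the table, here is a minimal sketch of how the statuses could be tallied using the legend above. This is not the HolmesGPT reporting code: the label names in `STATUS_LABELS` and the helpers `tally` and `blocks_merge` are hypothetical, chosen only for illustration, and the three sample rows are copied from the table.

```python
from collections import Counter

# Hypothetical labels for the emoji shortcodes listed in the legend.
STATUS_LABELS = {
    ":white_check_mark:": "passed",
    ":minus:": "skipped",
    ":warning:": "known_failure",
    ":construction:": "setup_failure",
    ":wrench:": "mock_failure",
    ":no_entry_sign:": "throttled",
    ":x:": "regression",
}


def tally(rows):
    """Count test cases per label from (suite, case, status) tuples."""
    counts = Counter()
    for _suite, _case, status in rows:
        counts[STATUS_LABELS[status]] += 1
    return counts


def blocks_merge(counts):
    """Per the legend, only :x: results must be fixed before merging."""
    return counts["regression"] > 0


# Three rows copied from the table above, one per status that appears in it.
rows = [
    ("ask", "01_how_many_pods", ":white_check_mark:"),
    ("ask", "11_init_containers", ":x:"),
    ("ask", "28_permissions_error", ":construction:"),
]

counts = tally(rows)
print(dict(counts))
print("blocks merge:", blocks_merge(counts))
```

Run over all 36 rows of the table, the same tally should reproduce the 29 successful, 6 regressions, 1 setup failure split reported at the top. Setup failures are kept in a separate bucket from regressions because the legend marks them as not being code regressions.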

Nov 19 '25 08:11 github-actions[bot]