Working Holmes with KAITO integration - tool calling verified
This PR fixes tool calling issues that broke Holmes when using KAITO-deployed models on AKS. The changes have been tested and verified against both fresh Holmes installations and existing KAITO deployments.
Walkthrough
This PR introduces KAITO-specific improvements to Holmes' LLM system by adding environment-based tool choice configuration, new accuracy and formatting prompts, revised system prompt guidance, disabled CoreInvestigationToolset defaults, custom endpoint support for classifiers, and a comprehensive evaluation runner script with Braintrust integration.
Changes
| Cohort / File(s) | Summary |
|---|---|
| **Core Logic Modifications**<br>`holmes/core/llm.py`, `holmes/core/tool_calling_llm.py` | Tool choice handling broadened from requiring the literal string `"auto"` to accepting any truthy value; the environment variable `HOLMES_TOOL_CHOICE` now sources the tool choice instead of a hardcoded `"auto"`, with debug logging added. |
| **System Prompt Updates**<br>`holmes/core/prompt.py` | Added KAITO-specific anti-JSON enforcement and conciseness blocks to the system prompt additions; commented out the TodoWrite system reminder with a KAITO patch comment. |
| **Prompt Templates**<br>`holmes/plugins/prompts/_kaito_accuracy.jinja2`, `holmes/plugins/prompts/generic_ask.jinja2` | New KAITO accuracy template with counting/verification guidance; the `generic_ask` template gains tool-call workflows, natural-language response requirements, a JSON prohibition, conciseness enforcement, and numerical-accuracy examples. |
| **Toolset Configuration**<br>`holmes/plugins/toolsets/investigator/core_investigation.py` | `CoreInvestigationToolset` initialization changed to be disabled by default (`enabled=False`, `is_default=False`). |
| **Evaluation Infrastructure**<br>`run_kaito_evals.sh`, `tests/llm/test_ask_holmes.py`, `tests/llm/utils/classifiers.py`, `tests/llm/utils/mock_toolset.py` | New bash runner script for KAITO evaluations with CLI parsing, environment orchestration, and Braintrust integration; `max_steps` reduced from 40 to 10; classifier endpoint support via environment variables; `KAITO_CONFIG_PATH` override for toolset loading. |
| **Configuration & Documentation**<br>`pyproject.toml`, `kaito_improvements.md` | File-based logging disabled in `pyproject.toml`; new KAITO improvements document detailing strategies for reducing hallucinations and counting errors. |
| **Empty/Placeholder Files**<br>`Set`, `environment`, `variables`, `kaito_eval_output.log`, `model` | Created empty placeholder files with no executable content. |
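The toolset-default change in the table above can be sketched as follows; the base-class constructor here is a simplified stand-in for illustration, not the actual Holmes `Toolset` API:

```python
# Simplified stand-in for Holmes' Toolset base class (the real constructor
# takes more parameters; this signature is an illustrative assumption).
class Toolset:
    def __init__(self, name: str, enabled: bool = True, is_default: bool = True):
        self.name = name
        self.enabled = enabled
        self.is_default = is_default


class CoreInvestigationToolset(Toolset):
    def __init__(self) -> None:
        # KAITO patch: disabled by default, so the toolset must be
        # explicitly enabled in configuration to participate.
        super().__init__("investigator/core", enabled=False, is_default=False)
```

The effect is that the investigator toolset no longer runs unless an operator opts in, which is the behavior the PR summary describes.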
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test as Test Runner
    participant Tool as ToolCallingLLM
    participant Env as Environment
    participant LLM as LLM Client
    participant Handler as Tool Handler
    Test->>Env: Check HOLMES_TOOL_CHOICE
    Env-->>Tool: Return tool choice value (env var or "auto")
    activate Tool
    alt tools present and tool_choice truthy
        Tool->>LLM: call(tools=tools, tool_choice=HOLMES_TOOL_CHOICE)
        LLM->>Handler: Process tool calls
        Handler-->>LLM: Tool results
        LLM-->>Tool: LLM response with tool results
        Tool->>Tool: Log HOLMES_TOOL_CHOICE value
    else no tools or falsy tool_choice
        Tool->>LLM: call(tools=None)
        LLM-->>Tool: Direct LLM response
    end
    deactivate Tool
    Tool-->>Test: Final response
```
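The flow in the diagram above amounts to a small resolution step. A minimal sketch, assuming the function name is illustrative (the real logic lives in `holmes/core/llm.py` and `holmes/core/tool_calling_llm.py`):

```python
import logging
import os


def resolve_tool_choice(tools):
    """Return the tool_choice to pass to the LLM, or None to skip tools.

    Illustrative sketch: reads HOLMES_TOOL_CHOICE from the environment
    (defaulting to "auto") and applies the truthy check described above.
    """
    tool_choice = os.environ.get("HOLMES_TOOL_CHOICE", "auto")
    logging.debug("HOLMES_TOOL_CHOICE=%s", tool_choice)
    # Any truthy value is accepted (not just the literal "auto");
    # with no tools or a falsy value, tools are omitted entirely.
    if tools and tool_choice:
        return tool_choice
    return None
```

The key behavioral change is the truthy check: values like `"required"` or a specific tool name pass through, whereas the old code only honored the exact string `"auto"`.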
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
- `run_kaito_evals.sh`: High logic density, with complex argument parsing, environment orchestration, and test command construction; requires careful validation of all conditional branches and environment variable handling.
- `generic_ask.jinja2`: Significant prompt logic changes affecting LLM behavior and response structure; requires understanding the cascading effects on model outputs and tool-calling workflows.
- `tests/llm/utils/classifiers.py`: Multiple endpoint handling paths (custom/KAITO, Azure, standard OpenAI) with new environment variable dependencies; the logic across `create_llm_client` and `evaluate_correctness` requires tracing the effective URL/key/model resolution.
- `holmes/core/tool_calling_llm.py` & `holmes/core/llm.py`: Tool choice logic changes affecting broader system behavior; the interaction between the truthy check and environment variable sourcing needs validation.
- Coherence across multiple modified files: Changes to prompts, tool choice, endpoints, and toolset defaults interact; a holistic understanding is required to validate intended behavior.
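The URL/key/model resolution called out for `tests/llm/utils/classifiers.py` could look roughly like this; the environment variable names and the fallback order are assumptions for illustration, not necessarily the exact ones in the PR:

```python
import os


def resolve_classifier_config(env=os.environ):
    """Pick the effective base URL / API key / model for the classifier.

    Illustrative resolution order (an assumption): a custom/KAITO
    endpoint wins, then Azure, then standard OpenAI defaults.
    """
    if env.get("CLASSIFIER_BASE_URL"):  # custom / KAITO endpoint
        return {
            "base_url": env["CLASSIFIER_BASE_URL"],
            "api_key": env.get("CLASSIFIER_API_KEY", "not-needed"),
            "model": env.get("CLASSIFIER_MODEL", "local-model"),
        }
    if env.get("AZURE_API_BASE"):  # Azure OpenAI deployment
        return {
            "base_url": env["AZURE_API_BASE"],
            "api_key": env.get("AZURE_API_KEY", ""),
            "model": env.get("CLASSIFIER_MODEL", "gpt-4o"),
        }
    # standard OpenAI fallback
    return {
        "base_url": "https://api.openai.com/v1",
        "api_key": env.get("OPENAI_API_KEY", ""),
        "model": env.get("CLASSIFIER_MODEL", "gpt-4o"),
    }
```

Tracing which branch a given test environment hits is exactly the review work the effort estimate above is flagging.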
Possibly related PRs
- robusta-dev/holmesgpt#563: Modifies toolset default/enabled behavior (KubernetesLogsToolset and mock toolset enabling), directly related to CoreInvestigationToolset disabled-by-default changes.
- robusta-dev/holmesgpt#823: Adds AI safety partial to prompt templates and modifies system guidance blocks similar to KAITO-specific accuracy and conciseness additions in this PR.
- robusta-dev/holmesgpt#729: Modifies `generic_ask.jinja2` prompts and the `tests/llm/utils/mock_toolset.py` loader in overlapping ways.
Suggested reviewers
- Sheeproid
- arikalon1
- moshemorad
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: fixing tool calling issues for Holmes with KAITO integration and verifying the fix works. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the problem being fixed (tool calling issues with KAITO models) and that the solution has been tested. |
moved to https://github.com/HolmesGPT/holmesgpt/pull/1186