
ROB-2161 improved truncation methodology

Open · nherment opened this issue 3 months ago · 2 comments

nherment · Sep 19 '25 14:09

Walkthrough

Adds StructuredToolResult.llm_data and per-tool token accounting (ToolCallResult.size); integrates hierarchical Prometheus and New Relic metric compression into tool flows; updates context-window limiter to use per-call sizes; and introduces tests, an HTTP traffic generator fixture, and a Kubernetes manifest for end-to-end testing.

Changes

Cohort / File(s) Summary
Tool result model
holmes/core/tools.py
Add public field llm_data: Optional[str] = None to StructuredToolResult.
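A minimal sketch of the new field follows. The real StructuredToolResult in holmes/core/tools.py is a pydantic model with more fields; a dataclass stand-in is used here to keep the example self-contained, and the `data` field is an assumption about the existing model.

```python
# Sketch only: dataclass stand-in for the pydantic StructuredToolResult.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StructuredToolResult:
    data: Optional[str] = None      # existing raw payload field (assumed)
    llm_data: Optional[str] = None  # new: compact view shown to the LLM

result = StructuredToolResult(data="full raw payload", llm_data="compact view")
```

When `llm_data` is None, downstream formatting falls back to `data`, so existing tools are unaffected.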
Result formatting logic
holmes/core/models.py
In format_tool_result_data, when tool_result.llm_data is set and tool_result.data is a JSON string containing both "random_key" and "data" keys, parse the JSON, shallow-copy the object, replace its inner "data" value with llm_data, clear llm_data, swallow any parsing exceptions, and then proceed with the existing string/non-string handling.
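The substitution described above can be sketched as a stand-alone helper. `substitute_llm_data` is a hypothetical name, not the actual signature of format_tool_result_data:

```python
# Illustrative sketch of the llm_data substitution; assumes the "random_key"
# / "data" envelope shape described in the walkthrough.
import json
from typing import Optional

def substitute_llm_data(data: Optional[str], llm_data: Optional[str]) -> Optional[str]:
    if not llm_data or not isinstance(data, str):
        return data
    try:
        parsed = json.loads(data)
        if isinstance(parsed, dict) and "random_key" in parsed and "data" in parsed:
            replaced = dict(parsed)       # shallow copy of the envelope
            replaced["data"] = llm_data   # swap the bulky payload for the compact one
            return json.dumps(replaced)
    except Exception:
        pass  # swallow parse errors and fall through to the original data
    return data
```

Non-JSON or differently-shaped payloads pass through untouched, which keeps the change backward compatible.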
Per-tool-call token accounting
holmes/core/tool_calling_llm.py
Construct ToolCallResult, build message representation, count tokens via LLM, set ToolCallResult.size, include token_count in tool-call logs, and return the enriched result. ToolCallResult gains public size.
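The accounting flow above can be sketched as follows; `count_tokens`, the message shape, and the field set on `ToolCallResult` are illustrative assumptions, not the repository's actual API:

```python
# Sketch of per-tool-call token accounting. count_tokens is a stand-in
# for the LLM client's tokenizer.
from dataclasses import dataclass
from typing import Optional

def count_tokens(text: str) -> int:
    return len(text.split())  # toy tokenizer: whitespace words

@dataclass
class ToolCallResult:
    tool_name: str
    result: str
    size: Optional[int] = None  # new: token count of the rendered tool message

    def as_tool_call_message(self) -> str:
        return f"{self.tool_name}: {self.result}"

call = ToolCallResult("kubectl_get_pods", "pod-a Running")
call.size = count_tokens(call.as_tool_call_message())
```

Storing the count on the result means later consumers (logging, the context-window limiter) read it instead of re-tokenizing.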
Context window limiter
holmes/core/tools_utils/tool_context_window_limiter.py
Lazily compute tool_call_result.size when unset, use size for threshold/percentage calculations and error reporting, pass size to Sentry metadata, and clear size after error handling.
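A sketch of the limiter's lazy-size flow, under the same illustrative assumptions (a minimal `Call` stand-in and a caller-supplied tokenizer):

```python
# Sketch of the size-based window check with lazy initialization.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Call:  # minimal stand-in for ToolCallResult
    message: str
    size: Optional[int] = None

def within_context_window(call: Call, max_tokens: int,
                          count_tokens: Callable[[str], int]) -> bool:
    if call.size is None:
        call.size = count_tokens(call.message)  # lazily compute when unset
    if call.size > max_tokens:
        call.size = None  # clear after error handling so it is recomputed later
        return False
    return True

ok = within_context_window(Call("hello world"), 100, lambda s: len(s.split()))
```

Clearing `size` on rejection matters because error handling may truncate the result, invalidating the cached count.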
Prometheus compression module (new)
holmes/plugins/toolsets/prometheus/data_compression.py
New module with CompressedMetric, Group, PreFilteredMetrics, conversion/grouping/prefiltering, and rendering/summarization helpers (format_labels, format_data, compact_metrics, etc.).
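The core idea behind hierarchical label compression can be shown with a toy helper: labels shared by every series are factored out and emitted once instead of repeated per series. `split_shared_labels` is illustrative and not the module's actual API:

```python
# Toy illustration of label factoring for metric compression.
from typing import Dict, List, Tuple

def split_shared_labels(
    series: List[Dict[str, str]]
) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    shared = dict(series[0])
    for labels in series[1:]:
        # Keep only label pairs present with the same value in every series.
        shared = {k: v for k, v in shared.items() if labels.get(k) == v}
    remainders = [{k: v for k, v in labels.items() if k not in shared}
                  for labels in series]
    return shared, remainders

series = [
    {"job": "api", "pod": "api-1", "status": "404"},
    {"job": "api", "pod": "api-2", "status": "404"},
    {"job": "api", "pod": "api-3", "status": "500"},
]
shared, rest = split_shared_labels(series)
```

On high-cardinality Prometheus results this kind of factoring is where most of the size reduction comes from, since label sets dominate the serialized payload.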
Prometheus toolset integration
holmes/plugins/toolsets/prometheus/prometheus.py
Add config fields compress_range_metrics and compress_range_metrics_minimum_ratio; convert range results to compressed metrics, compute original vs compressed sizes and ratio; set StructuredToolResult.llm_data to compressed payload when beneficial and attach compression_info and raw_data; otherwise keep original data and produce summaries for overly-large responses.
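The "attach llm_data only when beneficial" decision can be sketched as a ratio check; the function name, the default threshold, and the shape of the info dict are assumptions for illustration:

```python
# Sketch of the beneficial-compression check gating llm_data attachment.
from typing import Optional, Tuple

def choose_payload(original: str, compressed: str,
                   minimum_ratio: float = 2.0) -> Tuple[Optional[str], Optional[dict]]:
    ratio = len(original) / max(len(compressed), 1)
    if ratio >= minimum_ratio:
        # Compression pays off: return the compact payload plus metadata.
        return compressed, {"original_size": len(original),
                            "compressed_size": len(compressed),
                            "ratio": round(ratio, 2)}
    return None, None  # keep the original data untouched
```

Gating on a minimum ratio avoids paying the indirection cost of llm_data when compression barely helps.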
New Relic metrics compression
holmes/plugins/toolsets/newrelic/newrelic.py
Add compact_metrics_data flow, change to_prometheus_records to return PromResponse, attach llm_data when compression is beneficial, and add config fields compact_metrics / compact_metrics_minimum_ratio.
Tests for compression
tests/plugins/toolsets/prometheus/test_data_compression.py, .../raw_prometheus_data.json
New unit tests for label-frequency detection, grouping, formatting, and realistic summarization/ratio assertions exercising the compression helpers.
LLM test fixture & toolset config
tests/llm/fixtures/test_ask_holmes/160_http_404_monitoring/test_case.yaml, .../toolsets.yaml
New test case describing periodic 404s, setup/teardown steps, and toolset fixture pointing Prometheus to http://localhost:9160.
Traffic generator & K8s manifest (fixtures)
tests/llm/fixtures/test_ask_holmes/160_http_404_monitoring/http_traffic_generator.py, .../manifest.yaml
New high-cardinality HTTP traffic generator module and Kubernetes manifest (namespace, secret embedding script, deployments, Service, ServiceMonitor, traffic-simulator) for end-to-end testing.
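In the spirit of that fixture, a high-cardinality 404 generator can be sketched as below; the paths, the 404 cadence, and the function name are made up for illustration:

```python
# Illustrative traffic pattern: mostly 200s over many distinct paths, with
# periodic requests to missing paths that yield 404s.
import random

def generate_requests(n: int, not_found_every: int = 5):
    """Yield (path, status) pairs; every not_found_every-th request
    targets a missing path, producing a 404."""
    for i in range(n):
        if i % not_found_every == 0:
            yield (f"/missing/{random.randint(0, 10_000)}", 404)
        else:
            yield (f"/api/item/{random.randint(0, 10_000)}", 200)

statuses = [status for _, status in generate_requests(20)]
```

Randomized path suffixes create many distinct label values when the target scrapes per-path metrics, which is exactly the cardinality the compression tests exercise.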

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Tool as PrometheusTool
  participant Prom as Prometheus API
  participant DC as DataCompression
  participant FM as Formatter
  Note over Tool,DC: Range query execution with optional compression
  User->>Tool: ExecuteRangeQuery(params)
  Tool->>Prom: HTTP range query
  Prom-->>Tool: Raw JSON result
  Tool->>DC: Convert raw -> CompressedMetric[]
  DC-->>Tool: Compressed metrics + rendered payload
  Tool->>Tool: Compute original_size, compressed_size, ratio
  alt Compression beneficial
    Tool->>FM: Use compressed payload as llm_data
    FM-->>Tool: llm_data string
    Note right of Tool: Set StructuredToolResult.llm_data
  else Not beneficial
    Note right of Tool: Keep original data, attach compression_info
  end
  opt Response too large
    Tool->>FM: create_data_summary_for_large_result(...)
    FM-->>Tool: Summary text
  end
  Tool-->>User: StructuredToolResult { data, llm_data?, compression_info? }

sequenceDiagram
  autonumber
  participant Runner
  participant Tool as Tool Runner
  participant LLM as LLM Client
  participant Log as Logger
  Note over Tool,LLM: Per-tool-call token counting
  Runner->>Tool: Invoke tool
  Tool->>Tool: Build ToolCallResult
  Tool->>LLM: as_tool_call_message -> count_tokens()
  LLM-->>Tool: token_count
  Tool->>Tool: Set ToolCallResult.size = token_count
  Tool->>Log: Log metadata { token_count: size, ... }
  Tool-->>Runner: ToolCallResult(size, ...)

sequenceDiagram
  autonumber
  participant Limiter as ContextWindowLimiter
  participant Res as ToolCallResult
  participant LLM as LLM Client
  Note over Limiter,Res: Size-based window check with lazy init
  Limiter->>Res: Read size
  alt size unset
    Limiter->>LLM: count_tokens(as_tool_call_message)
    LLM-->>Limiter: token_count
    Limiter->>Res: Set size = token_count
  end
  alt size > max_allowed
    Limiter-->>Limiter: Reject / truncate per policy
  else
    Limiter-->>Limiter: Accept
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • robusta-dev/holmesgpt#975 — Related Prometheus toolset changes and range-query handling overlap with compression/llm_data integration.
  • robusta-dev/holmesgpt#963 — Overlaps per-tool LLM token-counting and logging changes in holmes/core/tool_calling_llm.py.
  • robusta-dev/holmesgpt#919 — Touches format_tool_result_data and serialization surfaces; overlaps llm_data-aware formatting logic.

Suggested reviewers

  • moshemorad
  • aantn

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The pull request description is missing, so reviewers have no summary of changes that span multiple modules and introduce significant new functionality. Please add a description outlining the goals of this change, such as llm_data-aware result formatting, per-tool-call token tracking, and the new Prometheus/New Relic compression features.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
Title Check ❓ Inconclusive The title "ROB-2161 improved truncation methodology" is too generic to convey the scope of this pull request, which includes llm_data-aware formatting, per-call token counting, and new compression pipelines for the Prometheus and New Relic integrations. Please rename the pull request to summarize the main change, for example "Enhance LLM response truncation with JSON llm_data parsing, token counting, and compression utilities."
✨ Finishing touches
  • [ ] 📝 Generate Docstrings
🧪 Generate unit tests
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment
  • [ ] Commit unit tests in branch ROB-2161_improved_truncation_methodology

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] · Sep 19 '25 14:09

Results of HolmesGPT evals

  • ask_holmes: 33/36 test cases were successful, 1 regression, 1 setup failure
Test suite Test case Status
ask 01_how_many_pods :white_check_mark:
ask 02_what_is_wrong_with_pod :white_check_mark:
ask 04_related_k8s_events :white_check_mark:
ask 05_image_version :white_check_mark:
ask 09_crashpod :white_check_mark:
ask 10_image_pull_backoff :white_check_mark:
ask 110_k8s_events_image_pull :white_check_mark:
ask 11_init_containers :white_check_mark:
ask 13a_pending_node_selector_basic :white_check_mark:
ask 14_pending_resources :white_check_mark:
ask 15_failed_readiness_probe :white_check_mark:
ask 17_oom_kill :white_check_mark:
ask 19_detect_missing_app_details :white_check_mark:
ask 20_long_log_file_search :white_check_mark:
ask 24_misconfigured_pvc :white_check_mark:
ask 24a_misconfigured_pvc_basic :white_check_mark:
ask 28_permissions_error :construction:
ask 33_cpu_metrics_discovery :x:
ask 39_failed_toolset :white_check_mark:
ask 41_setup_argo :white_check_mark:
ask 42_dns_issues_steps_new_tools :warning:
ask 43_current_datetime_from_prompt :white_check_mark:
ask 45_fetch_deployment_logs_simple :white_check_mark:
ask 51_logs_summarize_errors :white_check_mark:
ask 53_logs_find_term :white_check_mark:
ask 54_not_truncated_when_getting_pods :white_check_mark:
ask 59_label_based_counting :white_check_mark:
ask 60_count_less_than :white_check_mark:
ask 61_exact_match_counting :white_check_mark:
ask 63_fetch_error_logs_no_errors :white_check_mark:
ask 79_configmap_mount_issue :white_check_mark:
ask 83_secret_not_found :white_check_mark:
ask 86_configmap_like_but_secret :white_check_mark:
ask 93_calling_datadog[0] :white_check_mark:
ask 93_calling_datadog[1] :white_check_mark:
ask 93_calling_datadog[2] :white_check_mark:

Legend

  • :white_check_mark: the test was successful
  • :minus: the test was skipped
  • :warning: the test failed but is known to be flaky or known to fail
  • :construction: the test had a setup failure (not a code regression)
  • :wrench: the test failed due to mock data issues (not a code regression)
  • :no_entry_sign: the test was throttled by API rate limits/overload
  • :x: the test failed and should be fixed before merging the PR

github-actions[bot] · Sep 29 '25 05:09