Improve the algorithm for truncating tools
Holmes currently truncates too aggressively (code here) by trying to be ‘fair’ and giving each tool the same output budget, even if some tools don’t need it. As a result, tools with large outputs are truncated more than necessary, while the budget left over by tools with small outputs goes unused.
One idea that came up from our team:
Can we have Holmes call the LLM (perhaps a cheaper model) to analyze/summarize the larger outputs, i.e. intelligently truncate the outputs as a first step? The summarized data would then be appended to the user prompt + runbook to produce the response.
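A minimal sketch of what that first pass could look like, assuming an OpenAI-style client and tiktoken for token counting (all names and thresholds here are hypothetical, not existing Holmes code):

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
SUMMARY_MODEL = "gpt-4o-mini"   # cheaper model used only for summarization
SUMMARY_THRESHOLD = 4_000       # hypothetical cutoff in tokens
enc = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def summarize_tool_output(
    tool_name: str, output: str, question: str, model: str = SUMMARY_MODEL
) -> str:
    """Pass small outputs through unchanged; summarize large ones with the cheap model."""
    if count_tokens(output) <= SUMMARY_THRESHOLD:
        return output
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "Summarize this tool output, keeping every detail relevant "
                    "to the user's question. Preserve exact identifiers, error "
                    "messages, and timestamps."
                ),
            },
            {
                "role": "user",
                "content": f"Question: {question}\n\nOutput of {tool_name}:\n{output}",
            },
        ],
    )
    return resp.choices[0].message.content
```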
As discussed today:
- For tool calls with large outputs (> x tokens), do a first pass of summarization by calling a smaller LLM model
- Data summarization happens on a per-tool basis; individual tools can enable it when their output size exceeds some threshold
- Specify which LLM model to use for summarization (e.g. gpt-4o-mini)
- Append the summarized data to the final LLM call to generate the diagnosis
The main tradeoff with summarizing data beforehand is added latency: each large tool output costs an extra LLM round trip before the final call.
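Putting the list together, a hypothetical wiring could look like the following. The result objects, `truncate()`, and the config shape are all made up for illustration; `count_tokens` and `summarize_tool_output` come from the sketch above:

```python
from dataclasses import dataclass

@dataclass
class ToolSummaryConfig:
    enabled: bool = False            # tools opt in to summarization explicitly
    threshold_tokens: int = 4_000    # only summarize above this size
    model: str = "gpt-4o-mini"       # cheaper model for the first pass

def prepare_tool_outputs(results, configs: dict, question: str) -> str:
    """Summarize large outputs from opted-in tools; truncate everything else."""
    sections = []
    for r in results:  # each result is assumed to have .tool_name and .output
        cfg = configs.get(r.tool_name, ToolSummaryConfig())
        if cfg.enabled and count_tokens(r.output) > cfg.threshold_tokens:
            text = summarize_tool_output(r.tool_name, r.output, question, cfg.model)
        else:
            text = truncate(r.output)  # fall back to the existing truncation path
        sections.append(f"### {r.tool_name}\n{text}")
    # The prepared (summarized or truncated) outputs are appended to the
    # final LLM call that generates the diagnosis.
    return "\n\n".join(sections)
```

Since each large output would be summarized independently, the extra calls could in principle run concurrently, keeping the added latency closer to one round trip rather than one per large output.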