
Chat generators or Agent should work intelligently around rate limits

Open mathislucka opened this issue 8 months ago • 5 comments

Is your feature request related to a problem? Please describe. At the hackathon today we ran into quite a few rate limit issues with the OpenAI and Anthropic APIs. The main problem is that the number of tokens exceeds the per-minute input-token rate limit: because agents may make many tool calls per minute, input tokens accumulate quickly.

Describe the solution you'd like We subclassed the AnthropicChatGenerator and overrode the run method so that calls to Anthropic would be retried after a 60-second wait whenever a rate limit error occurred.

This worked, but I can imagine more sophisticated approaches where users specify rate limits for the Agent, which would then wait whenever a request is about to hit the limit. The chat generators would also benefit from simple retry mechanisms.
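For reference, a minimal sketch of such a workaround, assuming the anthropic SDK's RateLimitError and the usual integration import path (the class name and retry cap are illustrative, not the exact code from the hackathon):

import time

import anthropic
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator

class RetryingAnthropicChatGenerator(AnthropicChatGenerator):
    # Retry rate-limited calls after a fixed wait instead of failing immediately.
    def run(self, *args, **kwargs):
        max_attempts = 5  # illustrative cap
        for attempt in range(max_attempts):
            try:
                return super().run(*args, **kwargs)
            except anthropic.RateLimitError:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(60)  # wait out the per-minute input-token window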


mathislucka avatar Apr 24 '25 17:04 mathislucka

Just to provide some context: depending on the provider, it is already possible to configure retries. For example, with the OpenAIChatGenerator:

from haystack.components.generators.chat.openai import OpenAIChatGenerator

gen = OpenAIChatGenerator(max_retries=5)

This handles retries, including backoff.

Similarly, the AnthropicChatGenerator could support this too, but we would need to expose the max_retries parameter of the underlying Anthropic client, which appears to default to 2.
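For context, the underlying anthropic client already accepts this parameter directly, so exposing it is mostly a matter of forwarding it through, something along these lines:

import anthropic

# The client retries with backoff on its own; max_retries defaults to 2.
client = anthropic.Anthropic(max_retries=5)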

sjrl avatar Apr 25 '25 08:04 sjrl

As a follow-up, we have added this feature to AnthropicChatGenerator in this PR: https://github.com/deepset-ai/haystack-core-integrations/pull/1952

So the remaining task is to check whether our other Chat Generators also expose parameters like max_retries and timeout.

sjrl avatar Sep 15 '25 10:09 sjrl

I ran into very similar rate limit issues while building itinerary-agent. When I had just one main orchestration agent doing everything, the chat history grew very fast and each call burned a lot of tokens for very little actual value; filling the context with useless tokens actually made things worse.

Switching to a subagent design helped a lot. Subagents branch off with only the needed context already prepared, discarding the full chat history of the parent orchestrator. This alone reduced token usage massively and basically stopped the rate limit errors. It could also be useful to have something like an agent_history_depth parameter so chat generators do not always include the entire history by default (see the sketch below).

Anyway, I do not think there is one magic knob you can turn to solve this. It is probably better to collect and share a few best practices: using subagents, trimming context, limiting history depth, etc. Thoughts @sjrl?
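To make the agent_history_depth idea concrete, here is a rough sketch of the kind of trimming I have in mind; trim_history and its default depth are hypothetical, not an existing Haystack API:

from haystack.dataclasses import ChatMessage, ChatRole

def trim_history(messages: list[ChatMessage], depth: int = 6) -> list[ChatMessage]:
    # Keep system messages, but only the last `depth` conversation turns,
    # so each call stops resending the entire accumulated history.
    system = [m for m in messages if m.is_from(ChatRole.SYSTEM)]
    rest = [m for m in messages if not m.is_from(ChatRole.SYSTEM)]
    return system + rest[-depth:]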

vblagoje avatar Sep 16 '25 12:09 vblagoje

Probably better to collect and share a few best practices like using subagents, trimming context, limiting history depth etc.

I think this sounds like a good idea! Some other things we could do:

  • Check that all Chat Generators expose max_retries and timeout so that we can help users navigate rate limit errors even when context windows are managed effectively.
  • Completing https://github.com/deepset-ai/haystack/issues/9786 would also help by allowing users to fall back to other providers when an issue arises (a rough sketch follows this list).
  • And, as you say, provide best practices around agent design and ways to limit history length, context window size, etc.
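On the fallback point, one possible shape (purely illustrative; the linked issue may settle on a different design) would be a small wrapper that tries generators in order:

from haystack.dataclasses import ChatMessage

class FallbackChatGenerator:
    # Illustrative wrapper: call each generator in order, moving on when one fails.
    # In practice you would catch only provider-specific rate limit errors.
    def __init__(self, generators):
        self.generators = generators

    def run(self, messages: list[ChatMessage]):
        last_error = None
        for gen in self.generators:
            try:
                return gen.run(messages=messages)
            except Exception as exc:
                last_error = exc
        raise last_error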

sjrl avatar Sep 16 '25 13:09 sjrl

Exactly, @sjrl, and I'd just add that we should lean more into defining and using subagents in our docs. What we described above is referred to as "context preservation" in the Claude docs on subagents, and it's a well-documented pattern worth adopting. That Claude doc page covers all of these aspects in detail :-)

vblagoje avatar Sep 22 '25 07:09 vblagoje