OpenHands
(feat) Configure fallback LLMs in case of rate limit errors
What problem or use case are you trying to solve?
If a rate limit is hit, the agent just gets stuck in a loop hammering the API.
Describe the UX of the solution you'd like
Provide a temporary modal popup that says the rate limit has been hit and offers an option to automatically switch to the next API in the list. The popup should go away automatically when the rate limit expires.
Additional context
opendevin:ERROR: agent_controller.py:110 - GroqException - Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama3-8b-8192` in organization `org_xxx` on tokens per minute (TPM): Limit 7500, Used 16673, Requested ~3658. Please try again in 1m42.653s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}
There's more here too, we can keep track of tokens and not waste expensive models on cheap operations. Will open another issue for that.
Also, it shouldn't treat each error as a new step; i.e. it needs to distinguish between errors from calling the completion API itself and errors resulting from the code being worked on.
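For illustration, here is a minimal sketch of the fallback behavior this issue asks for, assuming litellm's generic completion() call and its RateLimitError exception. The fallback list and model names are hypothetical, not an existing OpenHands setting:

```python
import litellm
from litellm.exceptions import RateLimitError

# Hypothetical ordered fallback list; not an existing OpenHands config option.
FALLBACK_MODELS = [
    "groq/llama3-8b-8192",
    "claude-3-haiku-20240307",
    "gpt-4o-mini",
]

def completion_with_fallbacks(messages):
    """Try each model in order, moving to the next one on a rate-limit error."""
    last_error = None
    for model in FALLBACK_MODELS:
        try:
            return litellm.completion(model=model, messages=messages)
        except RateLimitError as e:
            # Rate limit hit: remember the error and fall through to the next model.
            last_error = e
    # Every configured model was rate limited; surface the last error to the caller.
    raise last_error

# Usage example:
# response = completion_with_fallbacks([{"role": "user", "content": "Hello"}])
```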
I've also hit this recently with the Anthropic API. They have requests/minute, tokens/minute, and requests/day limits, and OpenDevin hit the tokens/minute limit within a minute. Since the rate is known in this case, perhaps we can support configuring these limits and sleeping on the API request somewhere until the rate is respected...
@barsuna Currently you can configure a few options... whose documentation I cannot find anymore, maybe it got lost somehow, will fix. They are, in config.toml:
...
[llm]
num_retries=5
retry_min_wait=3
retry_max_wait=60
You can add them to the config.toml file and tweak them as you want. The minimum and maximum wait are in seconds; they control how long to wait once it hits the rate limit. You may want to set the min wait relatively high, unlike the default of 3 seconds, for example.
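To make the semantics of those three values concrete, here is a rough tenacity-based sketch of retrying with exponential backoff on rate-limit errors. It only illustrates what the options mean; it is not the actual OpenDevin retry code:

```python
import litellm
from litellm.exceptions import RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Values mirroring the config.toml options above.
NUM_RETRIES = 5
RETRY_MIN_WAIT = 3   # seconds
RETRY_MAX_WAIT = 60  # seconds

@retry(
    retry=retry_if_exception_type(RateLimitError),
    stop=stop_after_attempt(NUM_RETRIES),
    # Exponential backoff bounded between the min and max wait.
    wait=wait_exponential(min=RETRY_MIN_WAIT, max=RETRY_MAX_WAIT),
    reraise=True,
)
def completion_with_retries(model, messages):
    return litellm.completion(model=model, messages=messages)
```

Raising the minimum wait (e.g. to 20 seconds) simply means the first retry already waits long enough for a per-minute window to start clearing.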
Thanks @enyst! I haven't found a way with litellm to shape outgoing call rates (maybe there is a way to make it act smartly on code 429 / 529?), so I have prototyped an external proxy that shapes the calls to respect the rate limits. Rate limiting is not an issue anymore, but the title says 'elegantly', so I guess that is still to be addressed - two proxies are difficult to call elegant.
INFO: 172.17.0.2:43720 - "POST /api/generate HTTP/1.1" 200 OK
2024-05-15T22:52:04.260810 waiting due to excessive TPM (20200)
2024-05-15T22:52:08.262181 excessive TPM cleared
2024-05-15T22:52:08.262292 calling https://api.anthropic.com/v1/messages
2024-05-15T22:52:40.647955 response code 200
tokens_in/out: 2430/924
INFO: 172.17.0.2:36076 - "POST /api/generate HTTP/1.1" 200 OK
2024-05-15T22:52:40.670185 calling https://api.anthropic.com/v1/messages
2024-05-15T22:52:48.975432 response code 200
As usual, this just revealed the next problem - tokens burn really fast. I need to get back to tuning some local model to produce something resembling claude3/gpt4.
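For reference, a minimal sketch of the kind of tokens-per-minute shaping the log above shows: keep a sliding one-minute window of token usage and sleep before a call would push it over the budget. The 20,000 TPM budget and the token estimate are assumptions for illustration, not the actual proxy:

```python
import time
from collections import deque

TPM_LIMIT = 20_000  # assumed tokens-per-minute budget for illustration

# (timestamp, tokens) pairs for calls made in the last 60 seconds
_window: deque[tuple[float, int]] = deque()

def _tokens_in_window(now: float) -> int:
    # Drop entries older than one minute, then sum what is left.
    while _window and now - _window[0][0] > 60:
        _window.popleft()
    return sum(tokens for _, tokens in _window)

def wait_for_budget(estimated_tokens: int) -> None:
    """Sleep until the sliding-window token count leaves room for this call."""
    while _tokens_in_window(time.time()) + estimated_tokens > TPM_LIMIT:
        print(f"waiting due to excessive TPM ({_tokens_in_window(time.time())})")
        time.sleep(1)
    _window.append((time.time(), estimated_tokens))
```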
@barsuna Interesting!
> maybe there is a way to make it act smartly on code 429
That's what it's doing, and it isn't litellm, it's in opendevin. We have those configuration options you can use. On 429 it starts waiting, and you can set for how long. The default is a minimum of 3 seconds, which isn't useful with Anthropic; it's too low. I found that values like
retry_min_wait=20
num_retries=15
are better. It literally makes Claude usable at relatively low tiers, otherwise I couldn't get anything done.
got it now, thanks @enyst
For me, it crashed on the Message Token Limit Error. The tool should handle this response from OpenAI so it doesn't crash and just shows a warning that the request exceeded the limit.
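A small sketch of the kind of handling being asked for, assuming litellm's ContextWindowExceededError is what surfaces when the request exceeds the model's token limit; the logger name is arbitrary and this is not the fix that was eventually merged:

```python
import logging

import litellm
from litellm.exceptions import ContextWindowExceededError

logger = logging.getLogger("llm")

def safe_completion(model, messages):
    try:
        return litellm.completion(model=model, messages=messages)
    except ContextWindowExceededError as e:
        # Warn instead of crashing the agent when the request is over the token limit.
        logger.warning("Request exceeded the model's token limit: %s", e)
        return None
```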
> @barsuna Currently you can configure a few options... They are, in config.toml:
>
> [llm]
> num_retries=5
> retry_min_wait=3
> retry_max_wait=60
>
> You can add them to the config.toml file and tweak them as you want. The minimum and maximum wait are in seconds; they control how long to wait once it hits the rate limit.
Any idea if these can be modified when running the docker container? Or would I just need to include a config.toml file in my workspace directory?
Apologies for the lag @krism142, I haven't gone this route to set rate limits, but according to https://docs.all-hands.dev/modules/usage/llms one could set them using environment variables, which can be passed when starting the container - that should help avoid persistence issues or the need to patch containers.
Yes, you can set them as environment variables, using -e just like the others are set in the docker command. It just needs the uppercase variant then, like this: -e LLM_NUM_RETRIES=15 and -e LLM_RETRY_MIN_WAIT=20.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Just for info: yesterday's merge of #3678 fixes the above-mentioned issue reported by @0xWick, so that upon an error the agent no longer becomes broken (unrelated to the retry/timing values here).
This should be fixed now by #3729, closing this for now.
Oops, I overlooked the detail about specifying fallback LLMs for this. Reopened this. :)