
Feature suggestion - Handling LLM quotas when evaluating

Open 0ENZO opened this issue 1 year ago • 7 comments

It would be nice to handle LLM quotas when evaluating a large dataset. In my case, I cannot increase the default 60 requests per minute for the VertexAI LLM.

Tracking LLM calls for the current minute within .evaluate() might sound a bit overkill. Offering the possibility to set a time.sleep() between each sample might do the trick, something like the sketch below.
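
To be concrete, here is a rough sketch of the kind of per-sample throttle I mean (the wrapper and callback names are hypothetical, not ragas API):

```python
import time

# Hypothetical wrapper, not ragas API: evaluate one sample at a time
# and sleep between calls to stay under a requests-per-minute quota.
def evaluate_with_delay(samples, evaluate_sample, delay_s=1.0):
    results = []
    for sample in samples:
        results.append(evaluate_sample(sample))
        # 1s per sample stays under 60 requests/min, assuming roughly
        # one LLM request per sample.
        time.sleep(delay_s)
    return results
```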

I don't know what you guys think. Am I the only one to encounter such a problem?

0ENZO avatar Feb 05 '24 11:02 0ENZO

That is a great idea @0ENZO! We use tenacity under the hood, and we have https://github.com/explodinggradients/ragas/blob/main/src/ragas/run_config.py to configure things like this.
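
From memory, the knobs in that file look roughly like this; check the linked source for the exact, current fields and defaults:

```python
from dataclasses import dataclass

# Paraphrase of the knobs in src/ragas/run_config.py, not a verbatim copy.
@dataclass
class RunConfig:
    timeout: int = 60      # per-call timeout, in seconds
    max_retries: int = 10  # retry attempts, handled by tenacity
    max_wait: int = 60     # cap on the retry backoff, in seconds
    max_workers: int = 16  # concurrent LLM/embedding calls
```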

I'll add a sleep option to it, and that should help you.

jjmachan avatar Feb 05 '24 18:02 jjmachan

We should add something for stats too, I guess, so you can see num_tokens, cost, performance figures, etc.

What do you think about those? Have you felt the need for them? If you had to choose only one, which would it be?

jjmachan avatar Feb 05 '24 18:02 jjmachan

Sounds good, thanks!

Regarding performance figures, num_tokens, etc., I haven't had any such need yet.

0ENZO avatar Feb 06 '24 08:02 0ENZO

Hey @0ENZO, after thinking about it a bit more, this seems like a more complicated solution to implement because of how we have things set up.

The core problem here is contention of resources. We could have fixed it in two ways:

  1. Collect all the LLM and embedding calls ragas makes and implement something like a leaky bucket, so that the number of requests per minute stays constant.
  2. Exponential backoff, as explained here. We went with this; a sketch of the idea follows the list.
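
As a sketch of option 2, here is roughly what exponential backoff with tenacity looks like (`call_llm` and `QuotaError` are illustrative stand-ins, not ragas code):

```python
import random
from tenacity import retry, stop_after_attempt, wait_random_exponential

class QuotaError(Exception):
    """Stand-in for the provider's rate-limit error (e.g. a 429)."""

@retry(
    # wait ~1s, 2s, 4s, ... between retries, capped at 60s, with jitter
    wait=wait_random_exponential(multiplier=1, max=60),
    # give up after 10 attempts instead of retrying forever
    stop=stop_after_attempt(10),
)
def call_llm(prompt: str) -> str:
    # stand-in for the real LLM call; fails half the time to show retries
    if random.random() < 0.5:
        raise QuotaError("rate limit hit")
    return "response to: " + prompt

print(call_llm("hello"))
```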

So the solution today is configuring the exponential backoff to stay within 60 requests per minute. Right now I don't have a good formula for that, but that is something we could find, right?

So the solution for your problem today is configuring the RunConfig with the correct max_retries and max_wait (and maybe a couple of other knobs, I'll look into that). What do you think?
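
For example, something roughly like this; the values are guesses for a 60 requests/min quota, so tune them for your setup:

```python
from ragas import evaluate
from ragas.run_config import RunConfig

run_config = RunConfig(
    max_workers=4,   # fewer parallel calls -> fewer quota hits
    max_retries=10,  # keep retrying rate-limited calls
    max_wait=60,     # back off up to a full minute between retries
)

# `dataset` and `metrics` are whatever you already pass to evaluate().
result = evaluate(dataset, metrics=metrics, run_config=run_config)
```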

jjmachan avatar Feb 06 '24 20:02 jjmachan

Also, I'm doing some experiments so that I can get you unblocked without much hassle.

jjmachan avatar Feb 06 '24 23:02 jjmachan

Do you have a suggestion that I could implement now? I am exceeding my Azure GPT-4 rate limit of 80k tokens per minute when evaluating 48 questions/answers with all metrics. Is there a way to rate-limit the evaluation? Perhaps I should pull out some metrics?
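
As a stopgap, I am considering something like the sketch below; the chunk size and pause are guesses for an 80k tokens/min quota, and `dataset`/`metrics` come from my existing script:

```python
import time
from ragas import evaluate

# Stopgap sketch: evaluate in small chunks and pause between them so
# each chunk stays under the tokens-per-minute quota.
def evaluate_in_chunks(dataset, metrics, chunk_size=8, pause_s=60):
    results = []
    for start in range(0, len(dataset), chunk_size):
        stop = min(start + chunk_size, len(dataset))
        chunk = dataset.select(range(start, stop))  # datasets.Dataset.select
        results.append(evaluate(chunk, metrics=metrics))
        time.sleep(pause_s)  # wait for the quota window to reset
    return results
```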

klangst-ETR avatar Feb 12 '24 18:02 klangst-ETR

@jjmachan, any suggestions on how we should set the run config? I am also facing this issue with ragas 0.1.7.

bdeck8317 avatar Apr 10 '24 02:04 bdeck8317

> We should add something for stats too, I guess, so you can see num_tokens, cost, performance figures, etc.
>
> What do you think about those? Have you felt the need for them? If you had to choose only one, which would it be?

May I ask if there is any plan for this part?

xiaochaohit avatar Jul 11 '24 09:07 xiaochaohit

This will be fixed with #1156. For documentation on run_config, and for how to figure out cost, check out Understand Cost and Usage of Operations | Ragas.
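
From memory, the cost-tracking pattern from that page looks roughly like this (double-check the doc for the exact API and your model's pricing):

```python
from ragas import evaluate
from ragas.cost import get_token_usage_for_openai

# Pass a token-usage parser so ragas records tokens for each LLM call.
# `dataset`, `metrics`, and `llm` come from your existing setup.
result = evaluate(
    dataset,
    metrics=metrics,
    llm=llm,
    token_usage_parser=get_token_usage_for_openai,
)

print(result.total_tokens())
# Plug in your model's per-token prices to get a dollar figure:
print(result.total_cost(cost_per_input_token=5 / 1e6,
                        cost_per_output_token=15 / 1e6))
```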

Hope this helps @xiaochaohit @bdeck8317 @klangst-ETR!

jjmachan avatar Aug 02 '24 06:08 jjmachan