
Add parallel processing for OpenAI completion models

Open pbevan1 opened this issue 1 year ago • 5 comments


Implements https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py to speed up OpenAI API calls as per #1410

  • Makes requests concurrently, to maximize throughput
  • Throttles request and token usage, to stay under rate limits (max_tokens_per_minute, max_requests_per_minute)
  • Retries failed requests up to {max_attempts} times, to avoid missing data
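For context, here is a rough sketch of the cookbook's throttle-and-retry pattern (simplified and illustrative; the function and parameter names here are not taken from the PR or the cookbook script):

```python
import asyncio
import time

import aiohttp


async def process_requests(
    payloads: list[dict],
    url: str = "https://api.openai.com/v1/chat/completions",
    api_key: str = "",
    max_requests_per_minute: float = 500,
    max_tokens_per_minute: float = 30_000,
    max_attempts: int = 5,
) -> list[dict]:
    """Send requests concurrently while staying under RPM/TPM rate limits."""
    headers = {"Authorization": f"Bearer {api_key}"}
    # available capacity, replenished continuously in proportion to elapsed time
    request_capacity = max_requests_per_minute
    token_capacity = max_tokens_per_minute
    last_update = time.monotonic()

    async with aiohttp.ClientSession(headers=headers) as session:

        async def send(payload: dict) -> dict:
            # retry failed requests with exponential backoff, up to max_attempts
            for attempt in range(max_attempts):
                async with session.post(url, json=payload) as resp:
                    if resp.status == 200:
                        return await resp.json()
                await asyncio.sleep(2**attempt)
            return {"error": f"failed after {max_attempts} attempts"}

        tasks = []
        for payload in payloads:
            needed_tokens = payload.get("max_tokens", 1)
            while True:
                # top up request/token budgets based on time elapsed
                now = time.monotonic()
                elapsed = now - last_update
                last_update = now
                request_capacity = min(
                    request_capacity + max_requests_per_minute * elapsed / 60,
                    max_requests_per_minute,
                )
                token_capacity = min(
                    token_capacity + max_tokens_per_minute * elapsed / 60,
                    max_tokens_per_minute,
                )
                if request_capacity >= 1 and token_capacity >= needed_tokens:
                    # consume capacity and dispatch the request
                    request_capacity -= 1
                    token_capacity -= needed_tokens
                    break
                await asyncio.sleep(0.05)
            tasks.append(asyncio.create_task(send(payload)))

        return await asyncio.gather(*tasks)
```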

Gives a many-fold speedup for OpenAI models (dependent on the user's rate limits).

Uses a 1-token dummy request to get a user- and model-specific tokens-per-minute (TPM) rate limit. Requests per minute (RPM) are not available programmatically for some reason, but I've raised this with OpenAI. Both TPM and RPM can be overridden in the gen_kwargs using max_tokens_per_minute and max_requests_per_minute. Not sure if/where that should be documented?
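Roughly what the dummy request does (a minimal sketch, not the PR's actual code; the header name follows OpenAI's documented x-ratelimit-* response headers):

```python
import os

import requests


def fetch_tokens_per_minute(model: str = "gpt-3.5-turbo") -> int:
    """Send a 1-token chat completion and read the TPM limit from the headers."""
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": "hi"}],
            "max_tokens": 1,
        },
        timeout=30,
    )
    resp.raise_for_status()
    # x-ratelimit-limit-tokens is the account- and model-specific TPM ceiling
    return int(resp.headers["x-ratelimit-limit-tokens"])
```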

I've only implemented this for OpenAI chat completions, since the completions endpoint is now legacy, but it would be easy to do the same for completions. I've also separated the local model call from the OpenAI model call, as I'm not sure people would want async/parallel requests for a local model?

Also, following the OpenAI example, I cache requests to a JSONL file. I cache in a temp directory and clean up afterwards, but please let me know if it would be better to just do it in memory (I'm not certain of the size of all of the evaluations).
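Roughly the caching/cleanup pattern I mean (a minimal sketch, not the exact code in this PR):

```python
import json
import os
import tempfile

# example payloads only
payloads = [
    {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hello"}]},
]

fd, path = tempfile.mkstemp(suffix=".jsonl")
try:
    # write one request per line to the temporary JSONL cache
    with os.fdopen(fd, "w") as f:
        for payload in payloads:
            f.write(json.dumps(payload) + "\n")
    # ... run the parallel processor over the cached requests in `path` ...
finally:
    os.remove(path)  # clean up the temporary cache when done
```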

pbevan1 avatar Feb 22 '24 19:02 pbevan1

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Feb 22 '24 19:02 CLAassistant

Thanks very much for this PR! I will try to review it as soon as I can.

haileyschoelkopf avatar Feb 26 '24 23:02 haileyschoelkopf

Hi @pbevan1, this PR looks quite good to me. I've left some comments that you can easily address, plus a few general thoughts below. @haileyschoelkopf, please feel free to chip in.

  1. Can you add some unit tests for your core logic?
  2. Move constants to a constants file and import them from there (I've added comments about this).
  3. Your logic functions are big. Is it possible to break them down into smaller functions?

> I cache in a temp directory and clean up afterwards, but please let me know if it would be better to just do it in memory

AFAIK, lm-eval-harness does not support any parallelization. Is it possible for you to do a stress test or an evaluation of both approaches and post the numbers here? It would be a great exercise and would help us make an informed decision about how we want to proceed with supporting this kind of operation for other LM models as well. (Again, @haileyschoelkopf, please feel free to correct me here, I might be wrong.)

I believe adding tests should be a priority here.

sanchit-ahuja avatar Mar 03 '24 08:03 sanchit-ahuja

Any plans on resolving conflicts, adding changes/updates and merging before the new release?

LSinev avatar Jun 13 '24 08:06 LSinev

Bump - this would be a really useful feature that would massively speed up eval time

Peter-Devine avatar Jun 28 '24 02:06 Peter-Devine

Is this still needed, given that https://github.com/EleutherAI/lm-evaluation-harness/issues/1095 is closed?

LSinev avatar Aug 20 '24 09:08 LSinev

@baberabb -- are there some indicative speedup numbers via using batching and/or concurrency w/ #2008 that can be shared here before we close?

haileyschoelkopf avatar Aug 20 '24 14:08 haileyschoelkopf

> @baberabb -- are there some indicative speedup numbers via using batching and/or concurrency w/ #2008 that can be shared here before we close?

Yeah. For example, with defaults I get 0:23 for 40 requests to OpenAI and 0:02 with num_concurrent=10. num_concurrent controls the number of parallel I/O connections (it uses aiohttp under the hood).
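For reference, a hedged example of passing num_concurrent via the Python API (model and task names are placeholders; exact arguments may differ between harness versions):

```python
import lm_eval

# num_concurrent=10 opens 10 parallel connections to the API
results = lm_eval.simple_evaluate(
    model="openai-chat-completions",
    model_args="model=gpt-4o-mini,num_concurrent=10",
    tasks=["gsm8k"],
    limit=40,  # roughly matches the 40-request comparison above
)
```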

baberabb avatar Aug 21 '24 11:08 baberabb