lm-evaluation-harness
Add parallel processing for OpenAI completion models
Implements https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py to speed up OpenAI API calls, as per #1410:
- Makes requests concurrently, to maximize throughput
- Throttles request and token usage, to stay under rate limits (max_tokens_per_minute, max_requests_per_minute)
- Retries failed requests up to {max_attempts} times, to avoid missing data
Gives a many-fold speedup for OpenAI models (dependent on the user's rate limits).
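For context, here is a minimal sketch of the kind of throttled, retried, concurrent request loop the cookbook script describes. The capacity constants, the Bucket class, and the helper names are illustrative assumptions, not the PR's actual code:

```python
# Illustrative sketch only, not the PR's implementation. Assumes the official
# openai Python client (>= 1.0); the limits and helper names are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

MAX_REQUESTS_PER_MINUTE = 500    # assumed RPM limit (overridable in the PR)
MAX_TOKENS_PER_MINUTE = 90_000   # assumed TPM limit (overridable in the PR)
MAX_ATTEMPTS = 5

client = AsyncOpenAI()


class Bucket:
    """Leaky-bucket throttle: budget refills continuously up to a per-minute cap."""

    def __init__(self, per_minute: float):
        self.per_minute = per_minute
        self.available = per_minute
        self.last = time.monotonic()

    async def consume(self, amount: float) -> None:
        while True:
            now = time.monotonic()
            self.available = min(
                self.per_minute,
                self.available + (now - self.last) * self.per_minute / 60,
            )
            self.last = now
            if self.available >= amount:
                self.available -= amount
                return
            await asyncio.sleep(0.1)  # not enough budget yet; yield and re-check


request_bucket = Bucket(MAX_REQUESTS_PER_MINUTE)
token_bucket = Bucket(MAX_TOKENS_PER_MINUTE)


async def call_one(messages, max_tokens: int):
    for attempt in range(MAX_ATTEMPTS):
        # Block until both the request and token budgets have room.
        await request_bucket.consume(1)
        await token_bucket.consume(max_tokens)
        try:
            return await client.chat.completions.create(
                model="gpt-3.5-turbo",  # placeholder model name
                messages=messages,
                max_tokens=max_tokens,
            )
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                raise
            await asyncio.sleep(2**attempt)  # exponential backoff before retrying


async def run_all(all_messages, max_tokens: int = 64):
    # Fire everything concurrently; the buckets keep usage under the limits.
    return await asyncio.gather(*(call_one(m, max_tokens) for m in all_messages))
```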
Uses a 1-token dummy request to get a user- and model-specific Tokens Per Minute (TPM) rate limit. Requests Per Minute (RPM) are not available programmatically for some reason, but I've raised this with OpenAI. Both RPM and TPM can be overridden in the gen_kwargs using max_tokens_per_minute and max_requests_per_minute. Not sure if/where that should be documented?
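For reference, one way such a probe could look with the current openai client, assuming the documented x-ratelimit-* response headers are present; this is not necessarily what the PR does:

```python
# Sketch of a 1-token dummy request used to discover the account/model TPM
# limit. Assumes openai >= 1.0 and the documented x-ratelimit-* headers.
from openai import OpenAI

client = OpenAI()

raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": "hi"}],
    max_tokens=1,
)
tokens_per_minute = int(raw.headers.get("x-ratelimit-limit-tokens", 0))
# x-ratelimit-limit-requests may not be exposed, hence the RPM override kwarg.
```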
This is implemented only for OpenAI chat completions, since the completions endpoint is now legacy, but it would be easy to do the same for completions. I've also separated the local model call from the OpenAI model call, as I'm not sure people would want async/parallel requests for a local model?
Also, following the OpenAI example, I cache requests to a JSONL file. I cache in a temp directory and clean up afterwards, but please let me know if it would be better to just do it in memory (I'm not certain of the size of all of the evaluations).
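For what it's worth, a minimal sketch of the temp-file JSONL caching pattern being described; the file name and the results structure are made up for illustration:

```python
# Sketch of caching responses to a JSONL file in a temp directory and cleaning
# up afterwards, in the spirit of the cookbook script. Names are illustrative.
import json
import tempfile
from pathlib import Path

# Hypothetical (request, response) pairs produced by the parallel runner.
results = [({"prompt": "2+2="}, {"text": "4"})]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "openai_responses.jsonl"
    # Append one JSON object per line as results come in.
    with cache_path.open("a") as f:
        for request, response in results:
            f.write(json.dumps({"request": request, "response": response}) + "\n")
    # Read everything back when assembling the final eval outputs.
    cached = [json.loads(line) for line in cache_path.open()]
# TemporaryDirectory removes the file automatically on exit.
```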
Thanks very much for this PR! I will try to review it as soon as I can.
Hi @pbevan1, this PR looks quite good to me. I've left some comments that you can easily address, plus a few general thoughts below. @haileyschoelkopf please feel free to chip in.
- Can you add some unit tests for your core logic? (A sketch of one possible test follows this list.)
- Move the constants to a constants file and import them from there. (I've added comments about this.)
- Your logic functions are big. Is it possible to break them down into smaller functions?
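To illustrate what such a unit test could look like, here is a hypothetical pytest (using pytest-asyncio) against a rate-limiter class like the Bucket sketched earlier; the import path and class name are assumptions, not the PR's actual API:

```python
# Hypothetical test for a leaky-bucket throttle; Bucket and its import path
# are assumed names, not the PR's actual API. Requires pytest-asyncio.
import asyncio

import pytest


@pytest.mark.asyncio
async def test_bucket_blocks_when_budget_exhausted():
    from lm_eval.models.openai_completions import Bucket  # assumed location

    bucket = Bucket(per_minute=60)  # refills at roughly 1 unit per second
    await bucket.consume(60)        # drain the whole budget instantly
    # The next consume needs ~30 s of refill, so it must not finish quickly.
    with pytest.raises(asyncio.TimeoutError):
        await asyncio.wait_for(bucket.consume(30), timeout=0.2)
```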
> I cache in temp and do clean up after, but please let me know if it would be better just doing it in memory
AFAIK, the lm-eval-harness does not support any parallelization. Is it possible for you to do a stress test or an evaluation of both approaches and put the numbers up here? It would be a great exercise and will help us make an informed decision about how we want to proceed with supporting these kinds of operations for other LM models as well. (Again, @haileyschoelkopf please feel free to correct me here, I might be wrong.)
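One rough way to put numbers up would be timing the same small run through the Python API with and without the new kwargs. A sketch, where the task, model, and kwargs are placeholders and simple_evaluate's exact parameters should be checked against the installed version:

```python
# Rough timing harness for comparing the serial vs. parallel request paths.
# Task, model, and gen_kwargs values below are placeholders.
import time

import lm_eval


def timed_run(gen_kwargs=None) -> float:
    start = time.perf_counter()
    lm_eval.simple_evaluate(
        model="openai-chat-completions",
        model_args="model=gpt-3.5-turbo",
        tasks=["gsm8k"],
        limit=40,
        gen_kwargs=gen_kwargs,
    )
    return time.perf_counter() - start


serial = timed_run()                                  # current behaviour
parallel = timed_run("max_requests_per_minute=3500")  # PR's parallel path
print(f"serial: {serial:.1f}s   parallel: {parallel:.1f}s")
```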
I believe adding tests should be a priority here.
Any plans to resolve the conflicts, add the requested changes/updates, and merge before the new release?
Bump - this would be a really useful feature that would massively speed up eval time
Is this still needed, now that https://github.com/EleutherAI/lm-evaluation-harness/issues/1095 is closed?
@baberabb -- are there some indicative speedup numbers via using batching and/or concurrency w/ #2008 that can be shared here before we close?
> @baberabb -- are there some indicative speedup numbers via using batching and/or concurrency w/ #2008 that can be shared here before we close?
Yeah. For example, with the defaults I get 0:23 for 40 requests to OpenAI and 0:02 with num_concurrent=10. num_concurrent controls the number of parallel I/O connections (it uses aiohttp under the hood).
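Conceptually, num_concurrent just bounds the number of in-flight requests, roughly like this sketch; the endpoint, payloads, and auth handling are placeholders, not lm-eval's actual request code:

```python
# Conceptual sketch of bounding concurrency with a semaphore over aiohttp;
# endpoint, payload, and auth handling are placeholders.
import asyncio

import aiohttp

NUM_CONCURRENT = 10


async def fetch(session, semaphore, payload):
    async with semaphore:  # at most NUM_CONCURRENT requests in flight
        async with session.post(
            "https://api.openai.com/v1/chat/completions",  # auth headers omitted
            json=payload,
        ) as resp:
            return await resp.json()


async def run(payloads):
    semaphore = asyncio.Semaphore(NUM_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch(session, semaphore, p) for p in payloads)
        )
```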