feature request: proactive client-side rate limiting
Confirm this is a feature request for the Python library and not the underlying OpenAI API.
- [X] This is a feature request for the Python library
Describe the feature or improvement you're requesting
When making batch requests against an OpenAI model through LangChain, as shown in this minimal repro, it is common to hit the organization's tokens-per-minute (TPM) rate limit, as demonstrated in this error log.
While limiting batch concurrency and adding exponential backoff downstream in LangChain can reduce the problem, I believe there is also room for the OpenAI#request function in this library to handle parallel invocations more intelligently, so that batch requests are better supported regardless of whether this library, LangChain or another codebase is responsible for initiating them.
In particular, I would suggest that the SyncAPIClient maintain one or more queues of requests and decide when enqueued requests can be dispatched based on the x-ratelimit-* and retry-after headers of earlier responses. A rough sketch of the idea follows.
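Purely to illustrate the direction rather than propose a concrete API, here is a minimal sketch of that kind of pacing, built on with_raw_response so the rate-limit headers are visible on the client side. The PacedClient wrapper, the min_remaining_tokens threshold and the _parse_duration helper are assumptions of mine for this issue, not existing library code; the header names are OpenAI's documented rate-limit response headers.

```python
import threading
import time

from openai import OpenAI

# Illustrative sketch only: PacedClient, min_remaining_tokens and
# _parse_duration are assumptions for this issue, not part of this library.


def _parse_duration(value: str) -> float:
    """Best-effort parse of durations like '6m0s', '1.5s' or '250ms' into seconds."""
    total, num, i = 0.0, "", 0
    while i < len(value):
        ch = value[i]
        if ch.isdigit() or ch == ".":
            num += ch
            i += 1
        elif value.startswith("ms", i):
            total, num, i = total + float(num or 0) / 1000, "", i + 2
        elif ch in "hms":
            total, num, i = total + float(num or 0) * {"h": 3600, "m": 60, "s": 1}[ch], "", i + 1
        else:
            i += 1
    return total


class PacedClient:
    """Paces chat.completions calls using rate-limit headers from prior responses."""

    def __init__(self, client: OpenAI, min_remaining_tokens: int = 2_000):
        self._client = client
        self._min_remaining_tokens = min_remaining_tokens
        self._lock = threading.Lock()
        self._resume_at = 0.0  # monotonic time before which new calls should wait

    def chat_completion(self, **kwargs):
        with self._lock:
            delay = self._resume_at - time.monotonic()
        if delay > 0:
            time.sleep(delay)

        # with_raw_response exposes the HTTP headers alongside the parsed body.
        raw = self._client.chat.completions.with_raw_response.create(**kwargs)

        remaining = int(raw.headers.get("x-ratelimit-remaining-tokens", "0") or 0)
        reset = _parse_duration(raw.headers.get("x-ratelimit-reset-tokens", "0s"))
        retry_after = float(raw.headers.get("retry-after", "0") or 0)

        # If the token budget is nearly exhausted, make subsequent calls wait
        # until the window resets (or for retry-after, whichever is longer).
        backoff = max(retry_after, reset if remaining < self._min_remaining_tokens else 0.0)
        if backoff > 0:
            with self._lock:
                self._resume_at = max(self._resume_at, time.monotonic() + backoff)

        return raw.parse()
```

If something like this lived inside the SyncAPIClient itself, every caller (including LangChain's batch paths) would benefit without having to wrap the client, and the same state could drive a proper request queue rather than a simple sleep.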
Additional context
Related to https://github.com/openai/openai-python/issues/937#issuecomment-1866784701