`httpx.PoolTimeout` occurs frequently with SyncClient
Confirm this is an issue with the Python library and not an underlying OpenAI API
- [X] This is an issue with the Python library
Describe the bug
httpx.PoolTimeout occurs frequently with SyncClient
Recently, we noticed a high number of timeouts. Many requests were getting stuck at the default timeout of 600 seconds.
This was happening before we migrated; we moved to v1.2.3 to try to mitigate it, but requests were still getting stuck at the timeout.
We have managed to mitigate this a little by setting the timeout to 30 seconds and retrying with our own retry logic (the built-in OpenAI retries don't appear to have jitter or exponential backoff and were causing problems at scale).
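For reference, the retry wrapper looks roughly like this (a sketch using tenacity rather than our exact production code; the model name and retry bounds are illustrative):

import httpx
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_random_exponential

# Disable the SDK's built-in retries and do our own exponential backoff with jitter.
client = OpenAI(timeout=httpx.Timeout(30.0), max_retries=0)

@retry(wait=wait_random_exponential(min=1, max=30), stop=stop_after_attempt(5))
def chat(messages):
    return client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)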
Now we are getting httpx.PoolTimeout when using the SyncClient. This causes downstream issues: tasks pile up and we get a flood of httpx.PoolTimeout errors.
We will consider using a custom HTTP client, though I noticed requests getting stuck at the timeout on the old version of the API as well, which was our original motivation to migrate.
In case it helps: this is a production app doing about 3-6 OpenAI requests per second, and the problem seems to line up with busier traffic moments.
To Reproduce
- Use the SyncClient
- Make 3-6 requests per second to the Chat Completions endpoint
- Observe httpx.PoolTimeout errors (a minimal sketch of this load pattern is below)
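A minimal sketch of the load pattern (hypothetical, not our production code; it assumes OPENAI_API_KEY is set and the model name is just illustrative):

import concurrent.futures
from openai import OpenAI  # v1.x SDK, synchronous client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_request(i: int):
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"request {i}"}],
    )

# Many workers sharing one client, roughly matching 3-6 sustained requests per
# second; under load, connections are not released fast enough and
# httpx.PoolTimeout starts to appear.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(make_request, range(500)))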
Code snippets
No response
OS
ubuntu
Python version
Python v3.10.8
Library version
OpenAI v1.2.4
I actually think this is probably just a matter of the default client not being tuned for scale.
We are going to try the following custom client.
import httpx

OPENAI_TIMEOUT = 30  # seconds, the timeout we settled on above

DEFAULT_TIMEOUT = httpx.Timeout(
    timeout=OPENAI_TIMEOUT,
    connect=OPENAI_TIMEOUT,
    pool=OPENAI_TIMEOUT,
)
DEFAULT_LIMITS = httpx.Limits(
    max_connections=500,
    max_keepalive_connections=100,
)
OpenAIHTTPClient = httpx.Client(
    timeout=DEFAULT_TIMEOUT,
    limits=DEFAULT_LIMITS,
)
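For completeness, here is roughly how we wire it into the v1 SDK (http_client, timeout, and max_retries are all constructor arguments on the v1 client; max_retries=0 because we do our own retries as noted above):

from openai import OpenAI

openai_client = OpenAI(
    http_client=OpenAIHTTPClient,  # the custom httpx.Client defined above
    timeout=DEFAULT_TIMEOUT,       # keep the client-level timeout consistent
    max_retries=0,                 # we handle retries ourselves
)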
If it works I can just close this, but it may be worth calling out in the migration guide or the docs that the client should be configured for scale.
I am still concerned about requests getting stuck at the maximum timeout once in a while, which doesn't appear to be related to the Python client since it was happening before we migrated.
Closing! This seemed to resolve the issue. It would still be great if you folks could look into the requests getting stuck at the timeout, though.
Thank you so much Domenic! I agree we should update our defaults here. I appreciate you sharing the ones that worked for you; we may use those as a starting point! (Please let us know here if you find that these limits aren't ideal and you'd suggest something else.)
Do you know of any reason not to go even higher on the max_connections or max_keepalive_connections settings?
Actually I'm going to reopen this because while there is a workaround, I agree that our defaults should be better and I'd like to track that.
cc @RobertCraigie
I am still concerned about requests getting stuck at the maximum timeout once in a while, which doesn't appear to be related to the Python client since it was happening before we migrated.
@domenicrosati are you using the synchronous client when these timeouts occur? We've been getting reports of issues with the asynchronous client but this would be the first with the synchronous version.
From other reports users have mentioned that downgrading to the old version fixed their issues... how frequently were you seeing these timeouts in the v0 SDK?
@RobertCraigie - yes, we're using the synchronous client, and no, downgrading does not fix the issue. It appears to be the same timeout rate on v0 and v1: about 1 in 10 requests timing out.
And this is for non-pool timeouts - these are just regular read timeouts.
By the way, the pool timeouts appeared again with those settings, so I had to increase the limits even more.
Thanks @domenicrosati, what did you bump the pool limit to? Additionally, what timeout are you using?
We have a pretty long timeout by default which, especially if your API calls tend to be quick, will exacerbate the pool issue due to the bug reported in #769, so I would recommend lowering it if you can.
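For example, something along these lines (a rough sketch, not a recommendation of specific values; with_options is the per-request override the SDK provides):

import httpx
from openai import OpenAI

# A shorter overall timeout means a stuck request gives up and releases its pool
# connection sooner than with the default.
client = OpenAI(timeout=httpx.Timeout(20.0, connect=5.0))

# Or lower it for individual calls:
response = client.with_options(timeout=10.0).chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "hello"}],
)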
25k, 10 sessions