`httpx.PoolTimeout` occurs frequently with SyncClient
Confirm this is an issue with the Python library and not an underlying OpenAI API
- [X] This is an issue with the Python library
Describe the bug
httpx.PoolTimeout occurs frequently with SyncClient
Recently, we noticed a high number of timeouts. Many requests were getting stuck at the default timeout of 600 seconds.
This was happening before we migrated; we moved to v1.2.3 to try to mitigate it, but requests were still getting stuck at the timeout.
We have managed to mitigate this a little by setting the timeout to 30 seconds and retrying with our own retry logic (the built-in OpenAI retries don't appear to have jitter or exponential backoff and were causing problems at scale).
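For reference, the retry wrapper looks roughly like this (a sketch using tenacity rather than our exact production code; the model name and retry bounds are illustrative):

import httpx
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_random_exponential

# Disable the SDK's built-in retries and do our own exponential backoff with jitter.
client = OpenAI(timeout=httpx.Timeout(30.0), max_retries=0)

@retry(wait=wait_random_exponential(min=1, max=30), stop=stop_after_attempt(5))
def chat(messages):
    return client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)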
Now we are getting httpx.PoolTimeout when using the SyncClient. This causes downstream issues: tasks pile up and we get a flood of httpx.PoolTimeout errors.
We will consider using a custom HTTP client, though I noticed requests getting stuck at the timeout on the old version of the API as well, which was our original motivation to migrate.
In case it helps: this is a production app doing about 3-6 OpenAI requests per second, and the problem seems to line up with busier traffic moments.
To Reproduce
- Use the SyncClient
- Make 3-6 requests per second to the Chat Completions endpoint
- Observe httpx.PoolTimeout errors (a minimal sketch of this load pattern is below)
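A minimal sketch of the load pattern (hypothetical, not our production code; it assumes OPENAI_API_KEY is set and the model name is just illustrative):

import concurrent.futures
from openai import OpenAI  # v1.x SDK, synchronous client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_request(i: int):
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"request {i}"}],
    )

# Many workers sharing one client, roughly matching 3-6 sustained requests per
# second; under load, connections are not released fast enough and
# httpx.PoolTimeout starts to appear.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(make_request, range(500)))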
Code snippets
No response
OS
ubuntu
Python version
Python v3.10.8
Library version
OpenAI v1.2.4
I actually think this is probably just a matter of the default client not being tuned for scale.
We are going to try the following custom client.
import httpx

OPENAI_TIMEOUT = 30  # seconds, the timeout we settled on above

DEFAULT_TIMEOUT = httpx.Timeout(
    timeout=OPENAI_TIMEOUT,
    connect=OPENAI_TIMEOUT,
    pool=OPENAI_TIMEOUT,
)
DEFAULT_LIMITS = httpx.Limits(
    max_connections=500,
    max_keepalive_connections=100,
)
OpenAIHTTPClient = httpx.Client(
    timeout=DEFAULT_TIMEOUT,
    limits=DEFAULT_LIMITS,
)
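For completeness, here is roughly how we wire it into the v1 SDK (http_client, timeout, and max_retries are all constructor arguments on the v1 client; max_retries=0 because we do our own retries as noted above):

from openai import OpenAI

openai_client = OpenAI(
    http_client=OpenAIHTTPClient,  # the custom httpx.Client defined above
    timeout=DEFAULT_TIMEOUT,       # keep the client-level timeout consistent
    max_retries=0,                 # we handle retries ourselves
)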
If it works I can just close this, but it may be worth calling out in the migration guide or the docs that the client should be configured for scale.
I am still concerned about requests getting stuck at the maximum timeout once in a while, which doesn't appear to be related to the Python client since it was happening before we migrated.
Closing! This seemed to resolve the issue. It would still be great if you folks could look into the requests getting stuck at the timeout, though.
Thank you so much Domenic! I agree we should update our defaults here. I appreciate you sharing the ones that worked for you; we may use those as a starting point! (Please let us know here if you find that these limits aren't ideal and you'd suggest something else.)
Do you know of any reason not to go even higher on the max_connections or max_keepalive_connections settings?
Actually I'm going to reopen this because while there is a workaround, I agree that our defaults should be better and I'd like to track that.
cc @RobertCraigie
I am still concerned about requests getting stuck at the maximum timeout once in a while, which doesn't appear to be related to the Python client since it was happening before we migrated.
@domenicrosati are you using the synchronous client when these timeouts occur? We've been getting reports of issues with the asynchronous client but this would be the first with the synchronous version.
From other reports users have mentioned that downgrading to the old version fixed their issues... how frequently were you seeing these timeouts in the v0 SDK?
@RobertCraigie - yes, we're using the synchronous client, and no, downgrading does not fix the issue. It appears to be the same timeout rate on v0 and v1: about 1 in 10 requests timing out.
And this is for non-pool timeouts - these are just regular read timeouts.
By the way, the pool timeouts appeared again with those settings, so I had to increase the limits even more.
Thanks @domenicrosati, what did you bump the pool limit to? Additionally, what timeout are you using?
We have a pretty long timeout by default which, especially if your API calls tend to be quick, will exacerbate the pool issue due to the bug reported in #769, so I would recommend lowering it if you can.
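For example, something along these lines (a rough sketch, not a recommendation of specific values; with_options is the per-request override the SDK provides):

import httpx
from openai import OpenAI

# A shorter overall timeout means a stuck request gives up and releases its pool
# connection sooner than with the default.
client = OpenAI(timeout=httpx.Timeout(20.0, connect=5.0))

# Or lower it for individual calls:
response = client.with_options(timeout=10.0).chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "hello"}],
)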
25k, 10 sessions