Increase or adjust scope of default HTTP timeout

Open zanieb opened this issue 1 year ago • 6 comments

Our current HTTP timeout is not sufficient for some large packages, e.g. torch.

  • #1920
  • #1912

Increasing the timeout has a downside for people who have network problems while downloading smaller packages: uv will wait longer before surfacing a problem.

What does pip use for a default timeout?

zanieb avatar Feb 23 '24 17:02 zanieb

I would expect that a server only sends an HTTP Timeout response if the connection is idle for too long, which makes me wonder if increasing the timeout solves the problem or only hides the root cause.

MichaReiser avatar Feb 23 '24 17:02 MichaReiser

Yeah good point, I wonder if we're setting the wrong timeout? We shouldn't be enforcing this timeout when data transfer is actively occurring.

zanieb avatar Feb 23 '24 17:02 zanieb

FWIW the default in pip is 15 seconds (https://github.com/pypa/pip/blob/main/src/pip/_internal/cli/cmdoptions.py#L294), so I'm not sure uv's 30(?)-second default is at fault.

samypr100 avatar Feb 24 '24 19:02 samypr100

Could it be that we need to reduce the number of parallel requests? It could also be that they time out because some CPU-intensive task is blocking the main thread, especially with something large like PyTorch.

konstin avatar Feb 26 '24 09:02 konstin

It seems like the next steps are to:

  1. Determine what exactly the timeout we are configuring applies to
  2. Create a simple MRE
  3. Determine what is happening with the request when the error is raised

zanieb avatar Feb 26 '24 23:02 zanieb

See https://github.com/seanmonstar/reqwest/issues/2237

It looks like we're using a deadline for the full request, not a read timeout as desired. We need this functionality to be added upstream.

zanieb avatar Apr 05 '24 22:04 zanieb
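
The distinction drawn in the comment above (a deadline for the full request versus a read timeout) can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not use reqwest or uv code, the names `Outcome`, `with_total_deadline`, and `with_read_timeout` are hypothetical, and the 30-second limit is the unconfirmed figure mentioned earlier in the thread. Each download is modelled as a list of gaps between successive chunks arriving.

```rust
use std::time::Duration;

/// Result of applying a timeout policy to a simulated download.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum Outcome {
    Completed,
    /// The policy fired after this many chunks had been received.
    TimedOut { after_chunks: usize },
}

/// Total deadline: the whole transfer must finish within `deadline`,
/// no matter how steadily data is arriving.
fn with_total_deadline(gaps: &[Duration], deadline: Duration) -> Outcome {
    let mut elapsed = Duration::ZERO;
    for (i, gap) in gaps.iter().enumerate() {
        elapsed += *gap;
        if elapsed > deadline {
            return Outcome::TimedOut { after_chunks: i };
        }
    }
    Outcome::Completed
}

/// Read timeout: the clock restarts whenever a chunk arrives, so only
/// a single idle gap longer than `timeout` aborts the transfer.
fn with_read_timeout(gaps: &[Duration], timeout: Duration) -> Outcome {
    for (i, gap) in gaps.iter().enumerate() {
        if *gap > timeout {
            return Outcome::TimedOut { after_chunks: i };
        }
    }
    Outcome::Completed
}

fn main() {
    // Assumed limit; the thread itself is unsure about the exact value.
    let limit = Duration::from_secs(30);

    // A large but healthy download: 100 chunks, one second apart.
    let steady = vec![Duration::from_secs(1); 100];
    // A stalled download: two quick chunks, then 45 seconds of silence.
    let stalled = vec![
        Duration::from_secs(1),
        Duration::from_secs(1),
        Duration::from_secs(45),
    ];

    println!("steady, total deadline: {:?}", with_total_deadline(&steady, limit));
    println!("steady, read timeout:   {:?}", with_read_timeout(&steady, limit));
    println!("stalled, total deadline: {:?}", with_total_deadline(&stalled, limit));
    println!("stalled, read timeout:   {:?}", with_read_timeout(&stalled, limit));
}
```

In this model the steady 100-second download is aborted by the total deadline even though data never stops flowing, while the read timeout lets it finish; both policies still catch the stalled connection. That matches the behavior described in the thread: large packages fail under a whole-request deadline, and a read timeout would only surface genuinely idle connections.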

See https://github.com/seanmonstar/reqwest/pull/2241

zanieb avatar Apr 12 '24 19:04 zanieb

Hi @zanieb! When can we expect this to be fixed?

huggingface/datasets CI is quite consistently failing because of this issue (I'd assume) as of yesterday (e.g., check the "Install dependencies" step of this CI run).

mariosasko avatar Apr 16 '24 00:04 mariosasko

@mariosasko - How often are you seeing this? We'll ship the HTTP timeout change soon (this week for sure), but I'm still wondering if there's something else going on here and would use huggingface/datasets to investigate if it's at least somewhat frequent.

charliermarsh avatar Apr 16 '24 00:04 charliermarsh

@charliermarsh It's pretty consistent, e.g., see the first two commits' CI runs in https://github.com/huggingface/datasets/pull/6811 (it's more likely to happen than not based on recent CI runs).

PS: Not sure if this info is valuable, but both Windows and Ubuntu GH runners are susceptible to this issue.

mariosasko avatar Apr 16 '24 01:04 mariosasko

OK thanks. I'm gonna do some testing using that CI workflow.

charliermarsh avatar Apr 16 '24 01:04 charliermarsh

So the first issue I ran into is that Windows sometimes fails with pytest missing, because the "Install dependencies" step silently fails:

error: Failed to download: scikit-learn==0.23.1
  Caused by: HTTP status server error (503 Service Unavailable) for url (https://files.pythonhosted.org/packages/7e/e5/888491b7e2c16718a68dfd8498325e8927003410b2d19ba255d8751338a5/scikit_learn-0.23.1-cp38-cp38-win_amd64.whl.metadata)
Resolved 4 packages in 41ms
Downloaded 4 packages in 2.54s
Installed 4 packages in 28ms
 + bleurt==0.0.2 (from git+https://github.com/google-research/bleurt.git@cebe7e6f996b40910cfaa520a63db47807e3bf5c)
 + coval==0.0.1 (from git+https://github.com/ns-moosavi/coval.git@87071a6293dc2e786dcfe2ed78e9369c17e41b3b)
 + math-equivalence==0.0.0 (from git+https://github.com/hendrycks/math.git@357963a7f5501a6c1708cf3f3fb0cdf525642761)
 + unbabel-comet==2.2.2

We can make that failure visible by changing the shell configuration so that the step exits as soon as one command fails, rather than proceeding. But I'm not sure why it's failing in the first place. A 503 when hitting that static endpoint?

charliermarsh avatar Apr 16 '24 16:04 charliermarsh