
Repeated timeouts in GitHub Actions fetching wheels for large packages

adamtheturtle opened this issue 1 year ago • 12 comments

In the last few days since switching to uv, I have seen errors that I have not seen before with pip.

I see:

error: Failed to download distributions
  Caused by: Failed to fetch wheel: torch==2.2.1
  Caused by: Failed to extract source distribution
  Caused by: request or response body error: operation timed out
  Caused by: operation timed out
Error: Process completed with exit code 2.

I see this on the CI for vws-python-mock, which requires installing 150 packages:

uv pip install --upgrade --editable .[dev]
...
Resolved 150 packages in 1.65s
Downloaded 141 packages in 21.41s
Installed 150 packages in 283ms

I do this in parallel across many jobs on GitHub Actions, mostly on ubuntu-latest.

This happened with torch 2.2.0 before the recent release of torch 2.2.1. It has not happened with any other dependencies. The wheels for torch are pretty huge: https://pypi.org/project/torch/#files.

uv is always at the latest version as I run curl -LsSf https://astral.sh/uv/install.sh | sh. In the most recent example, this is uv 0.1.9.

Failures:

  • https://github.com/VWS-Python/vws-python-mock/actions/runs/8017894929/job/21902693117
  • https://github.com/VWS-Python/vws-python-mock/actions/runs/8000017693/job/21848747452
  • https://github.com/VWS-Python/vws-python-mock/actions/runs/8000017693/job/21848749557

adamtheturtle avatar Feb 23 '24 13:02 adamtheturtle

Perhaps I just need to use UV_HTTP_TIMEOUT, and I will (a quick sketch is below the list), but I thought that this might be worth pointing out:

  • If so, the error message could helpfully point to UV_HTTP_TIMEOUT
  • Perhaps the default is too small, given that a popular package times out when installed from GitHub Actions
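
For concreteness, this is the kind of workaround I have in mind: a minimal sketch for a GitHub Actions run step. The 600-second value is an arbitrary guess on my part; UV_HTTP_TIMEOUT is measured in seconds.

export UV_HTTP_TIMEOUT=600   # raise uv's HTTP timeout above the 300s default
uv pip install --upgrade --editable .[dev]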

adamtheturtle avatar Feb 23 '24 13:02 adamtheturtle

Thanks for the feedback; I've opened issues for your requests:

  • https://github.com/astral-sh/uv/issues/1921
  • https://github.com/astral-sh/uv/issues/1922

zanieb avatar Feb 23 '24 17:02 zanieb

Thank you @zanieb! I'm not sure there's much value in keeping this issue open, but I'll leave it to you to close if desired.

adamtheturtle avatar Feb 23 '24 17:02 adamtheturtle

In #1921 my co-worker noted that this might be a bug in the way we're specifying the timeout, so I'll recategorize this one and leave it open.

zanieb avatar Feb 23 '24 17:02 zanieb

Looking at the Actions runs, all the passing runs take ~30s, while the failing ones error after 5 min, which is our default timeout, so this looks like a network failure (in either GitHub Actions or Rust).

konstin avatar Feb 28 '24 14:02 konstin

I'm not seeing any timeouts anymore with the two most recent versions (https://github.com/konstin/vws-python-mock/actions). Could you check if this is now solved?

konstin avatar Mar 01 '24 11:03 konstin

I have not seen this issue since posting. Thank you for looking into this.

adamtheturtle avatar Mar 01 '24 11:03 adamtheturtle

I'll close it for now; please feel free to reopen should it reoccur.

konstin avatar Mar 01 '24 11:03 konstin

@konstin I do not have permissions to re-open this issue. I can create a new one, but it is probably easier if you re-open this.

This failure has reoccurred:

  • https://github.com/VWS-Python/vws-python-mock/actions/runs/8134588970/job/22227737289
  • https://github.com/VWS-Python/vws-python-mock/actions/runs/8134588970/job/22227737596

adamtheturtle avatar Mar 04 '24 01:03 adamtheturtle

I'm seeing a very similar error message for a non-PyTorch package that's also pretty large. It's a ~400 MB wheel and consistently gives me:

(bento_uv2) pa-loaner@C02DVAQNMD6R training-platform % uv pip install --index-url=$REGISTRY_INDEX data-mesh-cli==0.0.66
error: Failed to download: data-mesh-cli==0.0.66
  Caused by: The wheel data_mesh_cli-0.0.66-py3-none-any.whl is not a valid zip file
  Caused by: an upstream reader returned an error: request or response body error: operation timed out
  Caused by: request or response body error: operation timed out
  Caused by: operation timed out

The package is a company-internal one, but I think the only notable thing about it is its very large size (it vendors Spark/Java stuff).

edit: PyTorch, weirdly, installs fine for me and pretty quickly.

hmc-cs-mdrissi avatar Mar 05 '24 03:03 hmc-cs-mdrissi

I have changed the title of this to not reference torch. It recently happened with nvidia-cudnn-cu12, another large download.

As another example, https://github.com/VWS-Python/vws-python-mock/actions/runs/8262236134 has 7 failures in one run.

adamtheturtle avatar Mar 13 '24 09:03 adamtheturtle

It can happen on Read the Docs as well, not only GHA: https://beta.readthedocs.org/projects/kedro-datasets/builds/23790543/

astrojuanlu avatar Mar 18 '24 20:03 astrojuanlu

Spotted it today inside a local Docker image running under QEMU:

error: Failed to download distributions
  Caused by: Failed to fetch wheel: nvidia-cublas-cu12==12.1.3.1
  Caused by: Failed to extract archive
  Caused by: Failed to download distribution due to network timeout. Try increasing UV_HTTP_TIMEOUT (current value: 300s).
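
In case it helps anyone else hitting this in a container, the timeout can be passed straight through Docker. This is only a sketch: the image name is a placeholder and the 900-second value is a guess I have not validated.

# pass a larger timeout into the container environment
docker run -e UV_HTTP_TIMEOUT=900 my-build-image \
    uv pip install nvidia-cublas-cu12==12.1.3.1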

astrojuanlu avatar Mar 24 '24 06:03 astrojuanlu

I encountered the problem when I used either uv or pip to download large wheels (for pip, the relevant issues are https://github.com/pypa/pip/issues/4796 and https://github.com/pypa/pip/issues/11153), so I think the root cause is the network. However, I am wondering if uv could be smarter and retry automatically, like something in https://github.com/pypa/pip/pull/11180.
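
In the meantime, a crude retry can be scripted around uv itself. A minimal sketch, where the attempt count and the torch pin are arbitrary; wheels that finished downloading on an earlier attempt should come from uv's cache on the next one.

# retry the install up to three times, pausing between attempts
for attempt in 1 2 3; do
    uv pip install torch==2.2.1 && break
    echo "attempt $attempt failed, retrying..." >&2
    sleep 10
done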

njzjz avatar Apr 19 '24 22:04 njzjz

Worth trying 0.1.35, which includes https://github.com/astral-sh/uv/pull/3144

astrojuanlu avatar Apr 20 '24 06:04 astrojuanlu

It seems likely that this is resolved by #3144.

zanieb avatar Apr 21 '24 15:04 zanieb

> I encountered the problem when I used either uv or pip to download large wheels (for pip, the relevant issues are https://github.com/pypa/pip/issues/4796 and https://github.com/pypa/pip/issues/11153), so I think the root cause is the network. However, I am wondering if uv could be smarter and retry automatically, like something in https://github.com/pypa/pip/pull/11180.

That would be a great feature. We have our dev environments behind TLS inspection, and some packages often run into a timeout due to slow inspection. We can reproduce this with a browser: the download gets stuck until it times out, but in the browser we can just click resume and it reconnects and downloads the remaining part. With uv there is no retry with resume, so the download starts from scratch and gets stuck again.

+1 for retry with resume
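
For illustration, the behavior being asked for is roughly what curl's resume flag does: re-request only the missing byte range instead of starting the download from scratch. The URL below is just a placeholder.

# -C - resumes from wherever the previous partial download stopped (HTTP Range request)
curl -C - -L -O https://example.com/path/to/large-package.whl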

OneCyrus avatar Apr 25 '24 20:04 OneCyrus

Going to close for now, but we can re-open if this comes up again now that the timeout semantics have changed.

charliermarsh avatar May 01 '24 16:05 charliermarsh