Add support for retrying connection failures

Open yeswalrus opened this issue 1 year ago • 14 comments

Tested with uv 0.1.41

While looking for workarounds to https://github.com/astral-sh/uv/issues/3512, I experimented with using --no-cache. In our case this actually increased the failure rate, as we use a local Artifactory-based PyPI mirror which appears to be rate limited. Using the same bash script as in the linked issue, but with --no-cache added and with significantly more packages, we began observing instances where the Artifactory server would be overwhelmed and uv would fail with "connection reset by peer".

Under pip this was also reproducible, though it required a larger number of simultaneous processes (~80); but rather than failing outright, pip would retry the connection failures. Please add (ideally configurable) support for retrying connection failures when downloading packages or querying indexes.

yeswalrus avatar May 10 '24 17:05 yeswalrus

Can you share more information, e.g., a trace with --verbose to demonstrate what's failing? In general, we do already retry.

charliermarsh avatar May 10 '24 18:05 charliermarsh

This may be an issue with our Artifactory server having an artificially low rate limit, but I'm not entirely sure. Logs here:

Using Python 3.10.12 interpreter at: /usr/bin/python3
Creating virtualenv at: test_venvs/6c38fa92-edb8-4dc3-8c34-6a395504bd9d/
Activate with: source test_venvs/6c38fa92-edb8-4dc3-8c34-6a395504bd9d/bin/activate
Built 6 editables in 1.74s
Resolved 114 packages in 10.33s
error: Failed to download distributions
  Caused by: Failed to fetch wheel: plotly==4.12.0
  Caused by: Failed to extract archive
  Caused by: error decoding response body
  Caused by: request or response body error
  Caused by: error reading a body from connection
  Caused by: Connection reset by peer (os error 104)

yeswalrus avatar May 10 '24 22:05 yeswalrus

Potentially related to #3456

(worth noting that requests are already retried about 3 times by default)

samypr100 avatar May 10 '24 22:05 samypr100

We don't retry at the TCP layer, e.g., connection reset errors, afaik. I think we'll need to add custom middleware to retry these.

If someone could link or summarize the relevant details about what exactly is retried today that'd be really helpful.

zanieb avatar May 11 '24 02:05 zanieb

Fair, I only did a quick look and it seemed like it would retry based on reqwest-retry, but is_connect is likely not what I thought it was, and io::ErrorKind::ConnectionReset would need to be checked separately.

Relevant uv code is here https://github.com/astral-sh/uv/blob/main/crates/uv-client/src/base_client.rs#L169
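
For anyone not following the link, the shape of that wiring is roughly the following. This is a hedged approximation built from reqwest-retry's public API, not a copy of uv's code, and the retry count is illustrative:

use reqwest_middleware::{ClientBuilder, ClientWithMiddleware};
use reqwest_retry::{policies::ExponentialBackoff, RetryTransientMiddleware};

// Hedged approximation: an exponential-backoff policy wrapped in reqwest-retry's
// transient-failure middleware, attached to the reqwest client at build time.
fn build_retrying_client(max_retries: u32) -> ClientWithMiddleware {
    let policy = ExponentialBackoff::builder().build_with_max_retries(max_retries);
    ClientBuilder::new(reqwest::Client::new())
        .with(RetryTransientMiddleware::new_with_policy(policy))
        .build()
}

With this setup, only failures that the middleware's retry strategy classifies as transient are retried, which is where the question about connection resets comes in.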

samypr100 avatar May 11 '24 02:05 samypr100

Looking at it again, it does seem like it would retry on resets, see retryable_strategy.rs#L192
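
If resets turned out not to be covered there, one way to check for them explicitly would be a custom strategy. A hypothetical sketch against reqwest-retry's RetryableStrategy trait, not uv's actual implementation:

use reqwest_retry::{DefaultRetryableStrategy, Retryable, RetryableStrategy};

// Hypothetical strategy: walk the error's source chain and treat a connection
// reset as transient, deferring everything else to the default classification.
struct RetryConnectionReset;

impl RetryableStrategy for RetryConnectionReset {
    fn handle(
        &self,
        res: &Result<reqwest::Response, reqwest_middleware::Error>,
    ) -> Option<Retryable> {
        if let Err(err) = res {
            let mut source = Some(err as &(dyn std::error::Error + 'static));
            while let Some(cause) = source {
                if let Some(io_err) = cause.downcast_ref::<std::io::Error>() {
                    if io_err.kind() == std::io::ErrorKind::ConnectionReset {
                        // "Connection reset by peer (os error 104)"
                        return Some(Retryable::Transient);
                    }
                }
                source = cause.source();
            }
        }
        DefaultRetryableStrategy.handle(res)
    }
}

Such a strategy would be installed with RetryTransientMiddleware::new_with_policy_and_strategy in place of the default.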

samypr100 avatar May 11 '24 02:05 samypr100

Complete (slightly obfuscated) log of one such instance, captured with --verbose enabled, in case it helps with debugging. It's worth noting this is a less serious issue - I only start seeing it reliably when I've got ~40 simultaneous venvs being created with --no-cache set, at least with my artificial benchmark, so this is something of an edge case. Still, having some ability to throttle requests or specify longer timeout windows might be helpful.

uv_err.log

yeswalrus avatar May 11 '24 02:05 yeswalrus

Thanks for the links @samypr100, cc @konstin ?

zanieb avatar May 11 '24 12:05 zanieb

FYI, I'm also seeing this same error semi-often in my work environment:

⠙ defusedxml==0.7.1
error: Failed to download `defusedxml==0.7.1`
  Caused by: request or response body error
  Caused by: error reading a body from connection
  Caused by: Connection reset by peer (os error 104)

I'm not sure of the details, but we have some kind of Palo Alto firewall that is almost certainly causing the issue; running the uv command again almost always succeeds.

I'm going to test setting the new UV_CONCURRENT_DOWNLOADS (https://github.com/astral-sh/uv/pull/3493) to 1 and see if that makes a difference in how often I see this error. Though pip also has issues in my work environment from time to time, so it may just reduce the frequency of issues.

notatallshaw avatar May 30 '24 16:05 notatallshaw

I assume those retries don't apply to streaming operations. Like, they apply to the initial request itself, but probably not the entirety of reading from the stream? We can fix this by adding our own retry around download operations.
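
A minimal sketch of that idea, assuming a hypothetical helper rather than uv's actual code: wrap the whole fetch-and-extract operation in its own retry loop, because the per-request middleware stops watching once the response headers have arrived.

use std::time::Duration;

// Hypothetical helper: retry an entire streaming operation (request plus body
// read and extract), since per-request middleware can't see errors that occur
// while the body is being consumed. Attempt count and backoff are illustrative.
async fn with_download_retries<T, E, F, Fut>(max_attempts: u32, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    let mut attempt = 0;
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt + 1 < max_attempts => {
                attempt += 1;
                eprintln!("transient download error (attempt {attempt}): {err}");
                // Simple exponential backoff between attempts.
                tokio::time::sleep(Duration::from_millis(250 * 2u64.pow(attempt))).await;
            }
            Err(err) => return Err(err),
        }
    }
}

The closure passed in would perform the request, stream the body, and unzip the wheel, so a mid-stream reset restarts the whole download instead of bubbling up as a fatal error.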

charliermarsh avatar Jun 02 '24 18:06 charliermarsh

@charliermarsh is there any work planned around this issue? We reconfigured UV_CONCURRENT_DOWNLOADS to avoid running into it as frequently. While that helped, our CI builds still run into connection failures here and there.

aybidi avatar Jun 25 '24 15:06 aybidi

Hi @aybidi — we think this is important to address but haven't started working on a fix yet. If anyone is willing to investigate and contribute in the meantime I'm happy to review.

zanieb avatar Jun 25 '24 16:06 zanieb

I'm going to test setting the new UV_CONCURRENT_DOWNLOADS (#3493) to 1 and see if that makes a difference in how often I see this error.

FWIW, for my use case setting this did not help, but one day the firewall just started behaving better and I've not seen the error again.

notatallshaw avatar Jun 25 '24 16:06 notatallshaw

Here's an example of a similar class of error that I observed in a CI build that would be nice to have uv retry automatically:

error: Failed to download distributions
  Caused by: Failed to fetch wheel: torch==2.1.0
  Caused by: Failed to extract archive
  Caused by: request or response body error: error reading a body from connection: stream error received: unspecific protocol error detected
  Caused by: error reading a body from connection: stream error received: unspecific protocol error detected
  Caused by: stream error received: unspecific protocol error detected

kujenga avatar Jul 01 '24 16:07 kujenga

I'm also hitting this issue. It's rather painful when going through a corporate proxy that may have concurrent connection limits, and retrying manually doesn't always work in CI.

messense avatar Jul 09 '24 07:07 messense

@messense With the latest release, what's the error message? Does it fail with retries or without?

konstin avatar Jul 09 '24 08:07 konstin

I'm using uv 0.2.23; the error message is:

error: Failed to prepare distributions
  Caused by: Failed to fetch wheel: jaxlib==0.4.16+cuda12.cudnn89
  Caused by: Failed to extract archive
  Caused by: error decoding response body
  Caused by: request or response body error
  Caused by: error reading a body from connection
  Caused by: Connection reset by peer (os error 104)

messense avatar Jul 09 '24 08:07 messense

Judging from the code, reqwest-retry simply won't retry body errors.
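
Illustratively (a sketch, not uv's actual code), the retry middleware only wraps send(), which resolves once the response headers arrive; the wheel body is read afterwards, outside the middleware, so a reset there surfaces as the "error reading a body from connection" chain with no retry:

use reqwest_middleware::ClientWithMiddleware;

// Illustrative only: `send()` is covered by the retry middleware, but the body
// is read afterwards, so a connection reset during `bytes()` is never retried.
async fn fetch_wheel(
    client: &ClientWithMiddleware,
    wheel_url: &str,
) -> anyhow::Result<bytes::Bytes> {
    let response = client.get(wheel_url).send().await?; // transient failures retried here
    let body = response.bytes().await?; // body errors surface here, after the middleware has returned
    Ok(body)
}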

messense avatar Jul 10 '24 02:07 messense

Thanks, good catch! Let's fix that.

charliermarsh avatar Jul 10 '24 02:07 charliermarsh

Sounds like the fix would need to be on uv's end: https://github.com/TrueLayer/reqwest-middleware/issues/47#issuecomment-1170955570

benjamin-hodgson avatar Jul 10 '24 11:07 benjamin-hodgson

I put up a patch at #4960 if anyone wants to give it a try.

zanieb avatar Jul 10 '24 14:07 zanieb

I'll take this build for a spin on my testing farm. The networking issues I was suffering were infrequent, so I won't be able to say for certain, but if the error shows up I'll let you know.

benjamin-hodgson avatar Jul 10 '24 16:07 benjamin-hodgson

We'll probably release it soon too! Thanks though :)

zanieb avatar Jul 10 '24 16:07 zanieb

That build consistently gets stack overflow errors for me. I'm on Windows.

> uv.exe pip install --verbose cowsay

DEBUG uv 0.2.23
DEBUG Searching for Python interpreter in system path or `py` launcher
DEBUG Found cpython 3.12.3 at `D:\CosmosAnalytics\private\cosmos\Scope\pyscope\.venv\Scripts\python.exe` (virtual environment)
DEBUG Using Python 3.12.3 environment at .venv\Scripts\python.exe
DEBUG Acquired lock for `.venv`
DEBUG At least one requirement is not satisfied: cowsay
DEBUG Using request timeout of 30s
DEBUG Solving with installed Python version: 3.12.3
DEBUG Adding direct dependency: cowsay*
DEBUG No cache entry for: https://pypi.org/simple/cowsay/
DEBUG Searching for a compatible version of cowsay (*)
DEBUG Selecting: cowsay==6.1 (cowsay-6.1-py3-none-any.whl)
DEBUG No cache entry for: https://files.pythonhosted.org/packages/f1/13/63c0a02c44024ee16f664e0b36eefeb22d54e93531630bd99e237986f534/cowsay-6.1-py3-none-any.whl.metadata

thread 'main' has overflowed its stack

benjamin-hodgson avatar Jul 10 '24 17:07 benjamin-hodgson

Are you using a release build?

charliermarsh avatar Jul 10 '24 17:07 charliermarsh

I just grabbed the artifact from the CI build; is that not what I was meant to do?

benjamin-hodgson avatar Jul 10 '24 17:07 benjamin-hodgson

Found this, will try with that env var

benjamin-hodgson avatar Jul 10 '24 18:07 benjamin-hodgson

@zanieb #4960 does not work for me; it still gives the same error:

installing uv...
done! ✨ 🌟 ✨
  installed package uv 0.2.24, installed using Python 3.10.12
...
...
Resolved 258 packages in 34.54s
error: Failed to prepare distributions
  Caused by: Failed to fetch wheel: nvidia-cudnn-cu12==8.9.2.26
  Caused by: Failed to extract archive
  Caused by: error decoding response body
  Caused by: request or response body error
  Caused by: error reading a body from connection
  Caused by: Connection reset by peer (os error 104)

messense avatar Jul 11 '24 02:07 messense

That's so tragic 😭 okay I'll dig deeper. I'll need to reproduce it with a fake server or something. If anyone has time to poke at a reproduction, let me know!

zanieb avatar Jul 11 '24 03:07 zanieb

If I add a panic!() inside https://github.com/astral-sh/uv/blob/23c6cd774b466924d02de23add7101bcaa7b7c3e/crates/uv-client/src/base_client.rs#L291 and then inject a connection reset error using sudo iptables -A OUTPUT -p tcp --dport 3128 -j REJECT --reject-with tcp-reset (our proxy is running on port 3128), the panic never happens and only the Connection reset by peer error message is printed, so my guess is that reqwest-retry or reqwest-middleware can't handle this kind of retry at the moment.

messense avatar Jul 12 '24 06:07 messense