interrupted download reports as hash failure
Description
follow-up to #4930
When a large package download is interrupted on a bad link, pip reports a hash mismatch instead of the interrupted download. This leads to misidentifying the problem at first.
Collecting $REDACTED
Downloading $REDACTED
━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.8/52.4 MB 51.3 kB/s eta 0:12:13
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
$REDACTED from $REDACTED#md5=04b4d65eda8bf72ae203d40031aa76a3:
Expected md5 04b4d65eda8bf72ae203d40031aa76a3
Got c66b2d113159da2c6911c475ec00b26f
Expected behavior
pip should report the download as interrupted to indicate the actual problem. The hash will of course differ if you hash a subset instead of all the data, but the error happens while obtaining the data, so failing at the hash check is misleading.
I was earnestly trying to figure out where my data had gotten corrupted until I realized that the download had not actually finished.
pip version
22.1.1
Python version
3.8
OS
Fedora
How to Reproduce
Unfortunately, I cannot quickly provide a reproducer for a broken network.
I wonder why pip treats the download as successfully completed in the first place. Is this a limitation in requests or even urllib3?
pip reads directly from Response.raw.stream and it seems that urllib3 does not raise an error if the connection gets closed while reading chunks. I don't know enough about urllib3 to tell whether it should raise an error or not. However, what pip can do is keep count of the downloaded bytes, compare them to the response's Content-Length header before checking hashes, and let the user know that the download was not successful. That seems like a fairly small change and would prevent confusion for the user. I can open an initial PR, unless you think this should be handled by urllib3.
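A minimal sketch of the byte-counting idea described above, using only the standard library. The names `checked_chunks` and `IncompleteDownloadError` are hypothetical and are not part of pip; the sketch also assumes the stream yields the raw, undecoded body, since Content-Length describes the encoded bytes on the wire.

```python
import io
from typing import BinaryIO, Iterator, Optional


class IncompleteDownloadError(Exception):
    """Hypothetical error type for this sketch; not part of pip."""

    def __init__(self, received: int, expected: int) -> None:
        super().__init__(
            f"Download interrupted: received {received} of {expected} bytes"
        )
        self.received = received
        self.expected = expected


def checked_chunks(
    stream: BinaryIO, content_length: Optional[int], chunk_size: int = 8192
) -> Iterator[bytes]:
    """Yield chunks from stream, then verify the total against Content-Length."""
    received = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        received += len(chunk)
        yield chunk
    # Only check when the server actually sent a Content-Length header.
    if content_length is not None and received != content_length:
        raise IncompleteDownloadError(received, content_length)
```

Because the check runs once the stream is exhausted, a caller that consumes all chunks before hashing would see the interruption error first and never reach the hash comparison.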
We're consistently seeing this when downloading whls/artifacts that are ~20MB+. We can look into what's causing the networking flakes but this has been a confusing error that we're regularly seeing. Would be very supportive of this change.
I’m marking this as help wanted since it requires someone that can reliably reproduce this to look into how urllib3 marks the download as complete, and how to perform further sniffing in pip’s networking code to work around this. I would strongly suggest anyone reaching here to attempt to dig deeper into urllib3 to figure out what exactly went wrong and work on a pull request.
https://github.com/psf/requests/issues/4956 perhaps
I've also been seeing this issue occasionally in CI builds and have started to investigate it. I've set up an intentionally broken local Flask server that proxies PyPI but randomly truncates the response, and I can reproduce this error.
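For anyone who wants to reproduce a truncated response without a flaky network, here is a rough stand-in for that kind of broken server. It is not the Flask proxy described above: it uses only the standard library, serves a fixed body instead of proxying PyPI, and always truncates rather than doing so randomly. All names (`TruncatingHandler`, `fetch_from_truncating_server`, the sizes) are made up for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional, Tuple

FULL_BODY = b"x" * 1000  # what the server claims to send
SENT_BYTES = 10          # what it actually sends


class TruncatingHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(200)
        self.send_header("Content-Length", str(len(FULL_BODY)))
        self.end_headers()
        # Send only a prefix; the connection is closed when the handler returns.
        self.wfile.write(FULL_BODY[:SENT_BYTES])

    def log_message(self, format: str, *args: object) -> None:
        pass  # keep the demo quiet


def fetch_from_truncating_server() -> Optional[Tuple[int, int]]:
    """Return (bytes received, bytes still expected) if the read was incomplete."""
    server = HTTPServer(("127.0.0.1", 0), TruncatingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        try:
            conn.request("GET", "/")
            response = conn.getresponse()
            try:
                response.read()
            except http.client.IncompleteRead as exc:
                return len(exc.partial), exc.expected
            return None
        finally:
            conn.close()
    finally:
        server.shutdown()
        server.server_close()
```

Here the standard library's `http.client` notices the short body and raises `IncompleteRead`, which matches the stricter behaviour being asked for in this thread.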
It's true that this is related to the linked requests/urllib3 enforce_content_length issue, which is resolved as of urllib3 v2.0. Unfortunately, upgrading urllib3 alone is not sufficient to resolve this issue (although upgrading urllib3 does give a better error message). The problem is that, due to the way pip/requests streams the response from urllib3, the urllib3 retry logic which pip depends on is bypassed. This can actually happen in two places:
Tracebacks
Response truncated downloading package
Downloading http://127.0.0.1:5000/files/packages/fa/1a/f191d32818e5cd985bdd3f47a6e4f525e2db1ce5e8150045ca0c31813686/Flask-2.3.2-py3-none-any.whl (96 kB)
ERROR: Exception:
Traceback (most recent call last):
File "./pip/_vendor/urllib3/response.py", line 704, in _error_catcher
yield
File "./pip/_vendor/urllib3/response.py", line 829, in _raw_read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
pip._vendor.urllib3.exceptions.IncompleteRead: IncompleteRead(10 bytes read, 96857 more expected)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./pip/_internal/cli/base_command.py", line 180, in exc_logging_wrapper
status = run_func(*args)
^^^^^^^^^^^^^^^
File "./pip/_internal/cli/req_command.py", line 248, in wrapper
return func(self, options, args)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/commands/install.py", line 377, in run
requirement_set = resolver.resolve(
^^^^^^^^^^^^^^^^^
File "./pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
result = self._result = resolver.resolve(
^^^^^^^^^^^^^^^^^
File "./pip/_vendor/resolvelib/resolvers.py", line 546, in resolve
state = resolution.resolve(requirements, max_rounds=max_rounds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_vendor/resolvelib/resolvers.py", line 397, in resolve
self._add_to_criteria(self.state.criteria, r, parent=None)
File "./pip/_vendor/resolvelib/resolvers.py", line 173, in _add_to_criteria
if not criterion.candidates:
File "./pip/_vendor/resolvelib/structs.py", line 156, in __bool__
return bool(self._sequence)
^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
return any(self)
^^^^^^^^^
File "./pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
return (c for c in iterator if id(c) not in self._incompatible_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
candidate = func()
^^^^^^
File "./pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
self._link_candidate_cache[link] = LinkCandidate(
^^^^^^^^^^^^^^
File "./pip/_internal/resolution/resolvelib/candidates.py", line 293, in __init__
super().__init__(
File "./pip/_internal/resolution/resolvelib/candidates.py", line 156, in __init__
self.dist = self._prepare()
^^^^^^^^^^^^^^^
File "./pip/_internal/resolution/resolvelib/candidates.py", line 225, in _prepare
dist = self._prepare_distribution()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/resolution/resolvelib/candidates.py", line 304, in _prepare_distribution
return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/operations/prepare.py", line 540, in prepare_linked_requirement
return self._prepare_linked_requirement(req, parallel_builds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/operations/prepare.py", line 611, in _prepare_linked_requirement
local_file = unpack_url(
^^^^^^^^^^^
File "./pip/_internal/operations/prepare.py", line 168, in unpack_url
file = get_http_url(
^^^^^^^^^^^^^
File "./pip/_internal/operations/prepare.py", line 109, in get_http_url
from_path, content_type = download(link, temp_dir.path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/network/download.py", line 147, in __call__
for chunk in chunks:
File "./pip/_internal/cli/progress_bars.py", line 53, in _rich_progress_bar
for chunk in iterable:
File "./pip/_internal/network/utils.py", line 63, in response_chunks
for chunk in response.raw.stream(
File "./pip/_vendor/urllib3/response.py", line 934, in stream
data = self.read(amt=amt, decode_content=decode_content)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_vendor/urllib3/response.py", line 873, in read
data = self._raw_read(amt)
^^^^^^^^^^^^^^^^^^^
File "./pip/_vendor/urllib3/response.py", line 807, in _raw_read
with self._error_catcher():
File "/usr/lib64/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "./pip/_vendor/urllib3/response.py", line 721, in _error_catcher
raise ProtocolError(f"Connection broken: {e!r}", e) from e
pip._vendor.urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(10 bytes read, 96857 more expected)', IncompleteRead(10 bytes read, 96857 more expected))
Response truncated getting package metadata
http://127.0.0.1:5000 "GET /pypi/simple/flask/ HTTP/1.1" 200 39262
ERROR: Could not install packages due to an OSError.
Traceback (most recent call last):
File "./pip/_vendor/urllib3/response.py", line 704, in _error_catcher
yield
File "./pip/_vendor/urllib3/response.py", line 829, in _raw_read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
pip._vendor.urllib3.exceptions.IncompleteRead: IncompleteRead(10 bytes read, 39252 more expected)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./pip/_vendor/requests/models.py", line 816, in generate
yield from self.raw.stream(chunk_size, decode_content=True)
File "./pip/_vendor/urllib3/response.py", line 934, in stream
data = self.read(amt=amt, decode_content=decode_content)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_vendor/urllib3/response.py", line 905, in read
data = self._raw_read(amt)
^^^^^^^^^^^^^^^^^^^
File "./pip/_vendor/urllib3/response.py", line 807, in _raw_read
with self._error_catcher():
File "/usr/lib64/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "./pip/_vendor/urllib3/response.py", line 721, in _error_catcher
raise ProtocolError(f"Connection broken: {e!r}", e) from e
pip._vendor.urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(10 bytes read, 39252 more expected)', IncompleteRead(10 bytes read, 39252 more expected))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./pip/_internal/commands/install.py", line 377, in run
requirement_set = resolver.resolve(
^^^^^^^^^^^^^^^^^
File "./pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
result = self._result = resolver.resolve(
^^^^^^^^^^^^^^^^^
File "./pip/_vendor/resolvelib/resolvers.py", line 546, in resolve
state = resolution.resolve(requirements, max_rounds=max_rounds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_vendor/resolvelib/resolvers.py", line 397, in resolve
self._add_to_criteria(self.state.criteria, r, parent=None)
File "./pip/_vendor/resolvelib/resolvers.py", line 173, in _add_to_criteria
if not criterion.candidates:
File "./pip/_vendor/resolvelib/structs.py", line 156, in __bool__
return bool(self._sequence)
^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
return any(self)
^^^^^^^^^
File "./pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
return (c for c in iterator if id(c) not in self._incompatible_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/resolution/resolvelib/found_candidates.py", line 44, in _iter_built
for version, func in infos:
File "./pip/_internal/resolution/resolvelib/factory.py", line 279, in iter_index_candidate_infos
result = self._finder.find_best_candidate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/index/package_finder.py", line 890, in find_best_candidate
candidates = self.find_all_candidates(project_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/index/package_finder.py", line 831, in find_all_candidates
page_candidates = list(page_candidates_it)
^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/index/sources.py", line 134, in page_candidates
yield from self._candidates_from_page(self._link)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/index/package_finder.py", line 791, in process_project_url
index_response = self._link_collector.fetch_response(project_url)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/index/collector.py", line 461, in fetch_response
return _get_index_content(location, session=self.session)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/index/collector.py", line 364, in _get_index_content
resp = _get_simple_response(url, session=session)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/index/collector.py", line 135, in _get_simple_response
resp = session.get(
^^^^^^^^^^^^
File "./pip/_vendor/requests/sessions.py", line 602, in get
return self.request("GET", url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_internal/network/session.py", line 519, in request
return super().request(method, url, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_vendor/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_vendor/requests/sessions.py", line 747, in send
r.content
File "./pip/_vendor/requests/models.py", line 899, in content
self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./pip/_vendor/requests/models.py", line 818, in generate
raise ChunkedEncodingError(e)
pip._vendor.requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(10 bytes read, 39252 more expected)', IncompleteRead(10 bytes read, 39252 more expected))
(This second error would be reported as JSONDecodeError in the current version of pip.)
There's some related retry discussion here: https://github.com/urllib3/urllib3/issues/542
In essence, the issue is that (in this specific scenario), pip/requests/urllib3 don't cooperate very well to retry failed requests. I suspect that fixing this issue will require some other changes external to pip.
There's a bunch of moving pieces here, so I'll just outline the steps which I believe are required to resolve these issues.
- https://github.com/pypa/pip/issues/12857, which checks that Content-Length matches the body length.
- https://github.com/pypa/pip/issues/4796 / https://github.com/pypa/pip/pull/11180, which adds some retry functionality into pip for downloading packages. (For downloading files, pip uses urllib3 directly. For other things, pip uses requests.)
- https://github.com/psf/requests/issues/6512, which would allow pip to retry other failed requests.
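To illustrate how the first two steps could fit together, here is a rough sketch of a retry loop around a length check. It is not what the linked issues or PR implement: `download_with_retries` and the `fetch` callable are hypothetical, and `fetch` is assumed to return the body it received together with the advertised Content-Length (or `None` if the header was absent).

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical shape: returns (body received, advertised Content-Length or None).
Fetch = Callable[[], Tuple[bytes, Optional[int]]]


def download_with_retries(fetch: Fetch, attempts: int = 3, backoff: float = 0.0) -> bytes:
    """Call fetch until the body length matches Content-Length or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        body, expected = fetch()
        # Without a Content-Length there is nothing to compare against.
        if expected is None or len(body) == expected:
            return body
        last_error = ConnectionError(
            f"attempt {attempt}: received {len(body)} of {expected} bytes"
        )
        if attempt < attempts:
            time.sleep(backoff * attempt)  # simple linear backoff
    assert last_error is not None
    raise last_error
```

A real implementation would probably resume with a Range request rather than restart from scratch, but that depends on server support and is left out here.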
any workarounds?