
[Improvement] Pip could resume a package download partway when the connection is poor

Open winstonma opened this issue 6 years ago • 26 comments

  • Pip version: 9.0.1
  • Python version: 3.6.2
  • Operating system: macOS 10.13

Description

When I have a poor internet connection (the network cuts out unexpectedly), updating a pip package is painful. When I retry the pip install, it stops at the midpoint and gives me the same md5 error.

All I have to do is:

  1. Download the package from PyPI (using a browser or wget, both of which can retry/resume)
  2. pip install the downloaded file
  3. Remove the file

If pip's download had a resume feature, the problem would be solved.

What I've run

pip install -U jupyterlab in poor network conditions

Collecting jupyterlab
  Downloading jupyterlab-0.28.4-py2.py3-none-any.whl (8.7MB)
    4% |█▋                              | 430kB 1.1MB/s eta 0:00:08
THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    jupyterlab from https://pypi.python.org/packages/b1/6d/d1d033186a07e08af9dc09db41401af7d6e18f98b73bd3bef75a1139dd1b/jupyterlab-0.28.4-py2.py3-none-any.whl#md5=9a93b1dc85f5924151f0ae9670024bd0:
        Expected md5 9a93b1dc85f5924151f0ae9670024bd0
             Got        4b6835257af9609a227a72b18ea011e3

winstonma avatar Oct 20 '17 05:10 winstonma
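The mismatch in the output above is integrity checking doing its job on a truncated file. A minimal sketch of the effect (stand-in bytes, not a real wheel):

```python
import hashlib

# Stand-in bytes for a wheel; any download cut off partway hashes
# differently from the complete file, which is exactly the mismatch
# pip reports against the index's expected digest.
full = b"wheel-bytes " * 1000
partial = full[: len(full) // 2]  # connection dropped halfway

print(hashlib.md5(full).hexdigest())
print(hashlib.md5(partial).hexdigest())
```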

I don't know how pip's hashing works, but here's some working, simple, modular resume code in a single file/function: https://gist.github.com/CTimmerman/ccf884f8c8dcc284588f1811ed99be6c

CTimmerman avatar Jul 09 '18 12:07 CTimmerman

I have a poor connection and I often resume pip manually using wget.

This is easy for a wheel using wget -c, and then you can install the wheel with pip; but when it's a tarball I have to use the setup script, and I don't get the same result, even though in the end it works.

seandepagnier avatar Mar 24 '19 21:03 seandepagnier

This should be easier to implement now since all the logic regarding downloads is isolated in pip._internal.network.download.

chrahunt avatar Jan 25 '20 03:01 chrahunt

Any updates on this? I was installing a huge package (specifically Tensorflow, 500+ MB), and for some reason pip was killed at 99% of the download... Re-running the command started the download from 0 again...

johny65 avatar May 14 '20 19:05 johny65

@johny65 No updates.

Folks are welcome to contribute this functionality to pip. As noted by @chrahunt, there's a clear part of the codebase for these changes to be made in. :)

pradyunsg avatar May 14 '20 19:05 pradyunsg

I have a few questions about the design for this enhancement. First, why (or how) does this happen?

When I have poor internet connection (the network is cut unexpectedly) [...] When I retry the pip install, it would stop at the midpoint and give me the same md5 error.

My guess would be that back then wheels were stored directly in the cache dir instead of being downloaded to a temporary location, as is done now. If so, the hashing error should already be solved.

However, because the wheel being downloaded now lives in a directory that will be cleaned up afterward, do we want to expose that mechanism as configurable (e.g. pip install --wheel-dir=<user-assigned path> <packages>), or do we want to point people with poor connections to pip download -d <user-assigned path> <packages> followed by pip install? Personally I prefer the latter approach, where we'd need to make pip download write directly to the specified dir, and I'm not sure whether doing that would break any existing use case.

McSinyx avatar May 15 '20 09:05 McSinyx

Any updates on this? I was installing a huge package (specifically Tensorflow, 500+ MB), and for some reason pip was killed at 99% of the download... Re-running the command started the download from 0 again...

Same with pytorch, which was 1 GB in size. A whole day's quota got exhausted with nothing to show for it.

ShashankAW avatar Oct 22 '20 04:10 ShashankAW

FWIW, you can always curl manually (applying whatever resume logic you need and checking integrity yourself) and pip install the downloaded file instead.

uranusjr avatar Oct 22 '20 05:10 uranusjr

Folks are welcome to contribute this functionality to pip.

pradyunsg avatar Dec 01 '20 00:12 pradyunsg

Folks are welcome to contribute this functionality to pip.

I'd like to give this a try and created a proof of concept PR here: https://github.com/pypa/pip/pull/11180.

I'm not quite sure what the command-line options for this feature should look like. I imagine we will need new options to turn the feature on/off and to limit the number of retries (this is different from the existing --retries switch). So maybe --resume-incomplete-download to opt in and --resume-attempts to set the limit?

yichi-yang avatar Jun 10 '22 22:06 yichi-yang

If this gets implemented, I would want it to be enabled by default, and to fall back automatically to the previous behaviour if resuming is not successful (e.g. if the server does not support resuming). This matches the behaviour of normal download clients, e.g. web browsers.

uranusjr avatar Jun 11 '22 07:06 uranusjr
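The fallback behaviour described above can be sketched as a small, hypothetical helper (illustrative only, not pip's code): a 206 Partial Content response extends the saved partial file, while any other success response means the server re-sent the whole body.

```python
def merge_response(partial: bytes, status: int, body: bytes) -> bytes:
    """Combine a saved partial download with a new response body.

    206 means the server honoured our Range request, so `body` is the
    remainder; anything else (e.g. a plain 200) means the server ignored
    the Range header and re-sent the file from byte 0.
    """
    if status == 206:
        return partial + body
    return body  # fall back to the full download transparently
```

For example, `merge_response(b"abcd", 206, b"efgh")` yields the complete payload, while a 200 response simply replaces whatever was saved.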

If this gets implemented, I would want it to be enabled by default, and to fall back automatically to the previous behaviour if resuming is not successful (e.g. if the server does not support resuming). This matches the behaviour of normal download clients, e.g. web browsers.

How about the number of attempts? Should we keep making new requests as long as the responses have a successful status code (e.g. 200) and non-empty bodies (i.e. some progress is made on each request)?

yichi-yang avatar Jun 11 '22 18:06 yichi-yang
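One way to frame the "retry while progress is made" policy asked about above, as an illustrative sketch (the function names and stall limit are made up, not pip's API):

```python
def download_with_progress_retries(fetch_from, total_size, max_stalls=3):
    """Keep requesting the remainder of a file as long as attempts make
    progress; give up only after `max_stalls` consecutive empty responses.

    `fetch_from(offset)` stands in for an HTTP Range request returning
    whatever bytes arrived before the connection dropped.
    """
    data = b""
    stalls = 0
    while len(data) < total_size:
        chunk = fetch_from(len(data))
        if chunk:
            data += chunk
            stalls = 0  # progress was made; reset the stall counter
        else:
            stalls += 1
            if stalls >= max_stalls:
                raise OSError(f"download stalled after {max_stalls} empty attempts")
    return data
```

This caps only *unproductive* attempts, so a flaky link that delivers a little data each time eventually finishes without tripping the limit.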

Instead of trying to guess how many attempts is reasonable, perhaps pip should store the incomplete download somewhere (e.g. in cache?) and resume it on the next pip install. This also better matches browser behaviour—the download is not re-attempted automatically, but the user can click a button to resume.

uranusjr avatar Jun 12 '22 06:06 uranusjr
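The "store the incomplete download and resume on the next run" idea could look something like the following sketch, where the `.part` suffix and cache layout are purely hypothetical conventions:

```python
import os

def partial_path(cache_dir: str, filename: str) -> str:
    # Hypothetical convention: partial downloads live beside the final
    # name with a ".part" suffix inside pip's cache directory.
    return os.path.join(cache_dir, filename + ".part")

def resume_offset(cache_dir: str, filename: str) -> int:
    """Byte offset to resume from on a later `pip install` run."""
    try:
        return os.path.getsize(partial_path(cache_dir, filename))
    except OSError:
        return 0  # no partial file saved: start from scratch
```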

If-Unmodified-Since should ensure it's the same file and safe to resume. https://gist.github.com/CTimmerman/ccf884f8c8dcc284588f1811ed99be6c

CTimmerman avatar Jun 12 '22 12:06 CTimmerman
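Concretely, the headers for such a conditional resume might be built like this (a sketch; note that `If-Range` is HTTP's purpose-built alternative to pairing `Range` with `If-Unmodified-Since`):

```python
from email.utils import formatdate

def resume_request_headers(start: int, last_modified_ts: float) -> dict:
    """Ask the server for bytes from `start` onward, but only if the file
    hasn't changed since `last_modified_ts` (a Unix timestamp)."""
    return {
        "Range": f"bytes={start}-",
        "If-Unmodified-Since": formatdate(last_modified_ts, usegmt=True),
    }
```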

Instead of trying to guess how many attempts is reasonable, perhaps pip should store the incomplete download somewhere (e.g. in cache?) and resume it on the next pip install. This also better matches browser behaviour—the download is not re-attempted automatically, but the user can click a button to resume.

Currently pip uses CacheControl to handle HTTP caching, but it doesn't cache responses with incomplete bodies (or Range requests with status code 206), so it doesn't help in our case (an incomplete download). It seems to me that implementing a cache independent of the existing HTTP and wheel caches, for the sole purpose of resuming failed downloads, would be a lot of work.

Also, I'm not sure the browser behavior is desirable in this case. With large wheels (e.g. pytorch, > 2 GB) and my crappy Internet, a download consistently fails 4-5 times before completing. If users are installing many large packages (e.g. from a requirements.txt), having to manually resume multiple times would be annoying. That's why I think opt-in might work better: in most cases resuming is not needed, but when it is, we can present a warning informing the user that 1) the download is incomplete, and 2) they can use a command-line option to automatically resume the download next time.

yichi-yang avatar Jun 12 '22 18:06 yichi-yang

One caveat with trying to mimic the browser is that, unlike the browser's UI, which lets the user cancel / pause / resume any specific download, pip doesn't have such a rich user interface via the CLI.

We'd need to, at least, provide one knob for this resuming behaviour -- either to opt-in or opt-out. I think when you're not in "resume my downloads" mode, pip should also clean up any existing incomplete downloads.

That said, picking between opt-in and opt-out is not really a blocker to implementing either behaviour. It's a matter of changing a flag's default value in the PR (let's use a flag with values like --incomplete-downloads=resume/discard for handling this), which is easy enough. :)

pradyunsg avatar Jun 12 '22 20:06 pradyunsg

I think my PR https://github.com/pypa/pip/pull/11180 is ready for a first round of review. Suggestions for more meaningful flag names, log messages, and exception messages are welcome.

yichi-yang avatar Jul 17 '22 04:07 yichi-yang

Having the same problem downloading pytorch + open-cv for a Streamlit project for the third time today (connection lost after 6 hours...), I wonder if making pip able to use an external downloader could be a thing? yt-dl provides:

    --external-downloader COMMAND        Use the specified external downloader.
                                         Currently supports aria2c,avconv,axel,c
                                         url,ffmpeg,httpie,wget
    --external-downloader-args ARGS      Give these arguments to the external
                                         downloader

Something like pip install --external-downloader wget --external-downloader-args '-r' requirements.txt?

Rom1deTroyes avatar Oct 21 '22 21:10 Rom1deTroyes

Having the same problem downloading pytorch + open-cv for a Streamlit project for the third time today (connection lost after 6 hours...), I wonder if making pip able to use an external downloader could be a thing? yt-dl provides:

    --external-downloader COMMAND        Use the specified external downloader.
                                         Currently supports aria2c,avconv,axel,c
                                         url,ffmpeg,httpie,wget
    --external-downloader-args ARGS      Give these arguments to the external
                                         downloader

Something like pip install --external-downloader wget --external-downloader-args '-r' requirements.txt?

Which of those also work on Windows? Resuming HTTP downloads is simple, as evidenced by the PR at https://github.com/pypa/pip/pull/11180, which is fine by me, but I feel it's such a basic feature that it should be supported upstream.

CTimmerman avatar Oct 22 '22 10:10 CTimmerman

We're not going to be using an external programme for network interaction within pip. This should be implemented as logic within pip itself.

pradyunsg avatar Oct 22 '22 16:10 pradyunsg

What's the progress on this feature? It's annoying trying to install packages like tensorflow and pytorch and then getting errors when the downloads are almost complete.

Nneji123 avatar Jan 25 '23 07:01 Nneji123

What's the progress on this feature? It's annoying trying to install packages like tensorflow and pytorch and then getting errors when the downloads are almost complete.

I have a proof-of-concept PR here: https://github.com/pypa/pip/pull/11180. It's been a while since I last worked on it, and there has been some discussion about the user interface that I haven't incorporated into the PR.

Personally I feel like the major problems are:

  1. We need to decide whether this is better fixed upstream (though I think parts of the resume logic will have to be handled by pip in either case).
  2. What user interface should we use?

I think it would be nice to have some input from the maintainers, e.g. on priorities, expectations, etc.

yichi-yang avatar Jan 25 '23 07:01 yichi-yang

By upstream do you mean requests? As for which UX to use, I don't think anyone really expressed strong opinions, but only pointed out things the end product needs to handle. So the best approach to drive this forward would be to implement what you feel is best and see what people think of it.

uranusjr avatar Jan 31 '23 17:01 uranusjr

By upstream do you mean requests? As for which UX to use, I don't think anyone really expressed strong opinions, but only pointed out things the end product needs to handle. So the best approach to drive this forward would be to implement what you feel is best and see what people think of it.

Sounds good. I'll update that PR when I get time (been busy lately). By upstream I'm referring to the issue that requests doesn't enforce a content-length check: https://github.com/psf/requests/issues/4956.

yichi-yang avatar Jan 31 '23 18:01 yichi-yang

It's 2024 and there's still no resume for large packages. The connection gets closed by the server and I have to start numpy and pyspark over and over again. Resume would save a lot of resources, since pip retrieves the same stream all over again. I'm sorry that I'm not versed enough to write it myself, but it is necessary.

nbkgit avatar Apr 20 '24 23:04 nbkgit

It's 2024 and there's still no resume for large packages. The connection gets closed by the server and I have to start numpy and pyspark over and over again. Resume would save a lot of resources, since pip retrieves the same stream all over again. I'm sorry that I'm not versed enough to write it myself, but it is necessary.

Yes, very necessary.

mrlectus avatar May 03 '24 10:05 mrlectus