pip icon indicating copy to clipboard operation
pip copied to clipboard

Add parallel download support to BatchDownloader

Open NeilBotelho opened this issue 2 years ago • 8 comments

This PR adds parallel download support to BatchDownloader and adds a cli option --parallel-downloads <n> to the install and download commands. If the option is not specified, or if set to 1, it doesn't change pip's behaviour or UI.

BatchDownloader is only called by _complete_partial_requirements after resolution is completed, so it wouldn't affect the resolution process as far as I can tell. It is used to download wheels where only metadata access was available for resolution. So as more packages adopt PEP 658 the number of wheels we would be able to download in parallel would also increase :)

Note

Importantly, this PR does not add UI/progress bars for parallel downloads, it just logs the Downloading <wheel-file-name> message and disables the progress bar. ~I am working on a separate PR that adds a progress bar for parallel downloads. These 2 PR's can then be combined. I don't expect this PR to be merged until that PR is opened.~ I have opened a separate PR to add UI support for parallel downloads

I've done it this way because the UI requires some discussion, due to limitations of how rich and even tqdm render parallel progress bars in Jupyter notebooks (see this issue). Currently the best way I can think to circumvent this issue is to add a flag (something like --jupyter) to be used when downloading in parallel in jupyter notebooks that will disable the progress bar only for the parallel downloads. But this isn't ideal.

Also, if someone could tell me where I should ideally initiate this sort of discussion(discourse/IRC/discord etc.) I would appreciate it :)

NeilBotelho avatar Nov 04 '23 19:11 NeilBotelho

Great work! Do you have a link to your other PR for the progress bar change?

ofek avatar Nov 27 '23 15:11 ofek

Great work! Do you have a link to your other PR for the progress bar change?

Yep here it is: https://github.com/pypa/pip/pull/12404

NeilBotelho avatar Nov 27 '23 17:11 NeilBotelho

@NeilBotelho Thanks for filing these PRs! They're on my personal radar to review, but I'm unlikely to have the bandwidth to review pip PRs in the coming weeks and will likely only look at this early next year.

One of the other maintainers might have bandwidth to look at this before then, of course. I'm not going to volunteer someone else but just erring on the side of communicating my (lack of) availability so that the lack of a response isn't misjudged to be a lack of interest! :)

pradyunsg avatar Nov 27 '23 17:11 pradyunsg

@pradyunsg no issues! I totally undestand, and I appreciate the transparency. I'm just happy to finally be contributing back to pip :)

NeilBotelho avatar Nov 27 '23 17:11 NeilBotelho

I'll also note that I have a number of personal commitments right now that mean I'll be unable to review this in the near future. Sorry!

pfmoore avatar Nov 27 '23 19:11 pfmoore

This PR is very exciting for the speed-up it could offer pip in multi-core systems! I'll keep an eye on it

mkleinbort-ic avatar Feb 13 '24 14:02 mkleinbort-ic

I wonder how this should be unit-tested.

uranusjr avatar Mar 12 '24 20:03 uranusjr

I wonder how this should be unit-tested.

One option to test this would be to unit test the _download_parallel method of BatchDownloader by adding Links for 2 packages. I don't know if this is best practise though.

I do see some examples of packages being downloaded in tests/functional/test_install_extras.py, so maybe a functional test of pip install end to end with the --parallel-downloads option passed might be better

NeilBotelho avatar Mar 25 '24 12:03 NeilBotelho

(I am not a pip maintainer but trying to do some optimizations in pip now)

Hello,

Before doing parallel download, you will need this PR to be merged first to fix/optimize the download code https://github.com/pypa/pip/pull/12810/files

turns out, the code that's doing the download was downloading in small 10k bytes chunk and upgrading the progress bar after every chunk. as much as 30% of the download time is actually just wasted to re-render the progress bar ^^

If you see performance improvements with this PR by adding parallel downloads, I'm sorry to say this might be in large part because it removed the progress bar.

I'm afraid parallelization of downloads is unlikely to give huge performance improvements until the linked PR to fix download is merged, otherwise most of the time is taken in pip python code that's holding the GIL. (I think the download in openssl/urllib and the hash calculation should free the GIL and allow other treads to run, unfortunately they only get chunks of 10k and 8k respectively before getting back to python code)

Cheers.

morotti avatar Jul 03 '24 13:07 morotti

(not a pip maintainer)

my performance improvements to download code have been merged in 24.2. I see download going from ~60 to ~460 MB/s on internal network, mostly outsourcing to C code in requests/ssl that does not hold the GIL. that should leave room for concurrent threads to run.

now would be a good time to rebase and bring this PR back to life.

on the code review, you will want to remove the 2 functions _sequential_download() + _download_parallel() IMO there should be a single function. sequential download is simply an effect of using workers=1. that way, there is a single code path to write and to maintain, which will be tested by all the tests.

morotti avatar Aug 15 '24 17:08 morotti

I haven't had a chance to work on this in a while and I see that someone recently opened a PR to address this same issue, so I'll close this in favour of it.

NeilBotelho avatar Aug 22 '24 16:08 NeilBotelho