PERF: download and compute hashes in chunks of 1MB, did you know the progress bar was 30% of the runtime!
Hello, it's me again
I'm offering you 2 more improvements in this PR.
First fix: download the file in chunks of 1 MB. Things to know: requests is broken here; depending on which function you use, it can download in chunks of 1 byte or 10240 bytes by default. They have had tickets about that open for years, but nobody has fixed them. pip was setting the chunk size to CONTENT_CHUNK_SIZE=10240 from requests, which is the bad constant. You don't want to do that ⚠️
Special case of pip: pip updates a progress bar after each chunk is downloaded. That is an insane number of progress bar updates, which can take as much as 30% of the runtime for a large package :D
The PR is downloading in chunks of 1 MB. That's a reasonable size for I/O operations.
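For illustration, the change boils down to something like this (a minimal sketch using plain requests; the actual pip code wraps this in its own download machinery):

```python
import requests

CHUNK_SIZE = 1024 * 1024  # 1 MiB, instead of requests' CONTENT_CHUNK_SIZE (10240)

def download(url: str, dest: str) -> None:
    # stream=True keeps the body out of memory until we iterate over it
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as f:
            # iter_content yields chunks of up to CHUNK_SIZE bytes
            for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                f.write(chunk)
                # pip also advances its progress bar here, once per chunk,
                # so the chunk size directly drives the number of UI updates
```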
Profiling on: pip install --prefix /var/username/deleteme tensorflow-cpu --no-deps --no-cache-dir --dry-run
That takes 3 seconds to run main().
MASTER BRANCH: (profiler output omitted)
FIX BRANCH: (profiler output omitted)
Notice: the tensorflow wheel is 207 MB. It was updating the progress bar 20237 times, i.e. once every 10240 bytes (207 MB / 10240 B ≈ 20,000 updates). Notice the same number of calls to read(), fp_read(), stream read(), etc.
Second fix: after the wheel is downloaded, pip reads the file back in blocks of io.DEFAULT_BUFFER_SIZE to compute its hashes. io.DEFAULT_BUFFER_SIZE is an obsolete constant that was set to 8 kB forever ago; you don't want to use it. By the way, I have some tickets and PRs open on the Python interpreter to fix that constant, but I don't know if we will ever get to merging them: https://github.com/python/cpython/issues/117151
This one doesn't have much impact, thankfully: the downloaded file should be in the read cache because it was just written, and for me it is written to /tmp, which is a ramdisk. So it makes little difference on my machine, but that really depends on what type of OS and disks you have.
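The hashing side of the fix amounts to reading the file with a larger explicit buffer; a minimal sketch (the function name here is illustrative, not pip's actual helper):

```python
import hashlib

FILE_CHUNK_SIZE = 1024 * 1024  # 1 MiB instead of io.DEFAULT_BUFFER_SIZE (8 KiB)

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() calls f.read(FILE_CHUNK_SIZE) until it returns b"" at EOF
        for block in iter(lambda: f.read(FILE_CHUNK_SIZE), b""):
            h.update(block)
    return h.hexdigest()
```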
Cheers.
Hello, all green and all comments reviewed, should be good to merge now
I've replaced 1024*1024 with a constant as requested.
Two constants actually, because the network chunks and file chunks don't necessarily need to be the same size, and we don't necessarily want these files to import each other just for one constant.
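Roughly like this (the names here are illustrative; see the diff for the actual ones):

```python
# network module: size of each chunk read from the HTTP response
DOWNLOAD_CHUNK_SIZE = 1024 * 1024

# hashing module: size of each block read back from disk
FILE_CHUNK_SIZE = 1024 * 1024
```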
Pinging back to the bug tickets in requests: they've had issues open for nearly 10 years to stop using a small chunk size, but it was never fixed ^^ https://github.com/psf/requests/issues/3186 https://github.com/psf/requests/issues/844
One quick check on an older version of pip with an older Python: downloading is nearly 3 times faster =)
pip version 21.0.1 on python 3.8

Default chunk size (10 kiB): time pip download torch --no-deps --no-cache ...
|████████████████████████████████| 1801.8 MB 65.9 MB/s
Saved ./torch-1.13.1+cu117-cp38-cp38-linux_x86_64.whl
Successfully downloaded torch
real 0m30.120s
user 0m16.202s
sys 0m13.398s

Larger chunk size (1 MiB): time pip download torch --no-deps --no-cache ...
|████████████████████████████████| 1801.8 MB 240.4 MB/s
Saved ./torch-1.13.1+cu117-cp38-cp38-linux_x86_64.whl
Successfully downloaded torch
real 0m15.173s
user 0m5.931s
sys 0m6.345s
FWIW I agree with @ichard26; there have been complaints before from users on slow or unreliable connections that pip isn't always friendly.
With this PR there is an awkward middle ground of file sizes (between 512 KB and 10 MB) that will complete extremely fast and not be noticeable for people on high-bandwidth connections, but may appear stuck for a user whose connection is slow enough that the download takes a noticeable amount of time.
I assume the chunk size is fixed, and pip can't determine ahead of time how fast or reliable a connection is. So @ichard26's approach seems a good compromise to me (without debating the specific numbers, which could always be tweaked).
I've been discussing my review with others (in the PyPA Discord) and Ethan made a suggestion that involves dynamically updating the chunk size:
> [W]ould it be possible to do an adaptive download chunk size based on the time it took to download the last chunk? So you start small and increase the download chunk size until the last chunk was "slow" by some metric?
TBH it sounds even more complicated and I don't think I'd want to maintain such logic, but it could be a valid approach.
If the chunk size were dynamic (which it doesn't look like it is?), I think it would make sense to start at 8 kB and keep doubling until some time threshold is exceeded (e.g. 1 second) or some maximum value is reached (e.g. 1 MB).
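A minimal sketch of that doubling idea (purely hypothetical, not what this PR implements):

```python
import time

MIN_CHUNK = 8 * 1024       # start at 8 kB
MAX_CHUNK = 1024 * 1024    # cap at 1 MB
SLOW_SECS = 1.0            # a read slower than this stops the growth

def adaptive_chunks(raw):
    """Yield chunks from a file-like object, doubling the read size
    until a read becomes slow or the cap is reached."""
    size = MIN_CHUNK
    while True:
        start = time.monotonic()
        chunk = raw.read(size)
        if not chunk:
            return
        yield chunk
        if size < MAX_CHUNK and time.monotonic() - start < SLOW_SECS:
            size = min(size * 2, MAX_CHUNK)
```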
Hello,
I've made adjustments. We can now reach ~450 MB/s download speed internally, up from ~60 MB/s before this PR.
- Chunks are set to 256 kB. That's sufficient to show regular progress on the slowest connections, without being detrimental to performance.
Do not use smaller chunks: that hurts performance, both waiting on I/O operations from the device and running more interpreter iterations (note that Python 3.11+ brought significant improvements in interpreter speed).
Most devices, especially HDDs or USB devices, would benefit from a larger block size (1+ MB), but the difference is not necessarily significant. (I see some inefficient copies in the SSL code to reconstruct large packets, and they'd fall outside the CPU cache, which could allow microsecond-level optimizations, but this is outside the scope of this patch.)
- The refresh was limited to 5 times per second. It helps a lot. Thanks for pointing this out @ichard26
(It's funny: in testing, the UI "feels" more responsive the more often it flashes (I can force refresh 60 times a second and it's very flashy!), but that doesn't make anything faster. Users won't run 3 windows of pip install with 1, 5 and 60 refreshes a second side by side, and they won't tell the difference.)
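Conceptually, the rate limiting looks like this (a hand-rolled sketch for illustration; pip's actual progress bar has its own refresh control):

```python
import time

REFRESH_INTERVAL = 1 / 5  # redraw at most 5 times per second

class ThrottledBar:
    def __init__(self, render):
        self._render = render  # callable that actually redraws the bar
        self._last = 0.0

    def update(self, downloaded: int) -> None:
        now = time.monotonic()
        # Skip redraws arriving less than 200 ms after the previous one;
        # the byte counter still advances, only the drawing is skipped.
        if now - self._last >= REFRESH_INTERVAL:
            self._render(downloaded)
            self._last = now
```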
Last I checked 3 years ago, we were talking ~40% of the UK with <8 Mbps connections, including ~10% with ~2 Mbps or below. I myself lived with a 2 Mbps connection for many years.
Downloading a movie takes a whole hour (tensorflow and torch-cpu are in that size range depending on platform). Downloading a game on Steam can take 3 days. There is no improvement to the progress bar that can make the experience less terrible; users just have to wait :D You guys don't need to worry about showing progress so much. Having any form of progress bar is good enough, really.
> Last I checked 3 years ago, we were talking ~40% of the UK with <8 Mbps connections, including ~10% with ~2 Mbps or below. I myself lived with a 2 Mbps connection for many years.
pip can be used not only with PyPI, but also with a local repository or proxy repository (e.g. JFrog Artifactory, Sonatype Nexus). Downloading packages within an internal company network can be much faster. I'd prefer instantaneous CI builds, if possible. What about checking if the terminal is a TTY, and if not (e.g. CI run, Docker build), using the maximum possible chunk size?
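That check is cheap; a sketch of the suggestion (hypothetical, not part of this PR):

```python
import sys

def pick_chunk_size() -> int:
    # Not a TTY (CI logs, docker build, piped output): nobody is
    # watching a live bar, so favor throughput with the biggest chunks.
    if not sys.stdout.isatty():
        return 1024 * 1024
    # Interactive terminal: smaller chunks keep the bar visibly moving.
    return 256 * 1024
```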
I see failing tests.
rebased on master and squashed all the commits. the next build should pass.
(repushing to trigger builds; there seems to be a flaky test on Windows)
Thanks, I've adjusted the progress bar to only render for packages > 512 kiB, up from 40 kiB. I think that's a reasonable cutoff.
A simple "pip install jupyter" is installing 60 packages, most are very small. I'm finding the pip output easier to read and follow, with less progress bars appearing and disappearing very fast.
read() returns a chunk of the requested size, except for the last chunk.
Do more enterprises have 2.5+ Gbps ethernet links to their internal network than I think...?
A lot of companies have employees working on a remote machine (VM, VDI, RDP, remote terminal, ssh, etc.). Hardware has been dual 10 Gbps for more than a decade; 40 Gbps is common nowadays.
There seem to be two tests in master that are flaky on Windows. Added in May.
FAILED tests/unit/test_utils_retry.py::test_retry_wait[0.015] - assert (669.765 - 669.75) >= 0.015
FAILED tests/unit/test_utils_retry.py::test_retry_time_limit[0.01-10] - assert 11 <= 10
EDIT: I see they are discussed in another PR: https://github.com/pypa/pip/pull/12839
(rebasing to pick up test fixes on main)
All green now that main branch has been fixed. Would you be able to merge the PR? @ichard26
final result: time pip download --dest /tmp/deleteme --no-cache tensorflow torch xgboost
a few large packages typical of machine learning use cases, 51 packages total.
it cuts the runtime almost in half.
download speed goes from ~100 MB/s to ~400 MB/s 🚀 🚀 🚀
on main:
real 0m37.430s
user 0m19.404s
sys 0m15.371s
with the patch:
real 0m23.292s
user 0m10.226s
sys 0m10.907s
> Thanks, I've adjusted the progress bar to only render for packages > 512 kiB, up from 40 kiB. I think that's a reasonable cutoff.
I'm not sure if this is the right call, but I don't feel strongly enough so I won't block the PR on this. We can see if anyone complains later.
I don't have the commit bit, so I can't merge anything. I'm a triager, not a maintainer, as I'm too new to the project :)
This seems like a good improvement, and it's easy enough to revert if it causes issues, so I'm going to merge it pre-emptively. @pradyunsg if you have concerns for 24.2, feel free to revert it.