
Parallel downloads

Open cool-RR opened this issue 12 years ago • 51 comments

How about having pip download all the packages in parallel instead of waiting for each one to finish before downloading the next?

After that is implemented, how about having pip start installing one package while it's downloading the next ones?

cool-RR avatar Mar 04 '13 20:03 cool-RR

Does anyone care?

cool-RR avatar Jan 20 '14 14:01 cool-RR

Yes we do care. However it's largely a problem of there being bigger issues to tackle first. Sorry that nobody responded to your ticket though!

dstufft avatar Jan 20 '14 17:01 dstufft

If we're looking at parallelizing just the download part, is it much more complex than sticking a concurrent.futures.ThreadPoolExecutor on it and adding a configuration option to turn it off?
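The ThreadPoolExecutor idea above could be sketched roughly like this (fetch() is a hypothetical stand-in for pip's per-URL download routine, not real pip code):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical download routine; a real one would perform an HTTP GET
# and write the file to pip's download directory.
def fetch(url):
    return f"downloaded {url}"

urls = ["https://example.com/a.whl", "https://example.com/b.whl"]

# Fan the downloads out over a small thread pool; map() preserves the
# input order of results, so later steps stay deterministic.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))
```

A `--no-parallel` style option, as suggested, would just set `max_workers=1`.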


cool-RR avatar Jan 20 '14 17:01 cool-RR

I disagree. Most downloads are far slower than a disk can write, even when writing many files. As an optional --feature, I think it is a trivial choice: people who know they are downloading over a LAN onto a slow disk would choose not to --enable such a feature, while people like myself, downloading several-hundred-megabyte packages over broadband, would gain much by choosing to use it.

jquast avatar Jun 23 '14 20:06 jquast

+1 to parallel download (and install if possible).

sholsapp avatar Nov 18 '14 23:11 sholsapp

+1. I frequently work on projects that have hundreds of dependencies and are built from scratch repeatedly throughout the average workday. This stuff crawls even with a PyPI mirror on the LAN, so having this would be great! The options presented at http://stackoverflow.com/questions/11021130/parallel-pip-install are not great, but the question does show there is (and has been) interest in a solution to this issue.

mattvonrocketstein avatar Oct 01 '15 20:10 mattvonrocketstein

+1, this would be huge. Willing to take a crack at implementing.

jamesob avatar Oct 20 '15 00:10 jamesob

If anyone is curious why this should be a core feature and not something an external tool provides, I can summarize the problem. Using something like "xargs --max-args=1 --max-procs=4 sudo pip install < requires.txt" can actually increase the total number of downloads quite a bit. This happens when, for example, "django-foo" and "django-bar" both require Django but specify different constraints on which version is acceptable. I think pip itself has to compute the unified requirement set (probably including requirements-of-requirements) to avoid redundant downloads.
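To illustrate the unified-requirement idea, here is a toy sketch (not pip code; version constraints are simplified to (min_inclusive, max_exclusive) tuples, and the package names are hypothetical):

```python
# Simplified constraints from two dependents of Django.
foo_needs = ((1, 5), None)        # django-foo: Django >= 1.5
bar_needs = ((1, 4), (1, 7))      # django-bar: Django >= 1.4, < 1.7

def merge(a, b):
    # Intersect two (lo, hi) version ranges; hi of None means unbounded.
    lo = max(a[0], b[0])
    his = [h for h in (a[1], b[1]) if h is not None]
    hi = min(his) if his else None
    return (lo, hi)

# One download of any Django in [1.5, 1.7) now satisfies both packages,
# whereas naive parallel pip invocations might fetch two different versions.
unified = merge(foo_needs, bar_needs)
```

Real specifiers are richer than ranges (exclusions, pre-releases), which is part of why pip has to own this computation.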

mattvonrocketstein avatar Oct 20 '15 01:10 mattvonrocketstein

Do knowledgeable people know if parallel install is an option? When I tried implementing a build system that executed pip in parallel, I hit problems specifically related to installing modules that had native extensions, but didn't dig further into it. If people know what might be wrong there, I'd love to know.

sholsapp avatar Oct 20 '15 15:10 sholsapp

+1 to parallelization, not just for downloading but also for things like "Collecting ... File was already downloaded" with pip wheel. It is painfully slow for projects with a huge dependency list and multiple hosts; it sometimes becomes one of the deployment bottlenecks. I had to add a cache layer to our project: run it only if requirements.txt has changed.

fillest avatar Apr 27 '16 08:04 fillest

Note that enabling complete "parallelism" amounts to a relatively complete refactoring of many internals. This needs an idea for a starting point, and quite a few refactoring steps over time.

There are also UX issues: while downloading/unpacking is simple, wheel generation and setup.py invocation are full of potential error conditions (however, they are needed until everybody uses wheels whenever possible, and they happen right in the middle of execution).

RonnyPfannschmidt avatar Apr 27 '16 08:04 RonnyPfannschmidt

So if the UX complication is sidestepped by doing only parallel downloads/unpacking, doesn't it make sense for that to be the first goal? Full parallelism does seem like a huge undertaking. I'm no expert, but as far as that implementation goes, here are two ideas that came to mind:

a) In the set of all requirements and requirements-of-requirements, discover the subsets wherein individual requirements must be installed sequentially. All such subsets can then be installed in parallel, at least. As far as UX goes, errors here can be shown whenever they occur, because any error encountered is truly an error.

b) In the set of all requirements and requirements-of-requirements, a parallel worker pops something out and tries installing it. If the req installs successfully, great. If it fails, we assume another requirement must be installed first and put it back in. As far as UX goes, errors here must be suppressed unless they persist until the end of this procedure, because any error encountered might be something we can recover from later in the process.
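Idea (b) could be sketched like this (single-threaded for clarity; install() is a hypothetical stand-in for pip's install step, and the retry loop is a simplification of the "put it back in" queue):

```python
from queue import Queue

def install_with_retries(reqs, install, max_passes=3):
    # Assume a failure means "a prerequisite isn't installed yet":
    # suppress the error and retry on a later pass.
    pending = Queue()
    for r in reqs:
        pending.put(r)
    for _ in range(max_passes):
        failures = []
        while not pending.empty():
            req = pending.get()
            try:
                install(req)
            except Exception:
                failures.append(req)   # suppressed for now; retried below
        if not failures:
            return True                # everything installed
        for r in failures:
            pending.put(r)
    return False   # errors persisted to the end; surface them to the user now
```

A real implementation would run several workers against the queue and distinguish "prerequisite missing" from genuine build failures.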

mattvonrocketstein avatar Apr 27 '16 17:04 mattvonrocketstein

The UX issue cannot be side-stepped if any of the downloaded packages is an sdist.

The requirement graph is nonlinear and changes whenever more about the dependencies and constraints of a freshly downloaded package becomes known.

As soon as an sdist is downloaded, its build/egg-info/wheel process has to be triggered.

RonnyPfannschmidt avatar Apr 28 '16 06:04 RonnyPfannschmidt

Here's what I think...

To be able to have parallel downloads, pip would need to determine which packages are to be installed and where they need to be fetched from. As of today, that is not possible, since pip cannot determine the dependencies of a package without downloading it. ~While there's a bunch of interesting p~

That said, this is definitely something that'll be pretty awesome to have. :)

pradyunsg avatar May 16 '17 20:05 pradyunsg

I understand that you won't know follow-up dependencies before downloading a package. But looking at a requirements.txt, I would assume you could start a parallel download of all of those packages and then expand the list of things to download as you discover more dependencies. The only edge case I can think of right now would be different version requirements on the same dependency. But that problem exists with regular downloading as well I would assume.
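The expanding-frontier idea above could be sketched like so (deps_of() is a hypothetical function returning a package's dependencies, known only once that package has been downloaded):

```python
from concurrent.futures import ThreadPoolExecutor

def download_closure(roots, deps_of, workers=4):
    # Start from requirements.txt (roots), download each "wave" in
    # parallel, then expand the frontier with newly discovered deps.
    seen = set(roots)
    frontier = list(roots)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            nxt = []
            for found in pool.map(deps_of, frontier):
                for dep in found:
                    if dep not in seen:   # dedupe across the whole graph
                        seen.add(dep)
                        nxt.append(dep)
            frontier = nxt
    return seen
```

This handles discovery, but not the conflicting-specifier problem raised in the surrounding discussion.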

fruechel avatar May 16 '17 22:05 fruechel

I didn't mean to post that message yet. Oops.

Anyway, the point I was making was that with the way pip currently handles dependencies, there's a race condition when 2 packages depend on a common third one with compatible but different version specifiers. Whichever package is downloaded first, its specifier would be used, which is a bad thing: you get behaviour that changes depending on how the network behaved.

The only right way to do this, then, is to have dependency metadata on PyPI; only then can we determine the packages beforehand and proceed to parallel download/installation. Or somehow manage this during downloading?

Or I'm missing something about this issue.

pradyunsg avatar May 17 '17 02:05 pradyunsg

@fruechel Yes. Except with the serial downloads, the version of the common dependency is deterministic.

pradyunsg avatar May 17 '17 02:05 pradyunsg

Understood. You're right, it would make the process non-deterministic, based on random network behaviour. So yeah until that issue is resolved, implementing this would introduce problematic behaviour that you wouldn't want to build in.

fruechel avatar May 17 '17 06:05 fruechel

there's a race condition where 2 packages depend on a common third with compatible but different version specifiers. Whichever package is downloaded first, its specifier would be used

This problem would seem to be limited to parallel installation. If we're talking strictly about parallel downloads followed by standard, serial installs, you might experience an extra, useless download, but I'm not clear on why it should introduce anything nondeterministic.

I see a few potentially different scenarios for this improvement, and maybe it's useful to avoid conflating them:

  • parallel requirements downloads (single-level),
  • parallel requirements downloads (nested requirements),
  • parallel installation of requirements (as individual downloads are completed)
  • parallel installation of requirements (in some second pass, after all downloads are complete)

Of course having several of these things would be awesome, but any could be an improvement. In this thread people have raised at least 3 separate blockers from what I can tell:

  • missing requirements metadata
  • prerequisite but large-scale refactors of existing code
  • introducing nondeterminism

I'm less clear on which blockers affect which scenarios.

mattvonrocketstein avatar May 17 '17 08:05 mattvonrocketstein

Hmmm, this is where a class-level dict keyed by the names of downloaded dependencies would help, since it could address the issue of extra downloads. The install system could then look at that dict, which is only populated at run time, to install all the packages in it. Each entry in the dict could itself hold a class that stores the information needed by pip's install mechanism. I think this could theoretically be used for parallel installs.
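A minimal sketch of that dict, assuming a hypothetical record_download() hook called whenever a dependency finishes downloading (the "info" payload stands in for whatever the install mechanism needs later):

```python
# Runtime-populated mapping: package name -> install info.
downloaded = {}

def record_download(name, info):
    if name in downloaded:
        return False              # already fetched; skip the extra download
    downloaded[name] = info       # version, local path, etc.
    return True

record_download("Django", {"version": "1.6.0", "path": "/tmp/Django-1.6.0.whl"})
record_download("Django", {"version": "1.6.0", "path": "/tmp/Django-1.6.0.whl"})  # deduplicated
```

The installer would then walk `downloaded` once all fetches complete.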

AraHaan avatar May 17 '17 13:05 AraHaan

What is probably most feasible is to download all requirements in parallel until reaching a source distribution, and then download in serial. Something like:

wg = WaitGroup()
for req in reqs:
    if isinstance(req, Wheel):
        wg.add(1)
        req.download_async(on_done=wg.done)  # wheel: fetch in background
    else:
        wg.wait()        # drain pending wheel downloads first
        req.download()   # sdist: must be handled serially
wg.wait()

ghost avatar Jul 31 '17 18:07 ghost

I've labelled this issue as "deferred till PR".

This label essentially indicates that further discussion related to this issue should be deferred until someone comes around to make a PR. This does not mean that the said PR would be accepted - ~it has not been determined whether this is a useful change to pip and that decision has been deferred until the PR is made.~

pradyunsg avatar Aug 20 '17 06:08 pradyunsg

That said, I think it's pretty clear that this would be a welcome improvement. :)

pradyunsg avatar Aug 20 '17 06:08 pradyunsg

I've written something for Python 3 using asyncio for parallel downloads and installation of wheel packages: wi.

The tool can be used with pip as a fallback, i.e., when there are no wheels available. My goal is to somehow get that code into pip, or at least to have it serve as motivation to move this ticket forward, since the performance gain is really noticeable.
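The asyncio approach described above boils down to something like this (fetch() is a hypothetical coroutine standing in for a real HTTP download of a wheel; not code from the tool itself):

```python
import asyncio

async def fetch(name):
    # Simulate network latency; a real version would stream the wheel
    # to disk with an async HTTP client.
    await asyncio.sleep(0.01)
    return f"{name}.whl"

async def download_all(names):
    # All downloads run concurrently on one event loop; gather()
    # preserves input order, keeping later steps deterministic.
    return await asyncio.gather(*(fetch(n) for n in names))

files = asyncio.run(download_all(["requests", "flask", "django"]))
```

Because asyncio multiplexes I/O on a single thread, this avoids the locking questions a thread pool raises.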

chromano avatar Dec 20 '17 10:12 chromano

@samuelcolvin expressed some interest in looking into this issue, over at pypa/packaging-problems#261.

pradyunsg avatar May 27 '19 00:05 pradyunsg

@chromano Why not make your package also work on packages that are not wheels, contain C extensions, or are written completely in pure Python?

And then have pip use that for parallel downloads and installs?

And for the case of slow HDD writes, fall back to downloading into a memory buffer, then use a scheduled I/O queue (which is run on another thread and guarded by a lock that is checked before writing). For the lock I usually use an atomic flag and do something like (in C):

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool locked = false;

void doSomeFileIO(const char *lpData, const char *lpFileName)
{
  // write the data to file.
}

// Somewhere, call the function below using some sort of threading library in a
// new thread when needing to schedule file writes. An atomic compare-exchange
// is used rather than a plain bool so the check-then-set cannot race between
// threads; in C the function is passed to the threading library as a pointer.
bool expected = false;
if (atomic_compare_exchange_strong(&locked, &expected, true))
{
  doSomeFileIO(data, filename);
  atomic_store(&locked, false);   // release the flag once the write is done
}

AraHaan avatar Jun 03 '19 15:06 AraHaan

@AraHaan pip can't vendor a package with C dependencies for a bunch of reasons. And pip can't depend on a package that's not vendored in it for a different bunch of reasons (keyring is a special snowflake).

I don't have the bandwidth to explain more right now. It's 2am.

pradyunsg avatar Jun 03 '19 20:06 pradyunsg

can't vendor a package with C dependencies for a bunch of reasons. And pip can't depend on a package that's not vendored in it for a different bunch of reasons

Can it depend optionally?

KOLANICH avatar Jun 04 '19 06:06 KOLANICH

Can it depend optionally?

Generally, no.

If pip uses a package with C dependencies, that package can't be uninstalled (due to how Windows handles open DLLs), and if that package fails for some reason at the C level, pip is rendered non-functional with no trivial way for users to correct the installation.

pradyunsg avatar Jun 04 '19 06:06 pradyunsg

BTW, conda is written in Python itself, but somehow manages to update the Python interpreter. We may want to borrow something from it.

KOLANICH avatar Jun 04 '19 15:06 KOLANICH