
Speculative: --only-binary by default?

Open pfmoore opened this issue 3 years ago • 89 comments

What's the problem this feature will solve? A lot of users are reporting issues when there's no Python 3.9 binary for projects they need, and pip tries to build from source and fails with an obscure error (because the user doesn't have a compiler, or isn't set up to build the relevant packages).

Describe the solution you'd like Pip shouldn't try to build from source if the user isn't prepared to deal with build errors. As it's not possible to know the user's level of expertise, we should err on the side of caution, and by default only allow wheels to be installed. Users who know they need to install from source and have checked that they can do so, can explicitly say so using a new --allow-source flag, which acts as an "opt-in" to source builds.
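For reference, a user can already opt in to the wheel-only behaviour today with the existing `--only-binary :all:` option. The sketch below only builds such a command line and prints it. The helper name `wheel_only_install_command` and the package name `example-package` are made up for illustration, and the proposed `--allow-source` flag is deliberately not used because it does not exist.

```python
from __future__ import annotations

import sys


def wheel_only_install_command(requirement: str) -> list[str]:
    """Build a pip command line that refuses to build from source."""
    return [
        sys.executable, "-m", "pip", "install",
        "--only-binary", ":all:",  # existing pip option: accept wheels only
        requirement,
    ]


if __name__ == "__main__":
    # Print rather than run, so nothing gets installed by accident.
    print(" ".join(wheel_only_install_command("example-package")))
```

The list can be passed to `subprocess.run` as-is. With this option, a project that has no compatible wheel fails up front instead of attempting a source build.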

Alternative Solutions Improve the error messages when a source build fails. This is hard, because the details of what went wrong are entirely the responsibility of the build backend.

Additional context I don't realistically think this can be added without a lot of disruption, but given that significant numbers of projects ship wheels these days, maybe it isn't as unthinkable as it once was. I do think it's worth discussing the implications, if only as a thought experiment, and I don't know where else we could do that apart from here.

One big problem area is that we can't distinguish between "pure Python" projects that are shipped only as sdists, but which only need Python to build, and complex projects that need a compiler. So restricting to wheels only would require an explicit opt-in for some projects which currently install with no issue.

pfmoore avatar Nov 16 '20 17:11 pfmoore

We could attempt to make the default more intelligent (or maybe just more magical). Basically have the implicit default be that if a wheel is found at all for some project, that project defaults to only allowing wheels.

dstufft avatar Nov 16 '20 21:11 dstufft

But we can’t know what “all projects” means before deciding whether to set the flag, since dependency information is inside the sdist/wheel 🙃

uranusjr avatar Nov 16 '20 23:11 uranusjr

@uranusjr I'm suggesting making --only-binary :all: the default, which doesn't need to know dependency information...

pfmoore avatar Nov 17 '20 07:11 pfmoore

Oops, my previous response was toward @dstufft’s “intelligent” suggestion. Sorry for the confusion.

To express my thoughts in more words, I think the “only wheel unless some project needs to compile from source” would be very difficult to implement since the two parts in the logic depend on each other. I would much prefer @pfmoore’s original suggestion of having --only-binary :all: unless the user explicitly allows source distributions.

uranusjr avatar Nov 17 '20 09:11 uranusjr

The logic isn’t hard and has nothing to do with dependency information.

Current logic is roughly:

  1. Fetch a list of links from the index for project X.
  2. Filter said list of links using the value of --only-binary (among other things like platform tag).
  3. Return list of links for use in the dep solver.

The proposed change only slightly alters the logic in step 2: unless the user has explicitly configured --only-binary, we set its value implicitly by inspecting the entire list of links we’ve discovered for project X and determining whether there is a wheel file.
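A minimal sketch of that step 2, under simplifying assumptions: `Link` and `filter_links` are hypothetical names rather than pip internals, a wheel is recognised purely by its `.whl` suffix, and platform-tag filtering is left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for one file link found on the index."""
    filename: str

    @property
    def is_wheel(self) -> bool:
        return self.filename.endswith(".whl")


def filter_links(links: list[Link], only_binary: Optional[bool] = None) -> list[Link]:
    """Filter the links for one project.

    only_binary=None means the user did not configure the option, so the
    value is inferred from the whole list: wheels only if the project has
    published any wheel at all (before any platform filtering).
    """
    if only_binary is None:
        only_binary = any(link.is_wheel for link in links)
    if only_binary:
        return [link for link in links if link.is_wheel]
    return list(links)
```

A project that ships only sdists passes through unchanged, while a project with at least one wheel loses all of its sdists unless the caller passes `only_binary=False` explicitly.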

This is simple, and would prevent the breakage that Paul is currently seeing: projects which generally make wheels available, but haven’t for this version of Python / OS / whatever.

It wouldn’t change anything for projects which don’t ship wheels at all, some of which will be pure Python, some of which will be compiled code, but in any case there’s no “upgrade to Python 3.9 and suddenly start compiling code” problem for these projects since they are consistent in what they require.

The biggest issue I see is that, in trying to be smarter about our default so that we don’t break certain kinds of projects, we make it easier for projects to accidentally break their users. If my project historically did not upload wheels, and then I start uploading wheels with version 3.1, all previous versions suddenly stop working unless the user opts in with some flag. This happens without any obvious change by the user (upgrading pip is an obvious change, but something I install starting to upload wheels is not).

We could work around that problem by trying to reduce the blast radius of the implicit “wheels only” setting, by saying that we will only filter out non wheel links by default that are of the same version of a wheel we’ve found. Thus if we find an sdist for 1.0, 2.0, 3.0, and 3.1 and we find a wheel for 3.1, when we filter the list of links, we will filter it so it has the sdists for 1.0, 2.0, and 3.0 and the wheel for 3.1.

This makes it so that as soon as you upload a wheel for a given version, you’re effectively signaling that not only should a wheel version be preferable, but that the sdist should only be used if explicitly configured to by the user.
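The per-version variant could look roughly like the sketch below. `Candidate` and `drop_shadowed_sdists` are hypothetical names, and each candidate is reduced to a version string plus a kind, which is far less than what pip really tracks.

```python
from __future__ import annotations

from typing import NamedTuple


class Candidate(NamedTuple):
    version: str
    kind: str  # "sdist" or "wheel"


def drop_shadowed_sdists(candidates: list[Candidate]) -> list[Candidate]:
    """Drop an sdist only when a wheel exists for the same version."""
    versions_with_wheels = {c.version for c in candidates if c.kind == "wheel"}
    return [
        c for c in candidates
        if c.kind == "wheel" or c.version not in versions_with_wheels
    ]


if __name__ == "__main__":
    found = [
        Candidate("1.0", "sdist"),
        Candidate("2.0", "sdist"),
        Candidate("3.0", "sdist"),
        Candidate("3.1", "sdist"),
        Candidate("3.1", "wheel"),
    ]
    for candidate in drop_shadowed_sdists(found):
        print(candidate.version, candidate.kind)
```

Run on the 1.0 / 2.0 / 3.0 / 3.1 example, this keeps the three older sdists and the 3.1 wheel, and drops only the 3.1 sdist.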

dstufft avatar Nov 17 '20 11:11 dstufft

Maybe we simply make --prefer-binary the default (rather than --only-binary)? I didn't suggest that originally because it means that we trigger "why don't I get the latest version?" questions. But maybe that's a less serious breakage?
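To make the "why don't I get the latest version?" effect concrete, here is a simplified model of candidate ordering. It is not pip's actual sort key, which considers more than version and file type, and `parse_version` and `pick_best` are illustrative names that handle only plain dotted numeric versions.

```python
from __future__ import annotations


def parse_version(text: str) -> tuple[int, ...]:
    """Parse a plain dotted version such as '2.14.1' (no pre-releases)."""
    return tuple(int(part) for part in text.split("."))


def pick_best(candidates: list[tuple[str, str]], prefer_binary: bool = False) -> tuple[str, str]:
    """Pick one (version, kind) candidate, where kind is 'sdist' or 'wheel'."""
    def sort_key(candidate: tuple[str, str]):
        version, kind = candidate
        is_wheel = kind == "wheel"
        if prefer_binary:
            # Any wheel beats any sdist; version only breaks ties.
            return (is_wheel, parse_version(version))
        # Newest version wins; a wheel only breaks ties within a version.
        return (parse_version(version), is_wheel)

    return max(candidates, key=sort_key)
```

Given a 1.0 wheel, a 1.0 sdist and a 2.0 sdist, the default ordering picks the 2.0 sdist, while `prefer_binary=True` picks the 1.0 wheel. That trade avoids a source build at the cost of an older release.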

pfmoore avatar Nov 17 '20 11:11 pfmoore

This makes it so that as soon as you upload a wheel for a given version, you’re effectively signaling that not only should a wheel version be preferable, but that the sdist should only be used if explicitly configured to by the user.

I like it? I think something like 98% of packages on PyPI have wheels in the latest release, so I don't think this is catastrophically bad.

Improve the error messages when a source build fails. This is hard, because the details of what went wrong are entirely the responsibility of the build backend.

IMO one of the improvements we should make here is adding a sentence like: "This failure occurred while trying to generate [a wheel / metadata] for packageName. This is not an error in pip."

This also applies to the proposed approach here too -- clearer error messaging would be good. :)

pradyunsg avatar Nov 17 '20 13:11 pradyunsg

I like it? I think something like 98% of packages on PyPI have wheels in the latest release, so I don't think this is catastrophically bad.

I suspect the number would be significantly lower if you count percentage of downloads instead. There are a bunch of popular pure-Python projects that don’t bother with wheels because the effect is minimal. django-grappelli is one of my favourite examples: it’s popular, well-maintained, regularly released, and has very spotty wheel support. --prefer-binary by default would break a lot of Django setups out there.

uranusjr avatar Nov 17 '20 13:11 uranusjr

I think something like 98% of packages on PyPI have wheels in the latest release

I'm pretty sure that's a figure I gave you, and I found the bug in my calculation a bit later 🙁 I need to re-do the sums, but I think it's a lot lower than that, unfortunately.

I suspect the number would be significantly lower if you count percentage of downloads instead.

The number's a lot lower without doing the sums incorrectly 🙂 Sorry about that. I don't have download information, but I'm re-doing the numbers right now, and I'll see what things look like if you factor in "uploaded a file in the last 12 months" as well.

I might try getting download numbers from the BigQuery data for offline analysis. Downloads per project, per year (month?) might be sufficiently interesting, if I can work out how to get that relatively easily in a CSV format or similar.

To confirm, my query has just completed. Comparing "number of projects that distribute sdists but no wheels for their latest version", vs "number of projects that distribute wheels for their latest version", the numbers are almost identical (124508 vs 124782). Looking at projects which have released at least one file in the last year, the values are 32890 and 66635.

So half of all projects, 2/3 of projects active in the last year, have wheels.
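The arithmetic behind those two fractions, using only the four counts quoted above (the function name `wheel_share` is just for illustration):

```python
# Counts quoted above: projects whose latest version is sdist-only
# versus projects whose latest version has at least one wheel.
SDIST_ONLY_ALL, WITH_WHEELS_ALL = 124508, 124782
SDIST_ONLY_RECENT, WITH_WHEELS_RECENT = 32890, 66635


def wheel_share(sdist_only: int, with_wheels: int) -> float:
    """Fraction of projects that ship a wheel for their latest version."""
    return with_wheels / (sdist_only + with_wheels)


if __name__ == "__main__":
    print(f"all projects:     {wheel_share(SDIST_ONLY_ALL, WITH_WHEELS_ALL):.1%}")
    print(f"active last year: {wheel_share(SDIST_ONLY_RECENT, WITH_WHEELS_RECENT):.1%}")
```

The shares come out at roughly one half and two thirds. They count projects, not downloads, so they say nothing about how often a source build actually happens.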

As I say, I think that however we did this, it would result in a lot of breakage.

pfmoore avatar Nov 17 '20 14:11 pfmoore

It's a backwards incompatible change, so regardless it's going to break someone. The goal behind my proposal is to limit the blast radius, so that we limit the breakage, either to specific projects, or to specific versions within a project.

I think there's two questions here too:

  • What do we want the long term position to be, are we happy saying that eventually a project that has never shipped, and will never ship wheels requires an opt in on the CLI to install?
  • Given the answer to the first, what stepping stones can we use to get there? Are there any, or do we need a big bang migration?

I'm not sure about the long term "right" answer. I can see an argument that we want to encourage wheels where possible, but I also think that there are some projects that simply cannot be shipped as wheels, and maybe never will be. We need to figure out if going wheel only by default will end up being worth it, or if we will push too many projects out of viability.

For the second one, I think having the default be to filter out sdists, for any project version that has any wheels uploaded, solves the main driver of this proposal without breaking projects that are not shipping wheels (or used to ship wheels, but found out that was problematic). That could be useful as a stepping stone towards a wheel only default (for instance, we could then provide a warning when installing from sdist), or it could be a reasonable end state that solves the surprising accidental sdist install without dropping support for sdist only projects by default.

dstufft avatar Nov 17 '20 16:11 dstufft

I think we could do a lot better if we could somehow identify which projects are "hard" to build from source. I feel like blocking sdists that build into universal wheels is going a bit far. In the most general sense, that's basically impossible, but maybe we could add metadata somewhere (in the simple index?) to mark "pure Python" projects?

I agree it's not clear what the best long term answer is. We're seeing a lot more people using Python nowadays who honestly don't want to, or know how to, deal with building stuff from source. For those people, pip downloading a sdist that needs a compiler to build is almost certainly just a source of problems. But they are also precisely the sorts of user who won't know enough to add --prefer-binary. However, optimising for such users is going to impact a big chunk of our "traditional" user base negatively.

pfmoore avatar Nov 17 '20 16:11 pfmoore

I wonder if we can leverage PyPI in some way to encourage wheels, or to at least surface better information to highlight which projects don't ship wheels? This might be a better question for discourse? I dunno.

dstufft avatar Nov 17 '20 17:11 dstufft

I've got a big chunk of downloaded data from PyPI that I am querying to get a better feel for this sort of stuff. The biggest problem is the vast amount of (to be polite) "limited value projects" on there - without some form of insight, it's hard to know for sure whether it's OK to ignore a project called "0html" or "django-3-jet-zupit" - especially when it comes up in the same query as "090807040506030201testpip"...

pfmoore avatar Nov 17 '20 17:11 pfmoore

What if PyPI automatically builds the simplest pure Python wheels? There’s recent interest in detecting malicious source distributions on PyPI, and the wheel that process would produce as a side effect could be reused.

uranusjr avatar Nov 18 '20 02:11 uranusjr

Any more thoughts here? I especially like the idea

... having the default be to filter out sdists, for any project version that has any wheels uploaded, solves the main driver to this proposal, without breaking projects that are not shipping wheels

The metadata option also seems reasonable, then the scientific python community could mark NumPy, Scipy, tensorflow, pytorch as "prefer binary by default" and save a lot of CI and cloud resources.

mattip avatar Jun 09 '21 13:06 mattip

I like the idea as well, maybe with a twist: Versions with only sdist are excluded, unless there are no wheels available at all prior to that version.

Using django-grappelli as an example, this means that:

  1. Wheels are selected for 2.15.1, 2.14.4, 2.14.3, and 2.14.2.
  2. Sdists between 2.14.1 and 2.11.2 are all ignored since there are older wheels.
  3. Wheels from 2.11.1, 2.10.2, 2.10.1, 2.9.1, and 2.8.3 can be selected. Sdists between 2.11.1 and 2.8.3 are all ignored.
  4. Sdists from 2.8.2 downwards are allowed, since there are no wheels available past that version.
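One way to read the rule above, as a sketch: an sdist-only version is allowed only if it is older than every version that has a wheel. The release data below is abbreviated from the list above and is not a full listing of the project's files, and `parse_version` and `select_files` are illustrative names that handle only plain dotted numeric versions.

```python
from __future__ import annotations


def parse_version(text: str) -> tuple[int, ...]:
    """Parse a plain dotted version such as '2.14.1' (no pre-releases)."""
    return tuple(int(part) for part in text.split("."))


def select_files(releases: dict[str, set[str]]) -> dict[str, str]:
    """Map each installable version to the file kind that would be used."""
    wheel_versions = [parse_version(v) for v, kinds in releases.items() if "wheel" in kinds]
    oldest_wheel = min(wheel_versions) if wheel_versions else None

    chosen: dict[str, str] = {}
    for version, kinds in releases.items():
        if "wheel" in kinds:
            chosen[version] = "wheel"
        elif oldest_wheel is None or parse_version(version) < oldest_wheel:
            # No wheel exists at or below this version, so the sdist is allowed.
            chosen[version] = "sdist"
        # Otherwise: sdist-only version with an older wheel available, excluded.
    return chosen


if __name__ == "__main__":
    abbreviated = {
        "2.15.1": {"sdist", "wheel"},
        "2.14.1": {"sdist"},
        "2.11.1": {"sdist", "wheel"},
        "2.8.3": {"sdist", "wheel"},
        "2.8.2": {"sdist"},
    }
    print(select_files(abbreviated))
```

With this data, 2.14.1 is dropped because older wheels exist, while 2.8.2 remains installable from its sdist.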

uranusjr avatar Jun 12 '21 14:06 uranusjr

For another data point, here is an issue filed by a Python 3.5 user of cffi who cannot build from the sdist; changing the default would have helped them.

mattip avatar Jul 12 '21 09:07 mattip

Please edit the title binary-only -> only-binary. I always have to check pip --help to figure out the correct spelling.

mattip avatar Jul 15 '21 12:07 mattip

FWIW, that tells me that we should add an alias for that option.

pradyunsg avatar Jul 15 '21 12:07 pradyunsg

+1 for a solution via either package metadata or via a simple rule like "--only-binary :all: is applied if a package has any wheels".

Otherwise it has the risk of becoming a pip-only solution which is hard to understand. Today the problems mostly surface via pip because it's by far the most popular installer, but this is really a PyPI-ecosystem problem where the dual model of offering both source and binary packages and allowing freely mixing those is the root cause.

Sdists from 2.8.2 downwards are allowed, since there are no wheels available past that version.

This does not seem like a good idea. Not only is it harder to understand, it also partially defeats the purpose here. If a package has a very old source-only release (e.g., from the pre-wheels era) then that will be found the moment there's no suitable wheel for a user.

In your particular example, django-grappelli 2.8.2 is from 2016; a user who types pip install django-grappelli almost certainly does not want a version that old.

rgommers avatar Jul 16 '21 08:07 rgommers

Makes sense. I think it's quite difficult to gauge the actual impact here, since people here all care a lot about Python packaging (for obvious reasons) and likely push for wheels in the projects we are involved in. So I feel the only way forward is to actually try to implement this (maybe as a --use-feature first) and see if we can ~survive it~ make it work in real-life usage.

There are probably still some implementation details we need to sort out. Should we go with --prefer-binary or --only-binary by default? How does a user disable this and prefer an sdist with newer version? etc. But I'm going to mark this as "awaiting PR" so anyone can try to come up with something. It's easier to put things into perspective when there is an implementation and test cases ~to object to 😛~.

uranusjr avatar Jul 23 '21 02:07 uranusjr

I would propose an alternate path forward. Rather than changing the default behavior of pip to prefer wheels, add a second CLI entry point of pipw (pipb?) which is an alias with the default of --prefer-binary / --only-binary (and maybe rejects any attempt to change source-only installs from pypi and local source installs?). I think adding a 'w' is a much easier mnemonic to remember than the right flag(s).

As has been mentioned above, pip currently mixes two different things (building and installing from source, and installing from pre-built binaries) and I think it is a mistake to tilt pip even more in favor of being a binary-only package manager. By adding a new CLI entry point it is possible to make whatever changes are needed to make pip behave like a binary package manager without having to worry about breaking any existing users.

I think another issue here is a disagreement as to what exactly wheels are for. I have always considered (and I may be the only one to hold this position) the sdist the canonical source of truth for what the released version of the package is on pypi, with the wheels provided for the convenience of the user (the linux wheel spec is "manylinux", which suggests it is a best-effort rather than authoritative artifact!). I think making pip more binary-package-manager-like by default will only reinforce the expectation that projects will (promptly) provide a wheel for your platform / Python version / Python implementation and that one not existing is a "bug".

There was a discussion on the numpy mailing list about the ever expanding number of platforms that projects are expected to provide wheels for becoming unsustainable (the latest beta release of Matplotlib has 21 wheels and we are not yet covering the full Python version/Python implementation/arch/OS matrix https://pypi.org/manage/project/matplotlib/release/3.5.0b1/). If pip is going to keep going down the path of binary packaging, I think there need to be more discussions about how filling out the build matrix can be lifted from the projects to some centralized build service, as Homebrew, conda-forge, and the Linux distributions do already. Separating the wheels into their own channel/management chain would also make it much easier to manage things like updating version pinning on the wheels post-facto (e.g. putting an upper bound on something or banning known-bad version combinations), rebuilding with updated versions of non-Python dependencies (xref https://github.com/h5py/h5py/issues/1942), or dealing with CVEs.

tacaswell avatar Sep 19 '21 23:09 tacaswell

How can we make the abstract discussion here more concrete? I see a couple of subjects being mixed together

| topic | possible mitigation |
| --- | --- |
| aliasing only-binary and binary-only | PR to implement, should be the least controversial change suggested here |
| providing a path for naive users to prefer wheels over sdists by making only-binary the default, making prefer-binary the default, or providing a different cli entry point | competing PRs to do these would provide a forum for discussion over the name and/or need for this |
| preferences when using --prefer-binary when sdists are available for newer versions and wheels available for older ones | ??? |
| wider ranging changes in the way wheels are built and distributed for the growing Nd matrix of python-versions/implementations/os-versions/machine-architectures/available-hardware | ??? - mailing list/discourse? |

I apologize if I missed some of the topics here, please feel free to add to the table. The next question is who will do the work ...

mattip avatar Nov 24 '21 06:11 mattip

I’m dropping a link to the RFC proposing to disable install scripts by default for NPM, which would have roughly the same effect as making --only-binary the default (not --prefer-binary). npm/rfcs#488

uranusjr avatar Nov 24 '21 06:11 uranusjr

Most of what @mattip says looks right to me.

For the final point, I agree that this needs a wider discussion than just the pip tracker, once we start going beyond the basic "make users opt into building from source" approach. If we want a more complete solution, I'd suggest that interested parties post a proposal on Discourse for new metadata (which would need to be in sdists and exposed on the PyPI simple index, to be usable here) stating at least two things:

  1. Project is pure Python and needs no external tools to build.
  2. Project owners suggest that installers require user opt-in to build from source.
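Purely as a thought experiment, an installer could consume those two pieces of metadata along the lines of the sketch below. Nothing here exists today: `BuildHints`, its two fields, `may_build_from_source`, and the `default_allow` fallback for projects that declare nothing are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BuildHints:
    """Hypothetical metadata read from an sdist or the simple index."""
    pure_python: bool = False                # item 1 above
    source_build_needs_opt_in: bool = False  # item 2 above


def may_build_from_source(
    hints: BuildHints,
    user_opted_in: bool,
    default_allow: bool = True,
) -> bool:
    """Decide whether an installer should attempt a source build."""
    if user_opted_in:
        return True   # an explicit user choice always wins
    if hints.source_build_needs_opt_in:
        return False  # the project owners asked for an opt-in
    if hints.pure_python:
        return True   # no external tools needed, so the build is low risk
    return default_allow  # unmarked project: fall back to installer policy
```

Setting `default_allow=False` would correspond to a wheel-only default with an exemption for projects marked as pure Python.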

We'd need buy-in from setuptools at an absolute minimum (if setuptools won't write the relevant metadata to the sdist, it's essentially not going to be available to consumers) and that probably means setuptools needs to add support for PEP 643, as that's the only way we have of getting reliable metadata for sdists. Realistically, no build backend other than setuptools is an issue here, because only setuptools supports both "simple" and "insanely complicated" build processes 🙂

If (and honestly, this seems like a big "if" to me 🙁) we can get commitment from the various parties in the community, then that could become a PEP and implementation. But it feels like something that may be too much for the level of volunteer resource we have, so it would probably need funding to get anywhere. As the scientific/data science community has a strong interest in this, maybe there are grants around the sustainability/build infrastructure area that could be used for something like this?

pfmoore avatar Nov 24 '21 09:11 pfmoore

I’m dropping a link to the RFC proposing to disable install scripts by default for NPM, which would have roughly the same effect as making --only-binary the default (not --prefer-binary)

That's a very long discussion, but from what I could gather it's only motivated by security. While here we're talking about usability, and the issues around building of complex packages which is likely to fail. So while there may be overlap in impact, the tradeoffs are probably very different.

But it feels like something that may be too much for the level of volunteer resource we have, so it would probably need funding to get anywhere. As the scientific/data science community has a strong interest in this, maybe there are grants around the sustainability/build infrastructure area that could be used for something like this?

If it looks like there will be buy in for this idea from the relevant maintainers/parties, I'd be happy to lead the obtain-funding part.

rgommers avatar Nov 24 '21 10:11 rgommers

I think I have come across a relevant problem in this context: on a Raspberry Pi you would like to use binaries from https://www.piwheels.org/

But how do you include the binaries from there in a lock file?

See also:

  • https://github.com/piwheels/packages/issues/260
  • https://github.com/python-poetry/poetry/discussions/4816

jedie avatar Nov 24 '21 10:11 jedie

That's not relevant to this discussion @jedie. Yours is more a usage question suitable for Stack Overflow. If you do want to discuss a Pip design change, please open a new issue.

rgommers avatar Nov 26 '21 15:11 rgommers

Let's focus back on the original proposal here.

Do the @pypa/pip-committers as a group want to switch to --only-binary :all: being the default behaviour?

If we do, @rgommers has offered to find funding to make it happen[^1]. But nothing will happen until we reach some sort of consensus. The default is of course to do nothing, but even if that's what people prefer, it would be nice to be explicit, and state clearly that pip considers building from source to be just as fundamental as installing wheels. We could then close this PR and move on.

Details like whether we do --only-binary, --prefer-binary, or @dstufft's hybrid suggestion can be part of the implementation, once we have some level of consensus (and the funding 😉)

For anyone who wants more information @rgommers suggests here that the data science community would benefit from --only-binary :all: being the default.

FWIW, I'm +1 on making this change.

[^1]: Questions like "how do we transition", "what would happen to all those users who use pip to build their applications from source", "how do we handle all the hate mail from people affected", would be part of the funded work, so for now we can assume that's "someone else's problem", and concentrate on whether we support the principle.

pfmoore avatar Dec 23 '21 10:12 pfmoore

PS One interesting piece of data on the whole issue of sdist building, would be to trawl through the tracker here and identify what proportion of our issues are related to building sdists. I bet it's high. Unfortunately, I don't have the time to do this...

Maybe a label marking such issues would be useful?

pfmoore avatar Dec 23 '21 10:12 pfmoore