
pull preparer logic out of the resolver to consume metadata-only dists in commands

Open cosmicexplorer opened this pull request 2 years ago • 26 comments

This PR is on top of #12871; see the +337/-23 diff against it at https://github.com/cosmicexplorer/pip/compare/refactor-requirement-preparer...cosmicexplorer:pip:metadata_only_resolve_no_whl?expand=1.

Continuation of #11512, now with test cases and a lot of comments.

  • Closes #11512.

Problem

The use case proposed in #53 uses pip just to execute its resolution logic, in order to generate e.g. a lockfile, without pulling down any dists (which may be quite large). We eventually arrived at pip install --report --dry-run as the way to enable this. However, the workstream to improve pip performance has been somewhat fractured and distributed across multiple efforts (as often occurs in open source projects), and by the time PyPI enabled serving PEP 658 metadata, it turned out we had failed to implement the caching behavior we wanted, as described in #11512.

Solution

  1. Avoid finalizing the requirement preparer at all in the resolver itself, and require each command to explicitly instruct the requirement preparer to download anything further.
    • To clarify the change that's occurring here, rename RequirementPreparer#prepare_linked_requirements_more() to .finalize_linked_requirements(..., require_dist_files=True).
      • If require_dist_files=False, the requirement preparer will not download wheels when it already has metadata (see the sketch after this list). This resolves #12603.
    • This separation will hopefully make it easier to progressively cache/virtualize other sections of the resolve process later on as per #12184.
  2. Move the test_download_metadata() tests into a new file test_install_metadata.py, because the metadata tests were supposed to be done against the install command anyway.
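
To make the intended flow concrete, here is a rough sketch (hypothetical, heavily simplified names; not pip's actual internals) of how a command drives the preparer after this change, with the resolver itself never triggering downloads:

class RequirementPreparer:
    def finalize_linked_requirements(self, reqs, require_dist_files=True):
        # If require_dist_files is False, a requirement that already has
        # metadata (e.g. from a PEP 658 .metadata file) is left alone and
        # no wheel/sdist is downloaded for it.
        for req in reqs:
            if require_dist_files and req.local_dist_file is None:
                req.local_dist_file = self._download(req)

    def _download(self, req):
        ...  # fetch the actual wheel/sdist for `req`


def run_install(resolver, preparer, requirements, dry_run):
    # The resolver no longer finalizes anything itself; it resolves
    # against metadata only.
    resolution = resolver.resolve(requirements)
    # Each command explicitly asks for as much further preparation as it
    # needs: `install --dry-run --report` can skip downloads entirely.
    preparer.finalize_linked_requirements(
        resolution.requirements,
        require_dist_files=not dry_run,
    )

The point of the rename is that the resolver only ever needs metadata; whether dist files are actually required becomes a per-command decision.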

cosmicexplorer avatar Jul 28 '23 02:07 cosmicexplorer

cc @pfmoore @uranusjr @jjhelmus @ddelange

cosmicexplorer avatar Jul 28 '23 02:07 cosmicexplorer

Question for @pradyunsg, @sbidoul, and anyone else: as I note in the OP, this PR is less a "bugfix" (since we had not agreed on the exact behavior to commit to) and can almost be viewed as new functionality, where we are now augmenting the guarantee of install --dry-run to include the new behavior that avoids downloading any dists at all when metadata is available. Is that a useful framing, or is it easier to keep this as a bugfix in the NEWS entry?

cosmicexplorer avatar Jul 28 '23 03:07 cosmicexplorer

Question: would this be easier if we removed fast-deps first?

uranusjr avatar Jul 28 '23 05:07 uranusjr

I don't believe so. As I discussed at length in https://github.com/pypa/pip/pull/11478#issuecomment-1653555553, the new logic we added to implement fast-deps (especially the changes to the requirement preparer to enable metadata-only dists) is still necessary to make PEP 658 support work at all. At this point, removing fast-deps itself is actually rather easy, since we've now compartmentalized it into just one of several ways to construct a requirement without a backing dist on local disk.

The part that has proven error-prone, imo, is the intrinsic difficulty of retrofitting these "virtual" dists into a codebase that wasn't architected for lazy computation (because it didn't need to be). I believe this change improves that situation by rearranging the methods of the requirement preparer to make it very explicit where the "virtual" dists are finally hydrated into real installed dists. I am aware the diff looks quite large, but the vast majority of that comes from transplanting the metadata tests to use install instead of download, which was the biggest source of confusion surrounding these changes.
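
To illustrate what a metadata-only "virtual" dist and its hydration amount to, here is a purely conceptual sketch (hypothetical names, not pip's real data model): the dependency metadata is available up front, whether it came from fast-deps range requests or a PEP 658 .metadata file, and the archive is fetched only if a command later needs it.

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class VirtualDist:
    # Conceptual sketch only, not a real pip class: metadata is known at
    # resolve time, while the archive itself may never be fetched.
    name: str
    version: str
    requires_dist: List[str] = field(default_factory=list)
    local_archive: Optional[str] = None  # path to a wheel/sdist, once downloaded

    def hydrate(self, download: Callable[[str, str], str]) -> None:
        # Only commands that actually install or save files need this; a
        # metadata-only resolve (e.g. --dry-run --report) never calls it.
        if self.local_archive is None:
            self.local_archive = download(self.name, self.version)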

Personally, I think that fast-deps is still worth keeping at this point, because from what I've seen, the error-prone code here has been entirely within the requirement preparer, having nothing to do with the zip parsing. Right now I see fast-deps as an alternate way to test our code paths for "virtual" (metadata-only) dists outside of the specifics of PEP 658, and I think pip will benefit from hammering away at the "virtual" dist code paths in the future (see #12184).

...however, the fantastic work done to decouple the actual zip operations from the requirement preparer also means that it wouldn't be difficult to remove fast-deps entirely. I am just absolutely positive that fast-deps is orthogonal to the difficulty I have seen with implementing "virtual" dists, and I think the difficulty we've seen is either intrinsic to the problem of efficiently resolving dependencies or just normal tech debt we've been paying back over time.

I also believe the "virtual" dists workstream has been made more difficult than usual because it has inspired a lot of excitement about improving pip, so pip maintainers have had to piece together stuff written by many different people over time. I created this as a separate PR because I recognized there were several things to fix at once with the one I based it off of, and I thought it would be easier to have a single person with more context on the issue (me) take ownership.

cosmicexplorer avatar Jul 28 '23 05:07 cosmicexplorer

Did a quick skim over the PR, as I'm wrapping up to head to bed -- not taking a closer look since CI isn't particularly happy with this right now.

One thing about fast-deps -- there seems to be some low-hanging fruit there, based on the final few things stated in #8670 that I've not looked into.

pradyunsg avatar Jul 31 '23 00:07 pradyunsg

There are likely some details this isn't getting correct yet, but the general direction looks right to me. Also, thanks for the "finalize" rename; that needed to be done. Should we also rename the needs_more_preparation flag?

uranusjr avatar Jul 31 '23 00:07 uranusjr

Ok, I've figured out the CI issues and I've also split this change out into meaningful commits, which makes it much clearer that most of this change is rearranging test cases. I recommend looking through the preceding commits as well, but the vast majority of this change is the +303,-91 commit at a67a9a8699627950a385da8532ee499bc5453c16 (much less intimidating!).

cosmicexplorer avatar Jul 31 '23 11:07 cosmicexplorer

Ok, it was quite unclear to me what BuildTracker was actually doing, but now that I've figured it out, I decided to add a few comments to it in 7ac10fcb53b7b9bb75700d0362f52ddd7e84d5b0. Please let me know if I've gone overboard there, I can easily make that into a separate PR.

cosmicexplorer avatar Aug 01 '23 00:08 cosmicexplorer

@jjhelmus please let me know if you have any thoughts on my remix of your change here. I took it over because I wrote the code that caused this bug and I wanted to fix things up to stop this from happening again, but I want to make it clear that your PR correctly fixed the issue and you should get credit for this. Thanks so much!

cosmicexplorer avatar Aug 01 '23 16:08 cosmicexplorer

I've just left some explanatory comments above. Reiterating to reviewers: I have split this large change into commits which should be much easier to review than all at once.

I'm in no rush at all, just wanted to leave that. Sorry for the pings! ❤️

cosmicexplorer avatar Aug 01 '23 17:08 cosmicexplorer

Thanks for this effort! Would it make sense to split it into separate PRs? In particular, those that reshuffle or add tests for existing behaviour would be easy to review and merge. Also, the commit about the build tracker says it adds comments but in practice seems to do a lot more, and might be easier to review independently.

sbidoul avatar Aug 01 '23 17:08 sbidoul

@sbidoul thank you so much, I will absolutely do that now.

cosmicexplorer avatar Aug 01 '23 17:08 cosmicexplorer

Upon @sbidoul's recommendation, I have separated out some of the more sprawling changes that were in this PR into multiple separate PRs, noted with checkboxes in the OP. Reviewers, please take a look at those when you have time. Thanks!

cosmicexplorer avatar Aug 02 '23 05:08 cosmicexplorer

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master, then it will be eligible for code review and hopefully merging!

pypa-bot avatar Aug 05 '23 04:08 pypa-bot

Hi @cosmicexplorer 👋

I'm watching all your efforts with a lot of excitement from the sidelines!

It's looking good (and really fast!) at a first look:

$ pip install 'pip @ https://github.com/cosmicexplorer/pip/archive/refs/heads/metadata_only_resolve_no_whl.zip'
$ pip install --no-cache-dir --ignore-installed --dry-run --no-deps --report - tensorflow
Collecting tensorflow
  Downloading tensorflow-2.13.0-cp38-cp38-macosx_10_15_x86_64.whl.metadata (3.2 kB)

I'm very much looking forward to this (huge) speedup. Thanks a lot, also to the reviewers! 💕

ddelange avatar Aug 09 '23 10:08 ddelange

@ddelange also see https://github.com/pypa/pip/issues/12184#issuecomment-1676002698, where I introduce some new caching that brings resolve times down to 4-5 seconds (not including any downloads at the end). Will be making PRs out of those changes once I get this one in so as not to overwhelm reviewers.

cosmicexplorer avatar Aug 14 '23 06:08 cosmicexplorer

@jjhelmus just as a status update: I'm making this PR depend on #12197 because it uses that change's test framework for generating dist metadata, but I'm still working to get it merged because I recognize this enables a lot of other stuff built on top of pip (e.g. pantsbuild/pex#2210). Thanks so much again for jumping in and solving this; if you're using pip install --report in an application, I would love it if you could test out this branch.

cosmicexplorer avatar Aug 19 '23 23:08 cosmicexplorer

cc @uranusjr @pradyunsg @ others: I believe this is now ready for review!

cosmicexplorer avatar Sep 03 '23 01:09 cosmicexplorer

ping @pfmoore @uranusjr @sbidoul: I am working further on #12256, but would really appreciate it if we could merge this first so we don't have multiple changes to metadata-only dist usage in flight at once.

cosmicexplorer avatar Sep 14 '23 16:09 cosmicexplorer

I don't suppose one of the pip maintainers has a chance to review this soon? I've reviewed it myself (above) fairly carefully, and the PR already had some review passes from pip maintainers earlier, so hopefully the final review from someone shouldn't be too time-consuming :-)

edmorley avatar Feb 13 '24 08:02 edmorley

I'll be honest, I've avoided looking at this (in spite of the fact that I'm generally in favour of the idea) because it (and @cosmicexplorer's other PRs) seemed to be changing very rapidly - there were a lot of force-pushed changes in a short period, meaning that things were changing as I looked at them 🙁 In addition, this is a big change affecting a lot of files, so reviewing isn't a fast process (especially when you have limited time, as I do).

I'm concerned that the preparer is one of the least "obvious" parts of the resolution process - it's full of generic terms (not least of which is "prepare" itself!) which don't really give much clue as to what precisely is going on. Every time I need to deal with the preparer, I have to do a bunch of research to remind myself what's happening before I can proceed. I'm worried that if we delegate chunks of the "prepare" task to individual commands, we will end up making that problem worse, with everyone needing to keep track of what's happening.

To give an example, install.py now calls finalize_linked_requirements. It's not obvious to me why that call is needed. What exactly wasn't "finalized", and what's so special about linked requirements here? This is all stuff that was just as obscure before this PR (as I said, the preparer was never the most obvious bit of code!) but previously, the install command (and more importantly, maintainers working on that command) didn't need to care. Now it does. Why? What changed? The wheel and download commands used to call the (impressively uselessly named!) prepare_linked_requirements_more method, but install never did. I'm guessing that the point here is to pull out the stuff that everyone did into an explicit method, but then allow install to pass the "dry run" flag to say "don't download actual dists". But I'm working that out from the comments in this PR and seeing the call in the context of this PR. If I were just looking at the install command for some reason, long after this was merged, I wouldn't have that information.

I'm sorry, I know that I'm making this PR somehow responsible for the big maintainability mess that is the preparer, and that's not entirely fair. But one of the reasons this PR isn't getting eyes on it for review is because of that maintainability mess, and I honestly think we need to start paying back some of that technical debt if we're to have any chance of progressing on bigger (and important!) work like this. In particular, performance improvements tend to increase technical debt (you're explicitly trading speed against maintainability) and I think we need to be cautious here, given the current struggle we have with maintainer time.

I'd love to see this land so it can then unblock some of your other performance improvement PRs

I'm not sure this will unblock things as much as we'd hope. The problem is simply that all of these changes are huge - 21 files for this one, 38 and 43 for the two you linked. Getting time to review changes on this sort of scale is a problem in itself (for me, at least - I can't speak for the other maintainers). A better approach might be to refactor the PRs into simpler, step-by-step changes, each of which has a small, manageable, scope and which can be accepted individually, both reducing the technical debt and incrementally moving towards these end goals. For example a change that renamed and documented prepare_linked_requirements_more would be much easier to review and accept, and would pave the way for a follow-up that replaced it with better-factored components. And so on.

pfmoore avatar Feb 13 '24 12:02 pfmoore

I'm not sure this will unblock things as much as we'd hope. The problem is simply that all of these changes are huge - 21 files for this one, 38 and 43 for the two you linked

The other two PRs are stacked PRs (so they don't change as many files as that) - see the PR description for messages like:

This PR is on top of https://github.com/pypa/pip/pull/12186; see the +392/-154 diff against it at https://github.com/cosmicexplorer/pip/compare/metadata_only_resolve_no_whl...cosmicexplorer:pip:link-metadata-cache?expand=1.

edmorley avatar Feb 13 '24 13:02 edmorley

@pfmoore I've read through your last comment a few times and I'm not sure I understand if there is a way forward or not?

Would you prefer the PRs to be further broken down into something smaller or would you like something bigger that fixes a lot of the genericness of the preparer logic? Or are you saying that there's too much technical debt in this part of pip for anyone other than a maintainer to touch it?

I had some optimization ideas that are orthogonal to the ones created by this series of PRs, but they would touch a lot of the same logic and I didn't want the PRs to step on each other's toes.

notatallshaw avatar Feb 19 '24 14:02 notatallshaw

Would you prefer the PRs to be further broken down into something smaller or would you like something bigger that fixes a lot of the genericness of the preparer logic?

Neither. I'd prefer a series of smaller commits that make the preparer more maintainable. Is that what you mean by "fixes a lot of the genericness"? I was thinking about steps (each in its own PR) like renaming functions to be more meaningful, adding/improving comments and docstrings to explain the logic, factoring out complex code into (named, easier to understand) standalone functions, encapsulating conditionals in functions so that the calling code needs less branching, etc. Standard "refactoring" types of improvements.

Once those have been done, I'd hope that this PR can be refactored into a number of similar smaller PRs, each tackling a part of the task without needing the whole thing to be done in a "big bang".

Or are you saying that there's too much technical debt in this part of pip for anyone other than a maintainer to touch it?

No, I'm saying that there's so much technical debt, it's hard for anyone to touch it. Getting reviews is hard because it needs a second person to put the effort into understanding, and that's hard because of the technical debt. If we were in a better situation, committer reviews would be easier because the committers would be sufficiently comfortable with the code to do a review without needing to re-learn the underlying logic.

In fact, I think refactoring to reduce technical debt might be easier for an external committer, because they are coming at the code with no preconceptions, and with a fresh viewpoint. When I look at the code, all I think is "ugh, what a mess" 🙁

But it would touch a lot of the same logic and I didn't want the PRs to step on each others toes.

That's precisely the sort of over-coupling that the complexity of the existing code results in.

I don't want to make "fix the technical debt" into some sort of prerequisite or "tax" that must be paid by someone like yourself who simply wants to contribute improvements to pip. We definitely can't afford to discourage that sort of work! But I appreciate the frustration you feel with how hard it is to get reviews, and this is the best way I can think of for moving the "getting a review" problem forward. If you'd prefer not to get caught up in the technical debt issue, by all means ignore those parts of my comments, and simply wait until one of the core maintainers (which might well be me) gets time to do a proper review of the code here.

pfmoore avatar Feb 19 '24 17:02 pfmoore

@pfmoore thanks so much for the clear feedback. The amount of merge conflicts alone (10 separate files before rebasing just now) is a strong indicator that this change should be split apart. I am focusing particularly on your comments at https://github.com/pypa/pip/pull/12186#issuecomment-1941404252:

I'm concerned that the preparer is one of the least "obvious" parts of the resolution process - it's full of generic terms (not least of which is "prepare" itself!) which don't really give much clue as to what precisely is going on. Every time I need to deal with the preparer, I have to do a bunch of research to remind myself what's happening before I can proceed. I'm worried that if we delegate chunks of the "prepare" task to individual commands, we will end up making that problem worse, with everyone needing to keep track of what's happening.

This is definitely something I am hoping to improve with the series of changes I will split out from this PR.

To give an example, install.py now calls finalize_linked_requirements. It's not obvious to me why that call is needed. What exactly wasn't "finalized", and what's so special about linked requirements here? This is all stuff that was just as obscure before this PR (as I said, the preparer was never the most obvious bit of code!) but previously, the install command (and more importantly, maintainers working on that command) didn't need to care. Now it does. Why? What changed? The wheel and download commands used to call the (impressively uselessly named!) prepare_linked_requirements_more method, but install never did. I'm guessing that the point here is to pull out the stuff that everyone did into an explicit method, but then allow install to pass the "dry run" flag to say "don't download actual dists". But I'm working that out from the comments in this PR and seeing the call in the context of this PR. If I were just looking at the install command for some reason, long after this was merged, I wouldn't have that information.

This part is crystal clear to me--the refactoring + new functionality (resolving #12603) is absolutely far too many things to change in one PR. I think I can separate this into at least three changes:

  1. Refactor dist caching in InstallRequirement. No change in external behavior.
  2. Remove prepare_linked_requirements_more() and replace with finalize_linked_requirements(). Further documentation/abstraction of the state machine involved in the preparer. No change in external behavior.
  3. Make --dry-run avoid downloading when metadata is available. Resolves #12603.

cosmicexplorer avatar Jul 20 '24 17:07 cosmicexplorer

I was able to split this in two, and I believe the result cleanly separates refactoring from feature work. Please take a look at #12863 for the refactoring change with effective documentation; I have set the current change to depend on it and have set this into draft mode until we've all had time to reach consensus on #12863. Thanks again for the wonderful feedback; I really do love contributing to this project!

cosmicexplorer avatar Jul 20 '24 21:07 cosmicexplorer