Vendor resolvelib 1.1.0

Open • notatallshaw opened this pull request 1 year ago

Fixes https://github.com/pypa/pip/issues/12768 Fixes https://github.com/pypa/pip/issues/12317 Fixes https://github.com/pypa/pip/issues/12305 Fixes https://github.com/pypa/pip/issues/12497

Draft PR of resolvelib beta to see if it passes tests. Will mark as ready to review and ask for review/merge when final is ready.

notatallshaw avatar Oct 11 '24 22:10 notatallshaw

So I checked how resolution performed using https://github.com/notatallshaw/Pip-Resolution-Scenarios-and-Benchmarks, and I have good news and bad news.

Firstly, as expected, due to the correctness fixes, https://github.com/pypa/pip/issues/12768 and https://github.com/pypa/pip/issues/12317 now resolve instead of showing ResolutionImpossible.

Some unexpected good news is this resolves https://github.com/pypa/pip/issues/12305.

Some slightly bad news is that this turns https://github.com/pypa/pip/issues/12990 from a Build Failure into a ResolutionTooDeep, but that can be fixed if https://github.com/pypa/pip/pull/13017 lands.

Some worse news is that the requirements in https://github.com/astral-sh/uv/issues/1398 go from resolving (after some time) to ResolutionTooDeep, and I don't have an immediate optimization to fix this. It will require deeper investigation, but I don't think that should stop pip from vendoring resolvelib, as it fixes correctness errors and other cases that did produce ResolutionTooDeep.

notatallshaw avatar Oct 18 '24 01:10 notatallshaw

Ready for review / merge once 24.3 is closed.

notatallshaw avatar Oct 31 '24 13:10 notatallshaw

It turns out I accidentally broke depth by changing some resolvelib behavior: https://github.com/pdm-project/pdm/pull/3235#issuecomment-2451150062

I am going to investigate this further to see if it's causing the performance regression I see in https://github.com/astral-sh/uv/issues/1398. If so, I'll make a PR on resolvelib; if removing depth makes no difference to performance, it may be worth dropping it as a preference, as PDM has done.

notatallshaw avatar Nov 01 '24 12:11 notatallshaw

Okay, I've tried both fixing depth and removing depth, and I ran the scenarios in https://github.com/notatallshaw/Pip-Resolution-Scenarios-and-Benchmarks to see what the difference is (the left-hand side is fixing depth, the right-hand side is removing it):

uv run compare.py --github-repo-1 "notatallshaw/pip" --git-commit-1 3334e33066c30bef609d54b4d888030ad5dbef61 --github-repo-2 "notatallshaw/pip" --git-commit-2  03d9fd86ab28e9b61066d0e5b1395ddc842b29f6
Reading inline script metadata from `compare.py`
Installed 8 packages in 10ms
Difference for scenario scenarios/problematic.toml - autogluon:
    	Number of requirements processed: 197 -> 177
    	Number of packages processed: 261 -> 241

Difference for scenario scenarios/problematic.toml - TTS:
    	Success: False -> True.
    	Failure Reason: Resolution Too Deep -> None.
    	Number of requirements processed: 143 -> 156
    	Number of packages processed: 357 -> 222

Difference for scenario scenarios/problematic.toml - boto3-urllib3-transient:
    	Number of requirements processed: 429 -> 141
    	Number of packages processed: 811 -> 2675  (Both ended up with ResolutionTooDeep)

Difference for scenario scenarios/problematic.toml - apache-airflow-beam:
    	Number of packages processed: 312 -> 161

Difference for scenario scenarios/problematic.toml - kedro-test:
    	Number of requirements processed: 335 -> 337
    	Number of packages processed: 522 -> 565

Difference for scenario scenarios/problematic.toml - sphinx:
    	Number of requirements processed: 71 -> 79
    	Number of packages processed: 166 -> 199

Difference for scenario scenarios/problematic.toml - backtracks-to-old-scipy:
    	Number of requirements processed: 74 -> 86
    	Number of packages processed: 111 -> 129 (Both end in build failure)

Difference for scenario scenarios/problematic.toml - django-stubs:
    	Number of packages processed: 36 -> 25

Difference for scenario scenarios/big-packages.toml - apache-airflow-all:
    	Not the same install files.
    	Number of requirements processed: 601 -> 597
    	Number of packages processed: 733 -> 696

Difference for scenario scenarios/big-packages.toml - acryl-datahub-all:
    	Number of requirements processed: 347 -> 334
    	Number of packages processed: 711 -> 504

There are some interesting takeaways:

  1. Using depth as a preference causes the TTS scenario to result in ResolutionTooDeep; vendoring resolvelib 1.1.0 causes this scenario to resolve because it breaks the depth preference
  2. Some scenarios resolve faster with depth removed as a preference and some are faster with depth, but for the ResolutionTooDeep cases it seems that removing it is in general better
  3. In cases where both hit ResolutionTooDeep (e.g. boto3-urllib3-transient), the number of packages processed (and hence the wall-clock time) can end up much larger with depth removed, because the resolver goes deep on one package (in this case botocore, because it has thousands of releases)
  4. Of the resolutions that succeeded, only the apache-airflow-all scenario produced a different installation result; without going into too much detail, it's because Apache Airflow doesn't put lower bounds on its provider packages (https://github.com/apache/airflow/issues/39100), making the resolution search space massive

My conclusions from this are:

  1. On the whole, removing depth seems to be a net positive, probably because resolvelib now backjumps and skips unnecessary backtracking state, so it's worth removing like PDM already has (see the sketch below)
  2. It may be worth reducing the maximum number of backtracking steps before hitting ResolutionTooDeep
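
To make conclusion 1 concrete, here is a minimal sketch of what "depth as a preference" means in a resolvelib-style get_preference, assuming a hypothetical known_depths map (this is not pip's actual provider code):

```python
import math
from typing import Mapping

# Hypothetical sketch: resolvelib picks the next package to work on by
# sorting identifiers on the key returned from get_preference, smallest
# first. known_depths stands in for a map of how far each package sits
# from the user's direct requirements.
def preference_with_depth(identifier: str, known_depths: Mapping[str, float]) -> tuple:
    # Shallow packages (closer to the user's requirements) sort first;
    # this is the behaviour removed in this PR.
    return (known_depths.get(identifier, math.inf), identifier)

def preference_without_depth(identifier: str) -> tuple:
    # With depth dropped (as PDM did), ordering falls back to the
    # remaining criteria; resolvelib 1.1.0's backjumping makes the depth
    # hint unnecessary and, per the numbers above, sometimes harmful.
    return (identifier,)
```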

notatallshaw avatar Nov 10 '24 19:11 notatallshaw

This is ready for review now.

I have removed depth as a preference, which unfortunately means removing the only unit tests that covered the PipProvider (though it is tested extensively as part of the functional tests). I will add additional unit tests for get_preference, and friends, in a follow-up PR (which will replace https://github.com/pypa/pip/pull/13017) once this lands.
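
As an illustration of the kind of unit tests I mean, something like this pytest sketch (the preference key here is hypothetical and much simpler than PipProvider's real one):

```python
# Stand-in preference key: tuples compare element-wise and False < True,
# so direct and pinned requirements sort (and get visited) first.
def get_preference(identifier: str, is_direct: bool, is_pinned: bool) -> tuple:
    return (not is_direct, not is_pinned, identifier)

def test_direct_requirements_preferred_over_transitive():
    direct = get_preference("requests", is_direct=True, is_pinned=False)
    transitive = get_preference("urllib3", is_direct=False, is_pinned=False)
    assert direct < transitive

def test_pinned_requirements_preferred_over_unpinned():
    pinned = get_preference("requests", is_direct=True, is_pinned=True)
    unpinned = get_preference("requests", is_direct=True, is_pinned=False)
    assert pinned < unpinned
```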

notatallshaw avatar Nov 10 '24 20:11 notatallshaw

I'm marking this back to draft while we discuss performance issues with resolvelib 1.1, see discussion here: https://github.com/sarugaku/resolvelib/issues/180

At some point we might have to decide whether to vendor this version of resolvelib, which includes an important correctness fix and some performance improvements but also some big performance regressions, or to wait for a resolvelib 1.2 that might regain some of the performance in very deep backtracking situations (e.g. boto3 / urllib3).

I would very much like input from other maintainers on that, both here and over on the resolvelib performance issue.

notatallshaw avatar Dec 14 '24 21:12 notatallshaw

@notatallshaw I'm generally in favour of correctness improvements, as AFAIU the user usually has workarounds to deal with excessive backtracking / ResolutionImpossible, i.e. restricting the version ranges pip considers. In contrast, issues like https://github.com/pypa/pip/issues/12317 cause pip to have straight-up broken behaviour: a valid solution exists (and within the backtrack limit), but pip skips over it. Last time I had to debug this with a colleague, it was quite annoying and frustrating.
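
For example, the workaround usually looks something like this (package name and bound purely illustrative):

```console
$ echo "botocore>=1.31" > constraints.txt
$ pip install -c constraints.txt myproject
```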

Also, there are additional optimizations in the works, right? I see that you have a draft optimistic backjumping PR on resolvelib's side, and then there's https://github.com/pypa/pip/issues/12317 (prefer direct conflicts on backtrack) on pip's end.

How common are the known scenarios where this PR is a net negative? As long as they're not common, I'd say the small amount of fallout doesn't outweigh the correctness improvements.

cc @pfmoore @sbidoul

ichard26 avatar Jan 12 '25 04:01 ichard26

I agree with @ichard26 - we should prioritise correctness over performance. Claiming there is no solution when there is one is a bug, and we should prioritise fixing that even if it requires a slower approach.

Pip's focus is on standards compliance and correctness, with performance being secondary (although clearly important as long as the first two criteria don't suffer). If users need performance and are willing to compromise on the other factors, then they are likely to be choosing uv anyway (and even if they prefer to use pip in general, "use uv" is a reasonable workaround for cases where pip's performance is an issue and strict correctness doesn't matter).

pfmoore avatar Jan 12 '25 14:01 pfmoore

Thanks both for your feedback. The problem here is that the performance regression means users will see a "ResolutionTooDeep"; there are definitely real cases involving boto3 and urllib3 as dependencies where this will happen with resolvelib 1.1.0 that did not happen with the previous version. So if this is merged, we are likely to see additional users complain.

If other maintainers are comfortable with that, knowing we are achieving correctness as best we can at the cost of some users hitting resolution problems for now, then I will mark this as ready for review. If we can merge soon, I can create another PR to add a few simple performance improvements with the new API before pip 25.0.

I took a look at implementing a fallback inside resolvelib, but found it would cause at least one known "ResolutionTooDeep" that resolvelib 1.1.0 doesn't hit 🙁. So I'm still on the fence about that.

I'm going to take a deeper look in a few weeks at whether we can have our cake and eat it, i.e. introduce this optimization while remaining formally correct. My conclusion may end up being that the resolvelib API needs to change in non-trivial ways, or that we need to consider writing a new resolver, one that learns the lessons of resolvelib, uv, and Poetry.

notatallshaw avatar Jan 12 '25 14:01 notatallshaw

Here's to another day where I am annoyed that botocore has thousands of releases... even if it keeps us honest.

I'll be blunt, I think that "thousands of releases" should stress a backtracking resolver (or for that matter any form of resolver). If users are writing requirements that state that any of many thousands of botocore releases are equally valid, then I'd question those requirements. Obviously, no-one is actually going to bother trying to add lower bounds to every botocore requirement out there - and it only takes one unbounded requirement, selected at the wrong point in the process, to blow up. But if we're optimising for the use case of "please pick any one of these 3,000 candidate versions" then something is seriously skewed[^1].

Maybe what we should have is some sort of limit in the finder - by default, only consider the latest 100[^2] versions of a project. Then, people using a pathological project like botocore would have to explicitly opt-in with --max-candidates=2000. This would cause some users pain, I'm sure, but it would put the focus on the actual cause of the problem, which is projects that publish unreasonable numbers of releases.
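
A rough sketch of what that could look like (hypothetical names throughout; nothing like this exists in pip's finder today):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    version: tuple[int, ...]

def limit_candidates(candidates: list[Candidate], max_candidates: int = 100) -> list[Candidate]:
    # Keep only the newest max_candidates versions; a --max-candidates=N
    # flag would feed N in here, so botocore users could opt back in with
    # e.g. --max-candidates=2000.
    newest_first = sorted(candidates, key=lambda c: c.version, reverse=True)
    return newest_first[:max_candidates]
```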

[^1]: And IMO, what's skewed is botocore's release policy, not our resolver 🙁

[^2]: I picked that number out of the air - if I were doing this for real, I'd do some analysis before choosing the default.

pfmoore avatar Jan 12 '25 18:01 pfmoore

I'll be blunt, I think that "thousands of releases" should stress a backtracking resolver (or for that matter any form of resolver). If users are writing requirements that state that any of many thousands of botocore releases are equally valid, then I'd question those requirements. Obviously, no-one is actually going to bother trying to add lower bounds to every botocore requirement out there - and it only takes one unbounded requirement, selected at the wrong point in the process, to blow up. But if we're optimising for the use case of "please pick any one of these 3,000 candidate versions" then something is seriously skewed

Setting aside the problem of thousands of releases, these complex dependency graphs naturally arise from how Python packaging handles dependencies. Python dependency resolution is an NP-hard problem; other language ecosystems (e.g. Rust and NodeJS) sidestep this by allowing multiple versions of the same library in a solution. Even without boto3 there are many other examples where resolution is very hard; boto3 just sticks out more obviously because it can cause a huge number of downloads.

Maybe what we should have is some sort of limit in the finder - by default, only consider the latest 100 versions of a project. Then, people using a pathological project like botocore would have to explicitly opt-in with --max-candidates=2000. This would cause some users pain, I'm sure, but it would put the focus on the actual cause of the problem, which is projects that publish unreasonable numbers of releases.

I'm dubious, but would be happy to discuss elsewhere; I think this would break resolutions in a non-obvious way. Let's say you have some package foo that has versions 1 to 10,000, and our limit is 1000. If the user depends on A and B, the latest version of A depends on foo > 5000, and the latest version of B depends on foo < 8000, then the user would get a ResolutionImpossible, but only after backtracking through every version of B. Or, if an old version of B didn't put an upper bound on foo but was functionally incompatible, the user would find a solution, but one in which B was functionally broken.
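
Concretely, a toy version of those numbers shows the failure (pure illustration):

```python
# foo has versions 1..10000; a finder cap of 1000 leaves only 9001..10000 visible.
visible = sorted(range(1, 10_001), reverse=True)[:1000]

a_ok = {v for v in visible if v > 5000}  # latest A: foo > 5000
b_ok = {v for v in visible if v < 8000}  # latest B: foo < 8000

print(a_ok & b_ok)                                     # set(): nothing satisfies both under the cap
print(any(5000 < v < 8000 for v in range(1, 10_001)))  # True: a real solution exists outside the window
```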

And the problem is that the number of possible solutions can grow exponentially along dimensions other than the number of candidates, e.g. the number of requirements. So this doesn't work when the number of requirements is causing the issue; in that case you might need a very small maximum number of candidates (e.g. 10) to avoid a ResolutionTooDeep, and you might just get a ResolutionImpossible instead and have to do significant analysis to understand why.

notatallshaw avatar Jan 12 '25 18:01 notatallshaw

Setting aside the problem of thousands of releases, these complex dependency graphs naturally arise from how Python packaging handles dependencies.

Agreed 100%. I'm not trying to minimise the complexity of the problem here, but I think that packages like botocore with thousands of releases are (and should be treated as) very much outliers. I don't think we should compromise correctness, or the performance of simpler cases, just to accommodate such projects.

I'm dubious but would be happy to discuss elsewhere, I think this would break resolutions in a non-obvious way.

It's quite possible it would. I haven't done any sort of analysis at this point. I'm happy to drop the idea.

pfmoore avatar Jan 12 '25 22:01 pfmoore

I'm moving this to 25.1. I'm not comfortable landing this without being able to follow it up with related performance improvements, and I have had an unusually busy January and have not had time to make the PR with those improvements.

I don't want this landing right before a milestone deadline, so I will push to merge it early in the 25.1 cycle, as soon as 25.0 is considered done.

notatallshaw avatar Jan 23 '25 17:01 notatallshaw

I updated the news item to put the note about resolutions into its own bugfix item, similar to what we did for the packaging bugfix last release.

I would like to get this merged early in the 25.1 release cycle so I can work on follow up PRs to take advantage of the new narrow_requirement_selection method.
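
As an illustration only, a provider hook using the new method might look like the sketch below; the heuristic shown (narrowing to identifiers involved in the current backtrack causes) is an assumption about the follow-up work, not the actual implementation:

```python
from typing import Any, Iterable, Mapping, Sequence

def narrow_requirement_selection(
    identifiers: Iterable[str],
    resolutions: Mapping[str, Any],
    candidates: Mapping[str, Any],
    information: Mapping[str, Any],
    backtrack_causes: Sequence[Any],
) -> Iterable[str]:
    # Assumed shape: each backtrack cause carries the requirement that
    # triggered the conflict.
    causes = {cause.requirement.name for cause in backtrack_causes}
    conflicting = [name for name in identifiers if name in causes]
    # Narrowing the set before get_preference runs means the (potentially
    # expensive) preference computation touches far fewer identifiers.
    return conflicting or list(identifiers)
```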

notatallshaw avatar Feb 09 '25 18:02 notatallshaw

I'm going to leave this open for a couple more days and then merge, unless anyone objects.

notatallshaw avatar Feb 17 '25 04:02 notatallshaw