get-pip icon indicating copy to clipboard operation
get-pip copied to clipboard

Distribute a zipapp version of pip

Open pfmoore opened this issue 2 years ago • 6 comments

This PR adds generation of a .pyz version of pip to the build script. Two copies of the .pyz are created in the public directory - an unversioned pip.pyz and a versioned pip-22.1.2.pyz. My assumption is that users will typically fetch the unversioned zipapp (and that's what our documentation should suggest) but people who want to pin a specific pip version can fetch the versioned zipapp.

As we don't do any deletion of generated files, older versioned zipapps will accumulate over time.

We could generate older versions, by fetching the wheels and rebuilding the zipapp. I'm not sure if this is worth it, but it might be if we ever plan on adding removal of previously generated files. In that case, we should probably decide on a policy for what zipapp versions we retain (regenerate) - I'd suggest versions for the current and previous year, as a simple rule.

This PR doesn't include a generated zipapp. I wanted to get the generation script sorted out and agreed, and then work on publicising the new method in the pip documentation, before actually publishing anything. Note we should probably not merge this before the pip 22.2 release, to give us time to do the pip side of things (which involves documenting the new distribution option, basically - but getting the messaging right is important so I don't think we want to rush this into 22.2).

pfmoore avatar Jul 11 '22 10:07 pfmoore

Whelp, before we merge this, we need to update the CI checks to account for untracked files being generated by the generate script.

pradyunsg avatar Jul 15 '22 13:07 pradyunsg

Thanks, I wasn't aware of that. Added to the list. My current plan is to pick this up after the 22.2 release, at which time we can sort out details like

  • documenting and publicising the zipapp
  • finalising the deployment process and merging this PR
  • deciding whether we want to test the zipapp in pip's CI.

Edit: Wait, CI ran on this PR and didn't fail. So what exactly do you mean by accounting for untracked files?

pfmoore avatar Jul 15 '22 15:07 pfmoore

I think the CI should fail, if files that are going to be generated by the generation script are not checked in. Right now, it's not doing that.

In other words, the zipapps generated are not staged/committed in this PR right now. Thus, when running the script would generate it, there are new files generated in the working/public directory that aren't checked in. This means that we can have a discrepency between what's checked in and what's auto-generated, which isn't great.

The CI check is designed to ensure that our copies of get-pip.py are correct/up-to-date, for the version of pip that's been released. It, however, does not ensure that the newly-generated files are also included in the repository. I reckon, it should -- which is what I was describing in the previous comment about it needing an update. The zipapp files that this would generate would be marked as "untracked" by git right now and we should have the CI fail when this happens because it's got untracked files in the working directory after generation (similar to how pip's vendoring would fail if you modified a file in the _vendor directory).

pradyunsg avatar Jul 15 '22 21:07 pradyunsg

FWIW, another example of the sort of situation this would trip on would be when adding 3.7 to the mapping on top (i.e. logic for generating public/3.7/get-pip.py), but not adding the generated file. That's a similar sort of "generation script generates an untracked file and CI does not complain".

pradyunsg avatar Jul 15 '22 21:07 pradyunsg

Ah, OK, so you're saying we need to improve CI, not specifically for this PR but in general. And we should do that before merging this because - I'm not quite sure why, but because this PR means we'll have more risk of forgetting to check in the newly generated files?

I think I get this, but I'm somewhat hazy on the release process, and I'm not sure I can safely play around with it because the noxfile seems to actually do a push - plus, it's doing signed releases and I'm not set up with any signing mechanism, so I don't think I can even sign a release in the first place...?

I guess this has changed a lot singe I last did a release, as I don't remember any of this. But as I say, I'll pick this up after 22.2 is out.

pfmoore avatar Jul 15 '22 22:07 pfmoore

I'm not quite sure why, but because this PR means we'll have more risk of forgetting to check in the newly generated files?

Because this PR is adding new generated files (to the generation scripts) but not to the actual repository. This would mean that the next time someone runs the release automation, they'll also check in the additional zip app (if we land this before 22.2).

pradyunsg avatar Jul 16 '22 08:07 pradyunsg

OK, I'm just coming back to this.

@pradyunsg what do you want to do about untracked files? As things stand, everything is fine (as far as I can tell by simply desk-checking the release scripts - see above regarding how it doesn't seem possible to test them). The generate script creates the new zipapp, the release process will add it (using git add .). The CI doesn't check new files, I understand that. But I don't see that as blocking this PR - instead, I see this PR as illustrating that the approach in #159 needs some extra thought about how we handle new files being created by the generate process. I'm happy to have that discussion, but I see it as independent of this PR.

I'll add some thoughts to #159. But are you going to insist that this PR must wait until #159 is merged in a form that handles new, untracked files?

pfmoore avatar Sep 03 '22 10:09 pfmoore

Test failures are because of #170.

pfmoore avatar Sep 03 '22 15:09 pfmoore

Added the zipapps to the PR, as per comments on #159

pfmoore avatar Sep 04 '22 16:09 pfmoore

Hmm, why are the zipapps different when generated here? Maybe Windows vs Unix weirdness?

pfmoore avatar Sep 04 '22 16:09 pfmoore

Grr, it's file timestamps.

pfmoore avatar Sep 04 '22 16:09 pfmoore

Sigh, something else as well. I'll have to look at this later. But I will note that I don't think that checking that the files are binary-identical is a particularly good test here. It may be that it's flagging a real issue, but (as the timestamps show) it's just as likely it's a functionally irrelevant discrepancy.

pfmoore avatar Sep 04 '22 16:09 pfmoore

I’d argue that we should also have signatures and checksums on these files, at some point in the future (like we should for get-pip.py, as the open issues suggest), so generating them in a stable/reproducible manner is important. Besides, if we don’t have this be reproducible, it’d mean that we’d regenerate all of these files on every instance that we run the generate command (i.e. new binary files being checked into git for every pip release).

Finally, I’ll remind you that all these checks and mechanisms were originally designed for the existing text files and making sure those are kept up to date (those are generated in a reproducible manner, though they bake in an existing “stable” wheel). I didn’t think of binary blobs when I was writing this CI. :P

pradyunsg avatar Sep 04 '22 17:09 pradyunsg

Finally, I’ll remind you that all these checks and mechanisms were originally designed for the existing text files and making sure those are kept up to date (those are generated in a reproducible manner, though they bake in an existing “stable” wheel). I didn’t think of binary blobs when I was writing this CI. :P

Understood, I wasn't suggesting the checks themselves were at fault, just that they don't take into account the subtleties of binary files - and I'd not been aware of the check when I made this change, so I didn't anticipate needing to deal with it.

Regarding signatures and checksums, I see no issue with checksums, but I'm not sure how signatures would work. Do we have a "PyPA/pip" signature that we'd use for this? Expecting individual committers to use a personal signature doesn't seem ideal (I note that I can't currently do a release here, as the release automation does git tag --sign and I don't have a GPG signature). I think this is something that probably needs discussion between the pip committers somewhere - but I don't know what would be a good forum for that.

pfmoore avatar Sep 04 '22 20:09 pfmoore

OK, that's fixed the reproducibility issue. Sorry about the complaints re the checks. I'm still not sure I follow the logic where we check in the generate script and the files, and then we have CI that checks that the generate produces the files we just added (I don't really understand what we're trying to check against) but that doesn't mean it's OK for me to come in after the fact and complain about it just because I don't like doing what it requires. Blame the fact that I was trying to do things in a hurry and got frustrated (not an excuse, I know).

Let me know if you have any further concerns with this PR.

pfmoore avatar Sep 04 '22 20:09 pfmoore

As there have been no further comments, and I don't want to leave this until the next pip release is looming, I'd like to merge this as it stands and address any CI questions via followups. @pradyunsg are you OK with that? If so, could you approve the PR? Having requested your review, I'd rather not merge without your approval.

pfmoore avatar Sep 17 '22 09:09 pfmoore

Unfortunately, I just found some issues in manual testing. I could have sworn I tested with 3.9/3.8, but maybe not (and what's more, the pip zipapp tests are only for 3.10).

The problem is in the stdlib zipimport with a subdirectory:

import sys
import importlib.resources as ir

sys.path.insert(0,"../public/pip.pyz/lib")
print(len(ir.read_text("pip._vendor.certifi", "cacert.pem")))

This fails with an exception in Python 3.9 and earlier. As 3.9 is no longer receiving non-security fixes, we're not going to get this fixed.

The issue seems to be with using a "lib" subdirectory in the zipapp and adding that to sys.path. So I've removed that directory layer from the zipfile and the top level now contains __main__.py and pip/, and we just add the pyz to sys.path. I can't think of a way that this will cause an issue.

I've also added a (very simplistic) test that the zipapp works (i.e., runs pip or reports an incompatibility error) on all Python versions we test against, just to ensure we don't break this in the future.

pfmoore avatar Sep 17 '22 16:09 pfmoore

then we have CI that checks that the generate produces the files we just added (I don't really understand what we're trying to check against)

I guess I never responded to this and I'll try one last time...

Think of this similarly to pip's vendoring CI check. It's primarily because we aren't the only people who might open a PR here (wherein we know exactly how stuff works) and to reduce the effort on the reviewers' end that the scripts included in the commit match what would be generated + work as advertised.

Over on pip, we check that every PR that modifies vendoring files changes them in a way that's repeatable through our vendoring process which maintains clear patches and has pinned Python versions. This is validated in CI, so that means that any changes made in vendored files are clearly tracked in our patches and vendor.txt file.

It's the same here even though it's useful for somewhat different reasons -- we're effectively embedding blurbs of opaque data into scripts that'll get executed in thousands of machines daily, and it's useful to have those opaque blurbs be generated in a manner that they can be validated and to ensure that the changes are reflected in both the templates and the generated scripts simultaneously. It also ensures that if we change either, we change both and that they evolve in sync.

In effect, the public directory is like pip's own _vendor directory. If we make changes there, we want those to happen because the generation inputs were changed (here: templates and the script, there: vendor.txt and patches) and that changes to those inputs are reflected in the output.

pradyunsg avatar Sep 17 '22 19:09 pradyunsg