Provide verifiably reproducible wheels on PyPI

Open tabbyrobin opened this issue 7 months ago • 7 comments

It would be nice if the PyPI-hosted wheels were verifiably bit-for-bit reproducible. I've experimented with the codebase and I've identified a few changes that would be needed to make this happen. If these changes are welcome, I can prepare some pull requests.

Short-term:

  • Either replace (some) calls to uv build with other commands (maybe uv venv + ... + pip wheel), or wait for a fix to https://github.com/astral-sh/uv/issues/13096
  • Deterministically normalize the .whl ZIP metadata (timestamps, permissions etc.)
  • Would also be nice: Change (some) calls to actions/checkout to instead use git commands directly. (This would facilitate verifying the reproducibility, which involves executing the GitHub actions yaml locally using a local GHA-runner, such as Nektos Act.)
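To make the ZIP normalization item concrete, here is a minimal sketch that uses only Python's standard zipfile module. The function name normalize_wheel, the fixed 1980-01-01 timestamp, and the choice to sort entries by name are assumptions made for this example; none of them come from the project, uv, or auditwheel.

```python
import zipfile

# Assumption: one fixed timestamp for every entry (ZIP cannot store dates before 1980).
FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def normalize_wheel(src_path, dst_path):
    """Rewrite a .whl with sorted entries, fixed timestamps, and fixed permissions."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        # Sort so the output does not depend on the order the builder added files.
        for old in sorted(src.infolist(), key=lambda info: info.filename):
            new = zipfile.ZipInfo(old.filename, date_time=FIXED_DATE_TIME)
            new.create_system = 3  # always record entries as Unix
            if old.is_dir():
                new.compress_type = zipfile.ZIP_STORED
                new.external_attr = (0o40755 << 16) | 0x10
            else:
                new.compress_type = zipfile.ZIP_DEFLATED
                # Keep only whether the file was executable; drop all other mode bits.
                executable = (old.external_attr >> 16) & 0o111
                new.external_attr = (0o100755 if executable else 0o100644) << 16
            dst.writestr(new, src.read(old))
```

This only touches archive metadata, so the file contents and the hashes listed in RECORD are unchanged. Two caveats: the compressed bytes can still differ between zlib implementations or versions, and the wheel specification encourages placing the .dist-info files at the end of the archive, which plain name sorting does not guarantee.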

Long-term:

  • Move complex build logic out of GHA yaml and into more general tools. The GHA yaml would then be a thin wrapper around that logic. This would facilitate verifying build reproducibility locally, without relying on local GHA-runners like Nektos Act.
  • Add CI tests comparing checksums of builds done twice in a row. This would catch reproducibility regressions.
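The checksum comparison could be sketched roughly as follows. The helper names sha256_of and diff_wheelhouses are hypothetical, as is the assumption that each of the two builds writes its wheels into its own directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def diff_wheelhouses(first_dir, second_dir):
    """Return sorted wheel names that are missing from one directory or differ in hash."""
    first = {p.name: sha256_of(p) for p in Path(first_dir).glob("*.whl")}
    second = {p.name: sha256_of(p) for p in Path(second_dir).glob("*.whl")}
    return sorted(
        name
        for name in first.keys() | second.keys()
        if first.get(name) != second.get(name)
    )
```

A CI job would run the build twice into separate output directories and fail if diff_wheelhouses returns a non-empty list.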

(Note: For the moment, I have only focused on the Linux wheel builds.)

Somewhat related issue: #12764

tabbyrobin avatar Apr 27 '25 00:04 tabbyrobin

While I think reproducible builds are generally good, our builds are just invoking upstream tools:

      - name: Build the wheel
        run: |
          if [ -n "${{ matrix.PYTHON.ABI_VERSION }}" ]; then
              PY_LIMITED_API="--config-settings=build-args=--features=pyo3/abi3-${{ matrix.PYTHON.ABI_VERSION }}"
          fi

          OPENSSL_DIR="/opt/pyca/cryptography/openssl" \
              OPENSSL_STATIC=1 \
              uv build --python=/opt/python/${{ matrix.PYTHON.VERSION }}/bin/python --wheel --require-hashes --build-constraint=$BUILD_REQUIREMENTS_PATH $PY_LIMITED_API cryptography*.tar.gz -o tmpwheelhouse/
        env:
          RUSTUP_HOME: /root/.rustup
      - run: auditwheel repair --plat ${{ matrix.MANYLINUX.NAME }} tmpwheelhouse/cryptography*.whl -w wheelhouse/

So from that perspective, I believe the correct thing here is for upstream tools to either have deterministic builds by default, or provide a --deterministic and then we can take advantage of it.

But I don't think it makes sense for every single package to maintain its own logic for mucking with timestamps.

alex avatar Apr 27 '25 00:04 alex

Thanks for your response @alex.

I agree that it's ideal for upstream tools to handle as much as possible. I am in the process of filing relevant issues with upstream tools.

Would this project be favorable to some of the other changes I suggested? In particular:

  • actions/checkout => git commands (not all instances, just some)
  • Long-term: Move complex build logic out of GHA yaml and into more general tools.

tabbyrobin avatar Apr 28 '25 23:04 tabbyrobin

I'm pretty ambivalent-to-negative on replacing actions/checkout with raw git commands. Particularly for doing sparse checkouts, it's considerably simpler.

What complex build logic are you talking about? The wheel-builder is more or less entirely an invocation of uv build (+ auditwheel on linux). Everything else is downloading the artifacts we need (openssl + the sdist) or testing the wheel. In general I'm supportive of putting stuff in scripts rather than yaml, but it's not clear what logic you want to move.

alex avatar Apr 28 '25 23:04 alex

The sparse checkout commands are, unfortunately, among the instances I would like to replace. The reason is that any actions/checkout step that specifies a ref requires GH authorization (GITHUB_TOKEN etc.). This is really just a blemish in the actions/checkout implementation. But when running a build verification locally, on arbitrary hardware, it's very much non-ideal for the build scripts to require a GH account and access to the account auth.

I currently have a rough script that executes wheel-builder.yml locally, and one of the main sticking points is actions/checkout with refs.

As for my "long-term" proposal to move logic out of GHA, I don't yet have a detailed plan. But it's good to know that you'd generally be supportive of replacing the yaml with scripts. For me to answer that question in more detail, it would be good to know:

What tooling would you support moving to?

The motivating goal is to enable executing the exact same build logic locally on arbitrary hardware, without needing a GHA local-runner, and without maintaining an entirely separate build script that replicates the yaml. The idea is that there would be a portable core expressing the build logic, and then two thin frontend wrappers on top: one to run the build in-cloud, and one to run it locally.

(So the answer to your question might be "all of the build logic", but it also might not be -- I'd really have to get started, with concrete tooling, to know.)

For the long-term goals, I would ideally mean not just Linux, but also macOS and Windows.
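As a rough illustration of the "portable core plus thin wrappers" idea, the sketch below builds the same uv build invocation quoted earlier in the thread from plain parameters, so that a CI step and a local runner could both call it. The function names wheel_build_command and wheel_build_env are hypothetical, and Python is used here only for the sake of the example; the flags and paths are copied from the quoted workflow step.

```python
def wheel_build_command(python_version, sdist, constraints,
                        abi_version=None, out_dir="tmpwheelhouse/"):
    """Return the argv list for the wheel build, mirroring the quoted workflow step."""
    cmd = [
        "uv", "build",
        f"--python=/opt/python/{python_version}/bin/python",
        "--wheel",
        "--require-hashes",
        f"--build-constraint={constraints}",
    ]
    if abi_version:
        # Same limited-API setting the workflow passes when ABI_VERSION is set.
        cmd.append(f"--config-settings=build-args=--features=pyo3/abi3-{abi_version}")
    cmd += [sdist, "-o", out_dir]
    return cmd


def wheel_build_env(base_env):
    """Return a copy of base_env with the OpenSSL settings from the workflow step."""
    env = dict(base_env)
    env["OPENSSL_DIR"] = "/opt/pyca/cryptography/openssl"
    env["OPENSSL_STATIC"] = "1"
    return env
```

A frontend, whether in the cloud or on a local machine, would then only need to pass the result to something like subprocess.run(cmd, env=env, check=True).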

tabbyrobin avatar Apr 29 '25 01:04 tabbyrobin

The wheel builder ought to be trivial; I'm not sure I see a need for anything more than a bash script (which, after all, is what the yaml is really doing).

alex avatar Apr 29 '25 01:04 alex

In addition to bash, how about Docker? And do you have any opinions about using cibuildwheel?

Having considered things a bit, I would suggest moving almost all meaningful logic from wheel-builder.yml.

That means this logic would move:

  • creation of sdists
  • manylinux matrix generation
  • logic for building each individual wheel (uv build, auditwheel, testing...)

This logic would stay in yaml:

  • any steps for actions/upload-artifact or actions/download-artifact
  • the NodeJS workarounds

(Local-build containers would instead use volume mounts to pass artifacts around.)

I am still just focusing on Linux, but for eventual Windows and Mac OS builds, that could also mean migrating the following to scripts (or just creating duplicate implementations):

  • https://github.com/pyca/infra/blob/main/.github/workflows/build-macos-openssl.yml
  • https://github.com/pyca/infra/blob/main/.github/workflows/build-windows-openssl.yml

Is it valuable to preserve current log granularity in the GHA web UI? I could expose each step as a bash function. And in the yaml, each step would be a bash one-liner (with no logic, just transparent passthrough of relevant args).

tabbyrobin avatar May 03 '25 19:05 tabbyrobin

The Linux jobs are already running in Docker containers, so there should be no need to nest Docker containers further.

The sum total of the sdist building is uv build --build-constraint=$BUILD_REQUIREMENTS_PATH --require-hashes --sdist, so I'm not sure what you're imagining, but I don't really see anything to abstract out there.

At this point, it may make more sense for you to produce a working PoC. I can't promise we'll accept it, but it's becoming difficult to discuss these things in the abstract and I'm having trouble following what you're proposing.

As a note, anything that can be structured as a small cleanup PR is most likely to be accepted.

alex avatar May 03 '25 19:05 alex

Given the lack of progress here, and the fact that if/when uv and auditwheel produce deterministic wheels we will automatically take advantage, I'm going to wontfix this.

alex avatar Oct 04 '25 12:10 alex