Create Binary Transparency for Artifact Registries guide
Fixes https://github.com/ossf/wg-securing-software-repos/issues/47
woodruffw 16 hours ago said:
To tack onto what @haydentherapper said: PyPI's behavior is subtle in that filenames are always unique and immutable on PyPI, but releases themselves are not. In other words: a project foo that gets deleted or turned over to a new user can't overwrite foo-1.2.3.tar.gz if that distribution file was already uploaded by a previous maintainer, but the new maintainer might be able to upload foo-1.2.3-py3-none-any.whl or similar if no wheel was previously uploaded for that version of foo.
In practice this means that resolving foo==1.2.3 isn't guaranteed to be stable for a given host, since the new maintainer can always upload a new (unique) file to the old version that's more specific/matches the host's target configuration more precisely.
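To make that concrete, here is a minimal sketch of the effect. It is not pip's actual resolver: `pick_file`, `HOST_TAGS`, and the ranking rule are hypothetical simplifications (no compressed tag sets, no build tags). It assumes only that an installer prefers a wheel compatible with the host over the sdist.

```python
# Hypothetical host: tags ordered from most to least specific.
HOST_TAGS = [
    ("cp312", "cp312", "manylinux_2_17_x86_64"),
    ("py3", "none", "any"),
]


def pick_file(files, host_tags):
    """Return the file a simplified installer would choose for one release."""
    best, best_rank = None, None
    for name in files:
        if name.endswith(".whl"):
            # Wheel filenames end in -{python}-{abi}-{platform}.whl
            tag = tuple(name[: -len(".whl")].split("-")[-3:])
            if tag not in host_tags:
                continue  # wheel is not compatible with this host
            rank = host_tags.index(tag)  # lower rank = more specific match
        elif name.endswith(".tar.gz"):
            rank = len(host_tags)  # the sdist is the fallback of last resort
        else:
            continue
        if best_rank is None or rank < best_rank:
            best, best_rank = name, rank
    return best


# Release as uploaded by the original maintainer: sdist only.
before = ["foo-1.2.3.tar.gz"]
# Same version after a new maintainer adds a (unique) wheel filename.
after = before + ["foo-1.2.3-py3-none-any.whl"]

print(pick_file(before, HOST_TAGS))
print(pick_file(after, HOST_TAGS))
```

The existing sdist is never overwritten, yet the file selected for `foo==1.2.3` on this host changes once the wheel appears.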
That sounds extremely problematic to me. I expect that repositories must sometimes remove packages of a given version (e.g., by court order), but I think most users would expect that a given version# would be stable. This loophole is a great way to hide attacks. Is this functionality critical to PyPI somehow? Could PyPI be changed to prevent it (at least, say, after a day or so of the "initial" version being uploaded)?
@david-a-wheeler
Is this functionality critical to PyPI somehow?
This property is used because the ABI of CPython, architectures, and platforms (known in Python-land as "tags") aren't known in advance and new ones are added over time. With this property Python packages can build and release new artifacts that are compiled for new "tags" without issuing a whole new release (since many times the source code doesn't change, only the tool for building).
I'm not sure how often this happens in practice and whether or not it's worth the additional risk because I don't maintain many packages that have these requirements.
This property also exists because artifacts are typically built in different processes so arrive at different times. There's currently no mechanism for "drafting" a release, so it'd be a race to get all your artifacts built before a timer expired.
I think adding support for "draft" releases would make removing this property of an index viable, but even then I am not sure of the impact for maintainers, needs more studying to be sure.
This loophole is a great way to hide attacks. Is this functionality critical to PyPI somehow? Could PyPI be changed to prevent it (at least, say, after a day or so of the "initial" version being uploaded)?
There's a long-ish thread on the current behavior here: https://discuss.python.org/t/restricting-open-ended-releases-on-pypi/43566
The TL;DR of it is that PyPI having "open-ended" releases is currently relied upon for some packaging workflows, e.g. there are maintainers who update their releases to contain wheels for new versions of Python rather than publishing an entirely new version with no functional changes. There's also some debate about how serious the vector is, given that (1) the attacker can only upload new files, not overwrite existing release files, and (2) the attacker could always just make a new release instead, given that Python as an ecosystem tends to avoid exact version-pinning.
But apart from that, +1 to everything @sethmlarson said, especially drafting -- there is a PEP that enables support for drafting on PyPI and other indices, and I've (very) recently been given the resources to begin work on actually implementing it 🙂
I'm not sure how often this happens in practice and whether or not it's worth the additional risk because I don't maintain many packages that have these requirements.
It happens quite often!
That sounds extremely problematic to me. I expect that repositories must sometimes remove packages of a given version (e.g., by court order), but I think most users would expect that a given version# would be stable. This loophole is a great way to hide attacks.
The risk is entirely mitigated by using lockfiles or hash-pinned requirements files.
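As a rough illustration of why hash pinning closes the gap, the sketch below checks a downloaded artifact against the digests recorded at lock time. `verify_artifact` and the byte strings are hypothetical stand-ins. In practice pip performs this check itself when a requirements file carries `--hash=sha256:...` entries and is installed with `--require-hashes`.

```python
import hashlib


def verify_artifact(data: bytes, allowed_sha256: set[str]) -> bool:
    """Accept an artifact only if its SHA-256 digest was pinned at lock time."""
    return hashlib.sha256(data).hexdigest() in allowed_sha256


# Stand-in contents for the sdist that existed when the lockfile was written.
original_sdist = b"contents of foo-1.2.3.tar.gz"
pinned = {hashlib.sha256(original_sdist).hexdigest()}

# Stand-in contents for a wheel uploaded to the same version later.
later_wheel = b"contents of foo-1.2.3-py3-none-any.whl"

print(verify_artifact(original_sdist, pinned))  # pinned file is accepted
print(verify_artifact(later_wheel, pinned))     # unpinned file is rejected
```

Because the later file's digest was never recorded, it fails verification even though it carries the same version number.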
I think adding support for "draft" releases would make removing this property of an index viable, but even then I am not sure of the impact for maintainers, needs more studying to be sure.
Draft releases definitely help with the "artifacts come from different places at different times" issue, and are desirable for other reasons, but they don't resolve the "build new artifacts for old releases against new Python versions, ABIs or platforms" use case, so they are not a panacea here.
Time and time again this working group struggles with terminology. In keeping with the working group name, we usually use "repository" and "software repository" where this document uses "registry" and "artifact / package registry".
It happens often, I'm afraid.
Obviously people differ in what they mean by terms. For example, I should note that I don't normally use these terms as synonyms. Here's how I normally use the terms:
A (package) repository stores information. PyPI and CPAN, for example, actually store the packages that can be installed by a package manager. A source repository stores the source code that you might use (e.g., to build).
A registry is an "official" record of "where to get the information" - but often doesn't store the data itself. Registries redirect users to 1+ repositories. Depending on the registry, different components in the registry might be served by different repositories. I know quicklisp works this way, and I think others do too.
I don't claim everyone uses the terms the same way. That's part of the challenge here :-).
Let's standardize on one term in the document, and it might be a good idea to have a terminology / definition section near the front.
100% agree. Trying to get everyone to change terminology throughout the world to the same thing is er, hard. Documenting definitions of key terms, as they are used in the document, definitely sounds like a way forward.
Now for the tricky one! Generally speaking, our other guides cover an existing successful implementation and high-level guidance on how other repositories can implement it.
I think we're pretty close to this with PEP 740 for PyPI, but I agree that we might want to wait to publish until that has been fully baked and any unforeseen issues are sorted out.