
Refactor Upload + Better Metadata Handling

dstufft opened this pull request 2 years ago • 4 comments

This is still a work in progress: none of the tests are working/updated and there's still pending work to be done (see everything with a FIXME comment). I'm opening this now as a draft, though, to give people a chance to start looking at it and sharing their thoughts.


The upload endpoint in Warehouse is a gnarly endpoint. It's currently a single ~1600 line file and it is not well factored. Following the logic of what is happening and where it's happening requires carefully reading a large view function whose different parts have inter-dependencies that are not particularly clear.

To make matters worse, the metadata handling in the upload endpoint uses a bespoke metadata implementation, which sometimes differs from how other systems validate. And due to the historical "shape" this API took, the metadata that we're validating and storing isn't actually the metadata that clients will be using (that metadata lives in the uploaded artifacts); instead it's whatever the upload client sent alongside the upload [^1].

So what does this Pull Request do? It refactors/revamps the upload handling to try to fix all of these issues and move us to a better overall standing.

Concretely, it does the following:

  • Breaks the upload() view up into smaller functions, making it easier to follow the overall flow of the upload process without getting bogged down in details, and making the requirements/dependencies of these "sub sections" clearer and more obvious.
    • We attempt to minimize the back and forth between validation, protocol handling, and constructing database objects.
  • Extracts the metadata out of the artifact, and uses that to populate the database (see the sketch after this list).
    • Currently this only works for wheel files, but we are set up to add support for sdists too with Metadata 2.2.
    • Falls back to using the metadata from the form data as we do today, but only for artifacts that we know we can't extract metadata from (i.e. we don't treat failing to extract from a wheel as a non-error that falls back to form data).
  • Uses packaging.metadata as the canonical representation of our metadata and to handle validation of the metadata.
    • We still layer our own validation on top of packaging.metadata, because we have extra constraints that are special to Warehouse.
  • Fixes the metadata handling for some fields that we were either accidentally ignoring (we support up to metadata 2.2, but we missed implementing some fields) or whose values we were handling incorrectly (multi-use fields being treated like single-use fields, etc).
  • Stops supporting md5 digest as a valid digest to verify an uploaded file.
    • We still compute/store it, but we no longer accept uploads that only have an md5. They need to have a sha256 or blake2_256.
  • Stops supporting uploads for ancient releases where Release.canonical_version returns multiple objects [^3].
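
To give a flavor of what the extraction + validation step looks like, here's a rough sketch using the standard library plus packaging.metadata (the helper name and file path are illustrative, not the actual code in this PR):

```python
import zipfile

from packaging.metadata import Metadata


def read_wheel_metadata(wheel_path: str) -> bytes:
    """Pull the METADATA file out of a wheel's .dist-info directory."""
    with zipfile.ZipFile(wheel_path) as zf:
        for name in zf.namelist():
            # A valid wheel has exactly one top-level *.dist-info/METADATA member.
            if name.count("/") == 1 and name.endswith(".dist-info/METADATA"):
                return zf.read(name)
    raise ValueError(f"{wheel_path} does not contain a METADATA file")


# Parse + validate into the canonical representation; ambiguous or invalid
# metadata raises instead of being silently accepted.
metadata = Metadata.from_email(read_wheel_metadata("example-1.0-py3-none-any.whl"))
print(metadata.name, metadata.version)
```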

This should mean that the metadata that we record in Warehouse is much more likely to match what installers like pip will see when introspecting the artifact itself.

It also means that Warehouse is going to be more strict in what it accepts, because the metadata parsing in packaging.metadata has been carefully written to avoid silently allowing possibly ambiguous data (and as far as I know, it's the only parser that currently does that). That means that cases like:

  • Multiple uses of single-use keys will be errors (currently all other parsers just pick either the first or last value).
  • Unknown fields will be errors (currently all other parsers just skip them).
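
For example, something along these lines (hypothetical input; the exact error reporting is packaging's, not Warehouse's):

```python
from packaging.metadata import Metadata

AMBIGUOUS = """\
Metadata-Version: 2.1
Name: example
Version: 1.0
Summary: first summary
Summary: second summary
Some-Unknown-Field: whatever
"""

try:
    Metadata.from_email(AMBIGUOUS)
except ExceptionGroup as group:  # built-in on Python 3.11+
    # Both the repeated single-use Summary field and the unrecognized
    # Some-Unknown-Field are reported as errors, rather than one value
    # being silently picked or the field being silently dropped.
    for error in group.exceptions:
        print(error)
```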

Because we're extracting metadata out of the artifact rather than using the form data (where possible) we had to change the order of operations, which previously looked something like:

  1. Process / Validate the metadata (from the multipart form data) and the "upload data" (file hashes, package type, filename, etc).
  2. Get or create the Project declared in the metadata.
  3. Check if the request is authorized to upload to Project.
  4. Check if the description can be rendered.
  5. Get or create the Release for the version declared in the metadata.
  6. Do more validations around the filename (project-version.ext? invalid characters, etc).
  7. Buffer the uploaded file to a temporary location on disk.
  8. Do more validations (valid dist? duplicate file? etc).
  9. Check if the file is a wheel, and do additional validation that the wheel filename is valid.
  10. If the file is a Wheel, extract the METADATA file.
  11. Create the File object in the database.
  12. Upload the artifact + metadata to S3.

However, the new order of operations looks more like this:

  1. Process / Validate the "upload data" (file hashes, package type, filename, etc) and only the project name and version (from form data).
    • This includes all filename validation that we can do without access to the database or metadata besides name + version.
  2. Get or create the Project from the name we got from the form data.
  3. Check if the request is authorized to upload to Project.
  4. Do any validations of the filename that requires access to the database (duplicate file checks, etc).
  5. Buffer the uploaded file to a temporary location on disk.
  6. Validate that the file itself is a valid distribution file.
  7. Extract the METADATA file (if we can, currently wheel only).
  8. Construct a validated packaging.metadata.Metadata object, using the extracted METADATA or the form data if the dist isn't the kind we can extract METADATA from.
    • This includes fully validating the metadata with any additional rules we add onto it that don't require access to the database.
    • This also includes checking whether metadata.description can be rendered.
  9. Get or create the Release for the version declared in the metadata.
  10. Create the File object in the database.
  11. Upload the artifact + metadata to S3.

We do end up shifting more of the filename validation to happen prior to ever buffering the uploaded file, which should allow those particular checks to bail out faster and do less work. However, we do shift metadata validation to happen after we've buffered the uploaded file, which will delay those particular checks to later in the request/response cycle [^2].
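
As an illustration of the kind of check that can now run before the file is ever buffered, the name/version encoded in a wheel or sdist filename can be compared against the form-submitted name/version using packaging alone (a hypothetical helper, not the PR's actual implementation):

```python
from packaging.utils import (
    canonicalize_name,
    canonicalize_version,
    parse_sdist_filename,
    parse_wheel_filename,
)


def check_filename(filename: str, form_name: str, form_version: str) -> None:
    """Raise if a distribution filename doesn't match the form's name/version."""
    if filename.endswith(".whl"):
        name, version, _build, _tags = parse_wheel_filename(filename)
    elif filename.endswith(".tar.gz"):
        name, version = parse_sdist_filename(filename)
    else:
        raise ValueError(f"unsupported file type: {filename!r}")

    if name != canonicalize_name(form_name):
        raise ValueError("filename does not match the project name from the form data")
    if canonicalize_version(version) != canonicalize_version(form_version):
        raise ValueError("filename does not match the version from the form data")


# Needs no database access and no uploaded bytes, so it can run up front:
check_filename("example_pkg-1.0-py3-none-any.whl", "Example-Pkg", "1.0")
```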

Another subtle change is that by moving the duplicate file check prior to buffering the uploaded file to disk, we have to implicitly trust that the sha256 and/or blake2_256 digest that the client provides us is accurate when deciding to no-op the upload. This should be perfectly safe, because we treat the entire upload as a no-op (including dooming the transaction), so the most a malicious client can do is trick Warehouse into either turning an error into a no-op, or a no-op into an error.
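
A minimal sketch of that decision, assuming a hypothetical existing-file record with sha256_digest and blake2_256_digest attributes (not the actual Warehouse models or form objects):

```python
def is_noop_duplicate(existing_file, sha256_digest, blake2_256_digest) -> bool:
    """Decide whether re-uploading an already-existing filename is a no-op.

    The digests come straight from the client's form data; we have not hashed
    the uploaded bytes yet. That's acceptable because the whole request is
    treated as a no-op (the transaction is doomed either way), so a lying
    client can only turn an error into a no-op or a no-op into an error.
    """
    if existing_file is None:
        return False
    # Note: an md5 digest alone is intentionally no longer enough to match.
    if sha256_digest and sha256_digest.lower() == existing_file.sha256_digest:
        return True
    if blake2_256_digest and blake2_256_digest.lower() == existing_file.blake2_256_digest:
        return True
    return False
```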

Things still left to do:

  • [ ] Sanitize the Metadata of UNKNOWN values.
  • [x] Actually drop support for md5 as a "verifying digest".
  • [x] Implement support for pulling version out of the form data and moving all of the name/version checks to happen prior to buffering the file.
  • [ ] Sort out the best place to render the description (it would be nice to include it in Metadata, but the Metadata class doesn't have support for the rendered description).
  • [ ] Migrate the database to correctly handle the incorrectly handled Metadata 2.2 or earlier fields.
  • [ ] Determine how to update BigQuery to correctly handle the incorrectly handled Metadata 2.2 or earlier fields.
  • [ ] Validate that the name/version from the form data matches the name/version from the extracted metadata (roughly like the sketch after this list).
  • [ ] Finish error handling
  • [ ] Update / Write tests.
  • [ ] https://github.com/pypi/warehouse/issues/14717
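
For the form-vs-metadata consistency item above, the comparison could look roughly like this (a sketch with hypothetical argument names, comparing canonicalized values so that spellings like "Foo_Bar" vs. "foo-bar" or "1.0" vs. "1.0.0" don't count as mismatches):

```python
from packaging.metadata import Metadata
from packaging.utils import canonicalize_name
from packaging.version import Version


def form_matches_metadata(form_name: str, form_version: str, metadata: Metadata) -> bool:
    """Check that the form-submitted name/version agree with the artifact's own metadata."""
    return (
        canonicalize_name(form_name) == canonicalize_name(metadata.name)
        and Version(form_version) == metadata.version
    )
```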

I have more improvements to this I plan to do as well, but I'm going to keep them out of this PR since it's already big enough.

[^1]: The most popular tool for uploading is twine, which just reads the METADATA or PKG-INFO files and then sends that along. In theory this should be equivalent to us extracting the METADATA files ourselves. In practice that isn't always true: any time the upload client reads the metadata, it risks transforming it in incompatible ways, dropping fields, etc., in ways that are hidden from us unless we look at the METADATA file itself.

[^2]: While this PR currently ignores all of the core metadata that is being sent in the form data besides name/version, we could still consume that data and validate it prior to buffering the uploaded file to disk, then extract the METADATA file, parse it, and ensure that it matches the metadata that was sent in the form data.

  This PR chooses not to do that because the metadata in the file is the authoritative copy, and if it differs from the form data it likely means the upload client had a bug... but we can choose whether we turn that bug into an error or just silently do the right thing... so we just silently do the right thing.

[^3]: The structure of this made things more difficult to refactor out, and it's been like 10+ years since we allowed uploading multiple distinct versions that all canonicalize to the same thing, so nobody should be hitting this today unless they're trying to upload something to an ancient version.

dstufft · Oct 08 '23 22:10

Note: I opened https://github.com/pypi/warehouse/issues/14717 to discuss a change I'm planning to make that affects this PR.

dstufft · Oct 09 '23 00:10

I guess I should also mention: once I have this PR in a state that I'm happy with as an outcome, I'll probably try to break it up into smaller PRs to make them easier to review. It was just easier to figure out how this should look by lumping it all together.

dstufft · Oct 09 '23 00:10

Blocked on https://github.com/pypa/packaging/issues/733

dstufft · Oct 09 '23 02:10

Also blocked on https://github.com/pypa/packaging/pull/736.

dstufft · Oct 09 '23 06:10