warehouse icon indicating copy to clipboard operation
warehouse copied to clipboard

Add copy of `release_files.path` to `file_registry` for long-term storage

Open miketheman opened this issue 8 months ago • 3 comments

Background

A Filename registry is maintained to persist even after files have been removed, to prevent re-upload, re-use of that exact filename.

When a release's files are removed, the ability to surface their storage location is effectively lost, as the path generator tool uses hashers to determine the placement from the file's hashes, which are also not preserved. https://github.com/pypi/warehouse/blob/aeb9ccd0d6eb494a105668ea2798edd925a80712/warehouse/forklift/legacy.py#L1525-L1536

The path data exists in the BigQuery Project Metadata Table so it's semi-available, but harder to get to during routine operations and investigations.

This could also feasibly be used via Inspector or something similar, if made available via some API.

Proposal

A few steps to tackle the problem, can definitely change based on further findings or ideas.

  • Add Filename.path column (easy)
  • Populate Filename.path during file upload, around here (easy)
  • Backfill the column from existing data in File.path for filenames that match (easyish)
  • Backfill the column for remaining empty entries from BigQuery data (medium to hard)

miketheman avatar Mar 11 '25 16:03 miketheman