
Filestore Improvements

mitsuhiko opened this issue 2 years ago · 3 comments

This is a meta RFC to cover some of the potential improvements to our filestore system.

Rendered RFC

mitsuhiko · Jul 13 '23

So if I read this correctly, you would still want to chunk files, but not save those chunks deduplicated, thus avoiding the atomic reference counting problem?

How would you manage the migration from the old system to the new one? Will there ever be a cut-off date at which you can just hard-drop the old tables and GCS storage?


As this whole blob-related discussion started off with my discovery of a race condition between blob upload and blob deletion: would this be solved by splitting off the staging area for uploads from the long-term storage, as you suggested?

As a reminder, the race condition is actually two separate TOCTOU (Time-of-Check-to-Time-of-Use) problems:

  1. Before even uploading, sentry-cli asks the backend server which chunks are missing, based on the chunk hash. Between this check and the final file assembly, the blob can be deleted, failing the assemble.
  2. When assembling the final File, it first queries all the blobs based on their chunk hash. Between this check and actually inserting a reference into the BlobIndex table, the blob can be deleted, failing the assemble.
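The two races above can be sketched in a few lines of Python over an in-memory model (all names here are illustrative, not the actual Sentry code): both steps check for a blob's existence at one point in time and use it at a later point, with nothing stopping a deletion in between.

```python
# Toy model of the blob store and the BlobIndex table (hypothetical names).
blobs = {"abc123": b"chunk data"}
blob_index = []

def missing_chunks(hashes):
    # Time of check #1: sentry-cli asks which chunks still need uploading.
    return [h for h in hashes if h not in blobs]

def assemble(hashes):
    # Time of check #2: look up all blobs by chunk hash.
    found = {h: blobs[h] for h in hashes if h in blobs}
    if len(found) != len(hashes):
        return False  # assemble fails: a referenced blob is gone
    # ... a concurrent deletion job could also remove a blob right here,
    # after the check but before the BlobIndex insert below ...
    # Time of use: insert references into the BlobIndex table.
    blob_index.extend(found)
    return True

assert missing_chunks(["abc123"]) == []  # client is told: nothing to upload
del blobs["abc123"]                      # deletion races in between
assert assemble(["abc123"]) is False     # the later assemble fails
```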

I believe the first problem can be solved by a dedicated per-org staging area, one that refreshes a chunk's TTL on every query by chunk hash.
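A minimal sketch of that idea, assuming an in-memory staging area with a fixed TTL (the class, method names, and TTL value are all made up for illustration): because the missing-chunks query refreshes the TTL, a chunk the client was just told is present cannot expire before the assemble that follows.

```python
import time

STAGING_TTL = 3600.0  # seconds; illustrative value, not a real config

class StagingArea:
    """Hypothetical per-org staging area for freshly uploaded chunks."""

    def __init__(self):
        self._chunks = {}  # chunk_hash -> (data, expires_at)

    def put(self, chunk_hash, data):
        self._chunks[chunk_hash] = (data, time.monotonic() + STAGING_TTL)

    def query(self, chunk_hash):
        entry = self._chunks.get(chunk_hash)
        if entry is None:
            return None
        data, _ = entry
        # Refresh the TTL on every query by chunk hash, as suggested above.
        self._chunks[chunk_hash] = (data, time.monotonic() + STAGING_TTL)
        return data

    def expire(self, now=None):
        # Drop every chunk whose TTL has elapsed.
        now = time.monotonic() if now is None else now
        self._chunks = {h: e for h, e in self._chunks.items() if e[1] > now}
```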

The second problem can either be solved by not storing blobs deduplicated, as suggested. Or, I believe, an epoch-based reclamation scheme could be used while still keeping deduplication:

  • Deletion would schedule a chunk for deletion at epoch N.
  • When assembling, we can use an UPSERT to increment the epoch in the database. (time of check)
  • In between, the deletion job would delete records with a matching epoch N, but it would do nothing here, as the epoch was already bumped to N+1.
  • In the next assemble step, when creating BlobIndex entries, the blob is still there, yay. (time of use)
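The steps above can be sketched over an in-memory table (names are hypothetical; in the real system the bump would be a single SQL UPSERT and the deletion a single `DELETE ... WHERE hash = ? AND epoch = ?`):

```python
chunks = {}  # chunk_hash -> epoch, standing in for a database table

def upsert_bump(chunk_hash):
    # Time of check: assembling bumps the epoch via UPSERT.
    chunks[chunk_hash] = chunks.get(chunk_hash, 0) + 1
    return chunks[chunk_hash]

def delete_if_epoch(chunk_hash, epoch):
    # The deletion job only deletes records whose epoch still matches
    # the one it was scheduled with.
    if chunks.get(chunk_hash) == epoch:
        del chunks[chunk_hash]
        return True
    return False

chunks["abc"] = 1
# Deletion is scheduled for epoch 1, but an assemble bumps it to 2 first:
upsert_bump("abc")
assert delete_if_epoch("abc", 1) is False  # deletion is a no-op
assert "abc" in chunks                     # the blob survives for the assemble
```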

Not sure if that complexity would be worth it, or whether we should just store blobs non-deduplicated.

Deletions would then be trivial, and would also work correctly for older files and blobs, since we would not have concurrent writes and deletes.

Swatinem · Jul 20 '23

So if I read this correctly, you would still want to chunk files, but not save those chunks deduplicated, thus avoiding the atomic reference counting problem?

I don't know. I think I would allow chunking as part of the system, but I would force each chunk to be associated with its offset. Honestly though, for most of the stuff we probably want to do, one huge chunk for the entirety of the file is preferable in practical terms.

mitsuhiko · Jul 20 '23

Can we reasonably get a histogram of chunk reuse? I would love to have some real data on what that reuse looks like. Maybe the "empty chunk" is shared a ton.
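One way such a histogram could be computed, sketched here over hypothetical BlobIndex-like rows of `(file_id, chunk_hash)` (sample data and column shape are made up): count how many files reference each chunk, then bucket chunks by that reference count.

```python
from collections import Counter

# Hypothetical BlobIndex rows: (file_id, chunk_hash).
rows = [
    (1, "empty"), (2, "empty"), (3, "empty"),  # e.g. the shared "empty chunk"
    (1, "aaa"), (2, "bbb"),
]

# How many files reference each chunk.
refs_per_chunk = Counter(chunk for _, chunk in rows)

# Histogram: reference count -> number of chunks with that count.
histogram = Counter(refs_per_chunk.values())

assert refs_per_chunk["empty"] == 3
assert histogram == {3: 1, 1: 2}
```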

Swatinem · Jul 20 '23