
Helper method for data deduplication in File Storage

Open garysassano opened this issue 6 months ago • 8 comments

It would be nice to have a helper method for Convex's File Storage that implements S3 conditional writes; that way it could skip storing the file and just return the existing storage ID instead.

garysassano avatar May 28 '25 00:05 garysassano

Unlike S3, when you upload a file to Convex file storage, you don't specify an object key. Rather, Convex assigns one for you.

E.g.:

    // Store the image in Convex
    const storageId: Id<"_storage"> = await ctx.storage.store(image);

So there's not really a case for multiple clients uploading to the same storage id.

Typically with file storage, as described in the docs (that you linked), you'd take the storage ID and store it in your own tables. There, you can take advantage of Convex's full serializability to order your mutations and guarantee that they run in a consistent sequential ordering.

I almost think of S3 conditional writes as a solution to a problem that Convex doesn't have.

What use case are you envisioning?

nipunn1313 avatar May 29 '25 03:05 nipunn1313

Right now, my main concern is actually that the same file might be uploaded multiple times with different IDs. It would be helpful to have a built-in method to check whether the image you're about to upload matches the SHA-256 hash of an existing image in File Storage, and if so, return the existing image's ID instead of uploading a duplicate.

garysassano avatar May 29 '25 17:05 garysassano

Have you tried to get that behavior from S3? I don't think it's possible with conditional writes. You're asking for something stronger: deduplication by content, across object keys.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/conditional-requests.html I don't think preconditions in S3 go across object keys to give the kind of deduplication you are referencing.

In Convex (or on S3 for that matter), you can implement the behavior you are going for by storing the sha256 of the object in your own table, adding an index on sha256, and then looking it up by sha256 prior to calling ctx.storage.store.

By uploading files via an HTTP action, you can put arbitrary logic, like the dedup logic you're suggesting, in that code: https://docs.convex.dev/file-storage/upload-files#uploading-files-via-an-http-action

nipunn1313 avatar May 29 '25 18:05 nipunn1313

> In Convex (or on S3 for that matter), you can implement the behavior you are going for by storing the sha256 of the object in your own table, adding an index on sha256, and then looking it up by sha256 prior to calling ctx.storage.store.

You don't need to calculate and store the SHA-256 hash of the object, as S3 does it for you automatically (see here).

All you need to do is:

  1. Set the ChecksumAlgorithm to SHA256 when calling the PutObjectCommand to upload a file.
  2. Call the GetObjectAttributesCommand before storing a new file, to check that a file with the same SHA-256 hash doesn't already exist.

The problem is that you aren't dealing directly with S3 in Convex, so I'd need some helper methods that perform these actions behind the scenes.

garysassano avatar May 29 '25 22:05 garysassano

Can you write some code or pseudocode that describes what you're trying to achieve with the calls to S3? I am not understanding how a call to GetObjectAttributes achieves what you are going for.

Neither Convex nor S3 provides a way to efficiently look up a file in a bucket by its content (which is what it sounds like you are hoping for).

S3 can efficiently ensure that, when modifying an existing file by object key (with PutObject), no upload is sent if the file is unchanged.

Convex does not have a way of modifying a file by storage_id, so the conditional write feature (afaict) does not make sense to implement here.
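To make that distinction concrete, here is a toy simulation of the two-step flow. The Map stands in for a bucket and every name is illustrative, not an AWS SDK call (a real version would use PutObjectCommand / GetObjectAttributesCommand from @aws-sdk/client-s3): an attributes lookup needs the key up front, so content-based dedup only works via a separate hash-to-key index you maintain yourself.

```typescript
import { createHash } from "node:crypto";

type StoredObject = { body: Uint8Array; sha256: string };

const bucket = new Map<string, StoredObject>(); // key -> object
const hashIndex = new Map<string, string>();    // sha256 -> key

// Analogous to PutObject with ChecksumAlgorithm: "SHA256".
function putObject(key: string, body: Uint8Array): void {
  const sha256 = createHash("sha256").update(body).digest("hex");
  bucket.set(key, { body, sha256 });
  hashIndex.set(sha256, key);
}

// Analogous to GetObjectAttributes: it requires the key up front,
// so it cannot answer "is there some object with this hash?".
function getObjectSha256(key: string): string | undefined {
  return bucket.get(key)?.sha256;
}

// Dedup is only possible through the separate hash -> key index.
function storeDeduped(body: Uint8Array): string {
  const sha256 = createHash("sha256").update(body).digest("hex");
  const existingKey = hashIndex.get(sha256);
  if (existingKey !== undefined) return existingKey; // reuse existing
  const key = `obj-${bucket.size}`;                  // mint a fresh key
  putObject(key, body);
  return key;
}
```

Calling storeDeduped twice with the same bytes returns the same key and stores one object; delete hashIndex and there is no efficient path from content back to a key, with or without conditional writes.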

nipunn1313 avatar May 30 '25 00:05 nipunn1313

In my last message, I wasn’t referring to conditional writes. I was simply suggesting the use of file metadata to prevent storing the same object multiple times. The goal is to have idempotency: if ctx.storage.store(image) is called repeatedly with the same image, it should store it the first time and return the existing file ID on subsequent calls.

garysassano avatar May 30 '25 00:05 garysassano

> In my last message, I wasn’t referring to conditional writes. I was simply suggesting the use of file metadata to prevent storing the same object multiple times. The goal is to have idempotency: if ctx.storage.store(image) is called repeatedly with the same image, it should store it the first time and return the existing file ID on subsequent calls.

Ah yes, I follow. I believe you are referring to object deduplication. I do not think this is built into S3; however, you can build it on top of S3. Correct me if I'm wrong (ideally by writing some code that exercises this with S3).

Similarly, it's not built into Convex, but you can build it on top of Convex fairly easily.

nipunn1313 avatar May 30 '25 01:05 nipunn1313

A few weeks ago I tried computing the SHA-256 of an object before uploading it and comparing it with the hash returned by ctx.db.system.query("_storage").collect() before attempting to store it in File Storage. For some reason, I remember it wasn't matching.
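One plausible cause of that kind of mismatch (an assumption on my part, not a diagnosis): the two sides reporting the same SHA-256 in different encodings. Storage APIs commonly report digests as base64 while most hashing examples produce hex, and a string comparison between the two always fails. A small helper to normalize before comparing:

```typescript
import { createHash } from "node:crypto";

// Convert a digest string (hex or base64) to lowercase hex.
// A 64-char string of hex digits must be hex: base64 of a 32-byte
// SHA-256 digest is 44 chars (with padding), never 64.
export function digestToHex(digest: string): string {
  if (/^[0-9a-fA-F]{64}$/.test(digest)) return digest.toLowerCase();
  return Buffer.from(digest, "base64").toString("hex");
}

// Compare two digests regardless of how each is encoded.
export function sameDigest(a: string, b: string): boolean {
  return digestToHex(a) === digestToHex(b);
}

// Hex SHA-256 of raw bytes, for the client side of the comparison.
export function sha256HexOf(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}
```

For example, the SHA-256 of empty input is "e3b0c442…" in hex but "47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=" in base64; sameDigest treats those as equal.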

garysassano avatar May 30 '25 10:05 garysassano