Helper method for data deduplication in File Storage
It would be nice to have a helper method for Convex's File Storage that implements S3 conditional writes; that way it could skip storing the file and just return the existing storage ID instead.
Unlike S3, when you upload a file to Convex file storage, you don't specify an object key. Rather, Convex assigns one for you.
E.g.:

```ts
// Store the image in Convex
const storageId: Id<"_storage"> = await ctx.storage.store(image);
```
So there's not really a case for multiple clients uploading to the same storage id.
Typically with file storage, as described in the docs (that you linked), you'd take the storage ID and store it in your own tables. There, you can take advantage of Convex's full serializability to order your mutations and guarantee that they run in a consistent sequential ordering.
I almost think of S3 conditional writes as a solution to a problem that Convex doesn't have.
What use case are you envisioning?
Right now, my main concern is actually that the same file might be uploaded multiple times with different IDs. It would be helpful to have a built-in method to check whether the image you're about to upload matches the SHA-256 hash of an existing image in File Storage, and if so, return the existing image's ID instead of uploading a duplicate.
Have you tried to get that behavior from S3? I don't think it's possible with conditional writes. You're asking for something stronger: deduplication across object keys.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/conditional-requests.html I don't think preconditions in S3 go across object keys to give the kind of deduplication you are referencing.
In Convex (or on S3 for that matter), you can implement the behavior you are going for by storing the sha256 of the object in your own table, adding an index on sha256, and then looking it up by sha256 prior to calling ctx.storage.store.
By uploading files via an HTTP action, you can put arbitrary logic - like the dedup logic you're suggesting - in that code: https://docs.convex.dev/file-storage/upload-files#uploading-files-via-an-http-action
> In Convex (or on S3 for that matter), you can implement the behavior you are going for by storing the sha256 of the object in your own table, adding an index on sha256, and then looking it up by sha256 prior to calling ctx.storage.store.
You don't need to calculate and store the SHA-256 hash of the object, as S3 does it for you automatically (see here).
All you need to do is:
- Set the `ChecksumAlgorithm` to `SHA256` when calling the `PutObjectCommand` for uploading a file.
- Call the `GetObjectAttributesCommand` before trying to store a new file to check that a file with the same SHA-256 hash doesn't exist already.
The problem is that you aren't dealing directly with S3 in Convex, so I'd need some helper methods that perform these actions behind the scenes.
Can you write some code or pseudocode that describes what you're trying to achieve with the calls to S3? I am not understanding how a call to GetObjectAttributes achieves what you are going for.
Neither Convex nor S3 provides a way to efficiently look up a file in a bucket by its content (which is what it sounds like you are hoping for).
S3 can efficiently ensure that, when modifying an existing file by object key (with PutObject), no upload is sent if the file has no changes.
Convex does not have a way of modifying a file by storage_id, so the conditional write feature (afaict) does not make sense to implement here.
In my last message, I wasn’t referring to conditional writes. I was simply suggesting the use of file metadata to prevent storing the same object multiple times. The goal is to have idempotency: if ctx.storage.store(image) is called repeatedly with the same image, it should store it the first time and return the existing file ID on subsequent calls.
> In my last message, I wasn’t referring to conditional writes. I was simply suggesting the use of file metadata to prevent storing the same object multiple times. The goal is to have idempotency: if `ctx.storage.store(image)` is called repeatedly with the same image, it should store it the first time and return the existing file ID on subsequent calls.
Ah yes, I follow. I believe you are referring to object deduplication. I do not think this is built into S3; however, you can build it on top of S3. Correct me if I'm wrong (ideally by writing some code that exercises this with S3).
Similarly, it's not built into Convex, but you can build it on top of Convex fairly easily.
A few weeks ago I tried computing the SHA-256 of an object before uploading it and comparing it against the hashes returned by ctx.db.system.query("_storage").collect() before attempting to store it in File Storage. For some reason, I remember they weren't matching.