Reuse existing file attachments
Problem
After PR #1877 is merged, please see if you can modify those changes so that uploading the same file to the chat multiple times doesn't result in duplicate uploads of that file.
Solution
- Implement a "forward" feature, that lets the user forward an existing message to others, by including the decryption key along with the forwarded message so that the file doesn't need to be re-uploaded.
- If possible, detect that the file had already been uploaded before (by using an HTTP HEAD check on the file's manifestCid), and reuse the existing information instead of reuploading the file.
Implement a "forward" feature, that lets the user forward an existing message to others, by including the decryption key along with the forwarded message so that the file doesn't need to be re-uploaded.
If this is the ultimate goal, it can already be implemented today and it does not require re-uploading files. A 'forward' feature would presumably work similarly to reply, except that it'd copy the message contents. Since files are not stored in the message itself (only a reference to them is, in the form of download parameters), forwarding a message that includes files doesn't require any re-uploading.
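A minimal sketch of what this could look like, assuming a message shape where attachments are stored as references (download parameters) rather than file contents. All type and field names here are hypothetical, not the project's actual schema:

```typescript
// Hypothetical message shape: attachments are references to already-uploaded
// files, so copying a message re-uses the existing uploads.
interface AttachmentRef {
  manifestCid: string                    // content identifier of the uploaded manifest
  downloadParams: Record<string, string> // everything needed to fetch and decrypt
}

interface ChatMessage {
  text: string
  attachments: AttachmentRef[]
}

// Forwarding copies the message contents, including the attachment
// references (and thus the decryption parameters); no re-upload happens.
function forwardMessage (original: ChatMessage): ChatMessage {
  return {
    text: original.text,
    attachments: original.attachments.map(a => ({ ...a }))
  }
}
```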
If possible, detect that the file had already been uploaded before (by using an HTTP HEAD check on the file's manifestCid), and reuse the existing information instead of reuploading the file.
This is difficult or even undesirable to do because of several factors.
- When an (encrypted) file is uploaded, it's preprocessed using random data to derive the file's encryption key and a nonce. These need to be random (at the very least, the nonce must not be re-used for any given encryption key). Because of this randomness, uploading an identical file repeatedly will result in entirely different content.
- There are also various degrees of freedom in the manifest itself. For one, it can contain arbitrary data, so you can't ensure that the manifest itself is unique. Since the file can be split into arbitrary chunks, each chunk will also have a different CID.
- Using a header or similar is somewhat incompatible with the purpose of using streams, because you need to keep the entire data in memory (or process the content twice on the client). This is because you don't know what the hash will be until after reading the file (and encrypting it, if you're using encryption). This said, it could have some usefulness to maybe just upload the data and have the server do this check (i.e., if a chunk already exists, it ignores it).
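The first point above can be illustrated with a generic AES-GCM example (this uses Node's `crypto` module for illustration and is not the project's actual encryption scheme): encrypting the same plaintext twice with fresh random parameters yields unrelated ciphertexts, so any hash or CID derived from them will differ.

```typescript
import { createCipheriv, createHash, randomBytes } from 'node:crypto'

// Encrypt a buffer with a freshly generated random key and nonce, as an
// upload pipeline deriving its parameters from random data would.
function encryptWithFreshParams (plaintext: Buffer): Buffer {
  const key = randomBytes(32)   // random 256-bit key
  const nonce = randomBytes(12) // random 96-bit nonce (never reused per key)
  const cipher = createCipheriv('aes-256-gcm', key, nonce)
  return Buffer.concat([cipher.update(plaintext), cipher.final(), cipher.getAuthTag()])
}

const file = Buffer.from('identical file contents')
const a = encryptWithFreshParams(file)
const b = encryptWithFreshParams(file)

// Same plaintext, but the ciphertexts (and any content hash or CID
// computed from them) come out different:
const hashA = createHash('sha256').update(a).digest('hex')
const hashB = createHash('sha256').update(b).digest('hex')
```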
Now, there are a few ways this could work, but all seem rather limited.
One is making the 'random' data deterministic. However, the only approach that makes sense here is to derive it from the file itself (plus the encryption key that'll be used in a chatroom). Doing this poses two challenges:
- Using the file as input again requires either loading the entire file in memory or processing it twice client-side.
- Using contract keys as input limits the window of usefulness of this feature to the most recent key rotation for this key (using contract keys could perhaps be avoided, but it leaks information about files).
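A sketch of what deterministic derivation could look like, in the style of convergent encryption: HMAC the file contents with a contract-level secret and split the digest into key and nonce. The function and parameter names are hypothetical. Note that because the nonce is a pure function of the file and the secret, it is only ever repeated together with the same key, so nonce reuse stays safe; but anyone holding the secret can test whether a ciphertext corresponds to a known file, and a full pass over the file is still needed before encryption can start.

```typescript
import { createHmac } from 'node:crypto'

// Hypothetical deterministic derivation: HMAC the file contents with a
// contract-level secret, then split the digest into key and nonce.
// Identical files under the same secret yield identical parameters, so
// the resulting ciphertext (and its CID) would also be identical.
function deriveFileParams (fileContents: Buffer, contractSecret: Buffer) {
  const digest = createHmac('sha512', contractSecret).update(fileContents).digest()
  return {
    key: digest.subarray(0, 32),   // 256-bit encryption key
    nonce: digest.subarray(32, 44) // 96-bit nonce
  }
}
```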
An additional concern with not reuploading files (or chunks) is that it leaks some metadata about the contents of encrypted files. It could be that this is an acceptable trade-off.
The other way is for clients to keep track of files uploaded and re-use existing files. The downside of this is that clients need to keep this information in the state somewhere, and that they can only do it for files that they've downloaded.
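Such client-side tracking could be as simple as a map from a plaintext hash to the existing upload reference, kept per contract. A sketch with hypothetical names:

```typescript
import { createHash } from 'node:crypto'

// Hypothetical per-contract cache of previously uploaded files, keyed by
// a hash of the plaintext. Only files the client has actually handled
// (uploaded or downloaded) can ever be found here.
class UploadCache {
  private byHash = new Map<string, string>() // plaintext hash -> manifestCid

  private hashOf (contents: Buffer): string {
    return createHash('sha256').update(contents).digest('hex')
  }

  remember (contents: Buffer, manifestCid: string): void {
    this.byHash.set(this.hashOf(contents), manifestCid)
  }

  // Returns the existing manifestCid if this exact file was seen before.
  lookup (contents: Buffer): string | undefined {
    return this.byHash.get(this.hashOf(contents))
  }
}
```

Note that hashing the plaintext still requires holding the whole file in memory or streaming it once before upload, the same limitation as above.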
In any case, because files are meant to be associated with a contract for billing purposes, this makes most sense to implement by at least partitioning re-uploads per contract. I'm not sure what the odds are of exact file duplication being frequent enough for this to be an issue. For example, if users are sharing GIFs or short videos, it's unlikely that the files will be byte-for-byte identical (which is the only case that using hashes the way we do will address), as they may have different provenance, with different encodings and other artifacts.
Nice, thanks for laying that out @corrideat and updating the comment with more detail. For now then, we can focus this issue on implementing a forward feature only, and perhaps come back to the idea of detecting existing files some other day.