
Calculate and send the hash of the file on the client to avoid uploading something that already exists

valums opened this issue 12 years ago • 50 comments

Hey,

It would be really great to find a way to calculate and send the hash of the file from the client to avoid uploading something that already exists on the server.

If this is possible for large files, it could save a lot of time for some users. And it would feel amazing.

I guess JS with FileReader would be too slow as of now, but it's worth checking anyway. If you see any mentions of native hashing functions included in the newest version, please let me know.

Cheers, Andrew

valums avatar Jan 13 '13 22:01 valums

Seems possible, but only in browsers that support FileReader. We'll look into this more in a future release.

rnicholus avatar Nov 25 '13 19:11 rnicholus

As far as I know, hash calculation time is proportional to the size of the file to be hashed. There is nothing we can do about this, as far as I can tell.

Another concern is running out of browser memory when reading the file before calculating the hash. This, we can probably deal with by splitting files into small chunks and feeding them into an MD5 calculator, chunk by chunk, until we are done. Most likely, a library such as SparkMD5 would be imported into Fine Uploader to handle hash determination. SparkMD5 adds some useful features (such as the ability to calculate a hash incrementally, in chunks) onto the fairly well-known md5.js script written by Joseph Myers. Unless I completely misunderstand the MD5 algorithm and the SparkMD5 source, this should allow us to ensure that we do not exhaust browser memory while calculating a file's hash.

It may be useful to calculate the hash in a web worker as well, so we can free up the UI thread.
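For reference, a minimal sketch of that chunked approach with SparkMD5 (the helper name and the 2 MB read size are arbitrary, and SparkMD5 is assumed to be loaded on the page):

```js
// Incrementally hash a File/Blob without reading it all into memory at once.
// Assumes SparkMD5 (https://github.com/satazor/js-spark-md5) is available.
function hashFileInChunks(file, chunkSize) {
    chunkSize = chunkSize || 2 * 1024 * 1024; // 2 MB per read

    var spark = new SparkMD5.ArrayBuffer(),
        reader = new FileReader(),
        offset = 0;

    return new Promise(function (resolve, reject) {
        reader.onload = function (event) {
            spark.append(event.target.result); // feed this chunk into the running digest
            offset += chunkSize;

            if (offset < file.size) {
                readNext();
            }
            else {
                resolve(spark.end()); // hex MD5 of the whole file
            }
        };
        reader.onerror = reject;

        function readNext() {
            reader.readAsArrayBuffer(file.slice(offset, offset + chunkSize));
        }

        readNext();
    });
}
```

Since File objects can be passed to a web worker via postMessage, the same loop could also run off the UI thread.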

rnicholus avatar Nov 26 '13 05:11 rnicholus

related to #848

rnicholus avatar Feb 04 '15 20:02 rnicholus

+1: I desperately need this feature in my CMS as well. Previously I was using the JumpLoader Java applet for uploads, and it had this feature - it could send the hash of the overall file at the start of the upload, and once the server said "we don't have it, send it", it sent the file in chunks and also hashed each chunk during its request. I see this issue is about 3 years old - are there any plans to implement this feature? Thanks.

shehi avatar Nov 16 '15 15:11 shehi

@shehi Funny you mention this - I already had to implement file hashing client-side in order to support version 4 signatures in Fine Uploader S3 (see #1336 for details). However, that only hashes small chunks, not an entire file in one go. So, pushing this a bit further and calculating the hash of an entire file client-side is certainly doable, but I think more work is needed for larger files to prevent the UI thread from locking up.

In order to complete this feature, I expect to use V4 content body hashing as a model, but expand upon that code as follows:

  • [ ] Hash the file in a web worker wherever possible (see the sketch after this list). Where not possible, we'll need to consider a spinner or something similar.
  • [ ] Determine if there is a more efficient way to hash an entire file, especially one that is multiple GB.
  • [ ] Determine which hashing algorithm to use. I would expect to use the one with the least overhead.
  • [ ] Set up a new option that specifies an endpoint to send the hash to.
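For the web worker task, a rough sketch of what the worker side might look like (the file names, chunk size, and FileReaderSync usage are illustrative; SparkMD5 is assumed to be available to the worker):

```js
// worker-hash.js (hypothetical file name) - hashes a File off the UI thread.
importScripts('spark-md5.min.js'); // assumes SparkMD5 is served at this path

self.onmessage = function (e) {
    var file = e.data,
        chunkSize = 2 * 1024 * 1024,
        spark = new SparkMD5.ArrayBuffer(),
        reader = new FileReaderSync(), // synchronous reads are allowed inside workers
        offset = 0;

    while (offset < file.size) {
        spark.append(reader.readAsArrayBuffer(file.slice(offset, offset + chunkSize)));
        offset += chunkSize;
    }

    self.postMessage(spark.end()); // hex MD5 of the entire file
};
```

The main thread would create the worker with `new Worker('worker-hash.js')`, post the File object to it, and listen for the resulting digest in a `message` event; where workers aren't available, a FileReader-based loop on the main thread (with a spinner, as noted) would be the fallback.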

rnicholus avatar Nov 16 '15 15:11 rnicholus

I hate this. I need to practice programming in advanced JS and NodeJS :( So out of the loop here regarding the jargon.

I could actually work around this requirement with a less subtle tactic: you said you can hash small chunks. Given that, I could record the hash of the first 2-3 chunks in the database, and the next time someone uploads a file, their chunk hashes could be cross-checked against the database, along with the total file size. The same hash for the first 100 kilobytes, for a file with exactly the same size - I think those odds can't be beaten. What do you think? Of course, I also check file magic signatures for their real types, so that parameter is available for cross-checking as well. I use TrID for this purpose.

http://mark0.net/soft-trid-deflist.html
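Something along these lines, perhaps (I'm assuming SparkMD5 here, and the 100 KB head size is arbitrary):

```js
// Build a cheap fingerprint from the file size plus an MD5 of the first 100 KB.
// Assumes SparkMD5 is loaded on the page.
function fingerprint(file) {
    return new Promise(function (resolve, reject) {
        var reader = new FileReader();

        reader.onload = function (e) {
            resolve({
                size: file.size,                                     // cross-check against the DB record
                headHash: SparkMD5.ArrayBuffer.hash(e.target.result) // MD5 of the first 100 KB only
            });
        };
        reader.onerror = reject;

        reader.readAsArrayBuffer(file.slice(0, 100 * 1024));
    });
}
```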

shehi avatar Nov 16 '15 16:11 shehi

Small chunk hashing is restricted to Fine Uploader S3, and only when generating version 4 signatures. I feel that this feature needs more input from other users before I consider implementing it. Some questions I have include:

  • Should we hash the whole file, or only a piece, or n pieces/bytes?
  • What hashing algorithm should we use?

These are questions that should be answered based on user input, as I want to create a solution that is generally usable.

rnicholus avatar Nov 16 '15 16:11 rnicholus

Understandable. As I said, partial (as in a few chunks) hashing should do the trick. And since it's a small amount of data that needs hashing, who cares what algorithm is used? We are not hashing everything anyway - it will be quick. Hashing could be based either on a number of chunks or on a certain amount of bytes - in either case, the overall amount of data hashed should be limited for obvious performance reasons.

shehi avatar Nov 16 '15 18:11 shehi

As I said, partial (as in a few chunks) hashing should do the trick

I'm not convinced that this is the correct approach, and am apprehensive about codifying this as the solution to this problem in a future version of Fine Uploader.

since it's a small amount of data that needs hashing, who cares what algorithm is used

The server and client must use the same algorithm, otherwise the hashes will never match.

rnicholus avatar Nov 16 '15 18:11 rnicholus

It's either complete hashing, or it isn't. You have already been apprehensive about implementing the former, due to performance concerns along with security restrictions that certain browsers enforce. I don't think anyone would hesitate to do it otherwise.

The other remaining approach is something limited, partial. And in this scenario you have plenty of data to cross-check file identity against:

  • File size
  • File magic signature (first few bytes of binary files and of certain text files are always the same)
  • Hashes

The first two we can check easily, even on the server side. The latter can be achieved if you hash and record certain predetermined byte ranges of files and check against those records during subsequent uploads. I believe that with these 3 types of data, file identity can be determined accurately.

But there remains one problem: certain files, mostly media files, may have so-called "header information", metadata at the end of the file (please correct me if I am mistaken). Video files and image files with metadata are good examples (I would have to check and confirm the location of the metadata in those files, though - not sure). Two different files, even with the same type and magic signature, can also have the same trailing metadata bytes. That makes it hard to rely on this particular method.

No matter what you devise though, I believe a toggle-able, half-baked feature is always better than no feature. You can receive more input from the community if people toggle the half-baked feature on and experiment with it. Your call, of course. But like this, this issue will sit here for more years to come :)

shehi avatar Nov 16 '15 18:11 shehi

There are two features here that would be valuable to me. The first is simply to compute a checksum (MD5 would be fine) for each chunk and send it along with the chunk. This way I can detect corruption during the upload right away and request that that chunk be re-sent. The second is sending the whole-file checksum upon successful completion of a file upload, which would allow me to verify on the server that all the chunks made it into the right places in the file and give an end-to-end verification that everything was done correctly. Using SparkMD5, you could compute the overall checksum one chunk at a time while the file is uploading, so that very little extra time would be spent at the end.
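For example, with SparkMD5 both digests could share the same reads (a rough sketch; the helper name is arbitrary and chunks are assumed to be fed in upload order):

```js
// One long-lived SparkMD5 instance accumulates the whole-file digest,
// while each chunk also gets its own standalone digest.
// Assumes SparkMD5 is available and chunks are processed in order.
var wholeFile = new SparkMD5.ArrayBuffer();

function checksumChunk(chunkArrayBuffer) {
    wholeFile.append(chunkArrayBuffer);                  // running whole-file digest
    return SparkMD5.ArrayBuffer.hash(chunkArrayBuffer);  // per-chunk digest, sent with the chunk
}

// After the final chunk has been uploaded:
// var fileMd5 = wholeFile.end();  // send with the "upload complete" request
```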

khoran avatar Feb 19 '16 01:02 khoran

The per-chunk checksum is already being calculated to support S3 v4 signatures, though it's not being used anywhere else at the moment. If each chunk is hashed, there isn't a good reason to re-hash the entire file as well, since this will add considerable overhead to the process, especially with very large files. As long as you combine the chunks in the order specified, the file is fine.

rnicholus avatar Feb 23 '16 02:02 rnicholus

Is there currently a way to use the per chunk checksum when using a traditional server (not S3)? That would be valuable to me. You are correct, the overall file checksum is not strictly required, just a little extra paranoia. I'd be happy if I could use the per chunk checksum with a traditional server. Thanks.

khoran avatar Feb 23 '16 03:02 khoran

Is there currently a way to use the per chunk checksum when using a traditional server (not S3)?

No, but it likely wouldn't be very difficult to integrate this into the traditional endpoint uploader.

rnicholus avatar Feb 23 '16 04:02 rnicholus

This is something I'm looking into now. I don't see this being a feature implemented inside the Fine Uploader codebase. Instead, a couple of small changes will be needed so that this can be implemented using the existing Fine Uploader API and event system. I'll tackle this by making the required changes to Fine Uploader, and then I'll write up a small integration example (probably with updates to an existing server-side example as well) that will make it easy for anyone using Fine Uploader in their project to benefit from this feature. My plan is outlined below.

Duplicate file detection on upload

  • Usable with any project that uses Fine Uploader.
  • Big win for large files.
  • 2 possible ways to implement this, with "plan A" being the ideal.

For both plans, consider the following:

On my MacBook Pro, it takes 5 seconds to hash a 200 MB file in the browser. It will probably take less than a second to ask the server if that hash exists elsewhere. So, about 6 seconds. In either plan, a successful upload must include the client-side hash, which must be stored in the DB for future duplicate detection. If the 200 MB file is a duplicate and we uploaded it anyway, it would take 7 minutes to needlessly upload that same file on my home internet connection (which is quite fast). So, if the file is a duplicate, this 7 minute upload will be skipped entirely.

Also understand that changes to Fine Uploader will be minimal. The hashing and server communication is something that integrators will take on. I'll provide a simple example implementation as part of this issue.

Plan A

Start uploading the file immediately and start the hashing/duplicate detection process at the same time, then cancel the upload once the file has been found to be a duplicate. The time to hash and to ask the server to run a duplicate check does not adversely affect the upload time if the file turns out not to be a duplicate. The hypothesis here is that this is the ideal approach in terms of conserving user time.

Tasks:

  • [x] Update Fine Uploader to accept a "reason" for a cancel API call. A file canceled with a reason (such as "duplicate") will remain visible in Fine Uploader UI. Ideally the cancel message would be displayed as status in the upload card. The onCancel event will include the passed reason in this case, so other UI implementations (such as React Fine Uploader) can also keep the file representation visible, indicating the passed reason. These are the only changes to Fine Uploader.
  • [ ] Generate the hash of a large file by breaking it into chunks and hashing each chunk: Blob.slice, FileReader, ArrayBuffer, and SparkMD5 (see the sketch after this list). You must do this in your own project.
  • [ ] Send request to server endpoint w/ calculated hash. If duplicate, cancel upload w/ message to display on file card. You must do this in your own project.
  • [ ] After (if) upload completes, send a request to the server w/ the file hash. Server must save this hash with the file record for future duplicate file queries. You must do this in your own project.
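A rough illustration of how an integration along these lines might look. The /uploads and /hash-check endpoints, the hashFileInChunks helper (sketched earlier in this thread), and the reason argument to cancel() (the change proposed in the first task) are all assumptions, not existing Fine Uploader features:

```js
var uploader = new qq.FineUploader({
    element: document.getElementById('uploader'),
    request: { endpoint: '/uploads' },   // hypothetical upload endpoint
    chunking: { enabled: true },
    callbacks: {
        onUpload: function (id) {
            // The upload starts immediately; hashing and the duplicate check run alongside it.
            hashFileInChunks(uploader.getFile(id)).then(function (md5) {
                uploader.setParams({ md5: md5 }, id);    // server stores the hash on success

                return fetch('/hash-check?md5=' + md5)   // hypothetical duplicate-check endpoint
                    .then(function (resp) { return resp.json(); })
                    .then(function (result) {
                        if (result.duplicate) {
                            uploader.cancel(id, 'duplicate'); // proposed cancel-with-reason API
                        }
                    });
            });
        }
    }
});
```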

Plan B

Check for a duplicate first. Reject if duplicate, otherwise start the upload. Since the upload is delayed until hashing and duplicate detection are complete, this will add about 6 seconds to the 7-minute file upload.

Tasks:

  • [ ] Update Fine Uploader to optionally display rejected files. A duplicate file will be rejected in onSubmit, but we want the user to see the file in the upload UI anyway. Ideally the rejection message would be displayed as status in the upload card. These are the only changes to Fine Uploader.
  • [ ] Observe the onSubmit callback in Fine Uploader. At this point, check to see if the file is a duplicate (see the sketch after this list). You must do this in your own project.
  • [ ] Generate hash of a large file by breaking it into chunks and hashing each chunk: Blob.slice, FileReader, ArrayBuffer, and SparkMD5.
  • [ ] Send request to server endpoint w/ calculated hash. If duplicate, reject w/ message to display on file card. You must do this in your own project.
  • [ ] If file is not a duplicate, do not reject and include the hash as a parameter with the file. Server must store this hash alongside the file record for future duplicate file queries. You must do this in your own project.
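And a rough illustration of the Plan B flow, with the same assumptions as the Plan A sketch above; it relies on onSubmit honoring a returned promise, where a rejected promise prevents the file from being uploaded:

```js
var uploader = new qq.FineUploader({
    element: document.getElementById('uploader'),
    request: { endpoint: '/uploads' },   // hypothetical upload endpoint
    callbacks: {
        onSubmit: function (id) {
            // Block the upload until hashing and the duplicate check are done.
            return hashFileInChunks(uploader.getFile(id)).then(function (md5) {
                return fetch('/hash-check?md5=' + md5)   // hypothetical duplicate-check endpoint
                    .then(function (resp) { return resp.json(); })
                    .then(function (result) {
                        if (result.duplicate) {
                            return Promise.reject('duplicate'); // file is rejected, upload never starts
                        }
                        uploader.setParams({ md5: md5 }, id);   // not a duplicate: send hash with the file
                    });
            });
        }
    }
});
```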

rnicholus avatar Oct 02 '16 03:10 rnicholus

Indeed, Plan A sounds more reasonable.

shehi avatar Oct 02 '16 09:10 shehi

Wouldn't this allow anyone to add a file to their own uploads as long as the hash of any file on the server is known? I see how this is probably a bit of a stretch, since if you have the hash of a file you probably already have the file too, but it might be something to take into consideration.

stayallive avatar Oct 02 '16 19:10 stayallive

Wouldn't this allow anyone to add a file to their own uploads as long as the hash is known of any file on the server

I'm not sure I follow. This is simply a check to determine if a file exists on the server, given its hash. If it does exist, then the file is not uploaded. Can you explain the issue you are seeing a bit more?

rnicholus avatar Oct 02 '16 19:10 rnicholus

Well, if I upload file foo.docx with hash 123 (both examples, of course), and given there are multiple users, then another user could simply send 123 as a hash, faking an upload with that hash, and gain access to my foo.docx.

However, as I mentioned above, this might be an extremely low "risk", since if I know the hash of foo.docx I probably already have access to it. Dropbox uses similar techniques to optimize their storage, but generates the hashes server-side, which protects them against users spoofing hashes.

stayallive avatar Oct 02 '16 20:10 stayallive

Yea, Alex has a point. This being client-side tech, there are plenty of ways to spoof the data being sent. Nevertheless, we should have this feature for those who are willing to opt for it.

shehi avatar Oct 02 '16 20:10 shehi

It's definitely an awesome feature to have. It might be possible to upload a small block of the file (say, max. 1 MB) and also save that hash server-side, so that if the file is already on the server the client can prove it has the file by uploading a small portion that can be validated server-side. That might add too much complexity, but it makes the feature more usable for multi-user systems. Although I'm not a security expert, so this solution could be as insecure as the original.

stayallive avatar Oct 02 '16 20:10 stayallive

another user could simply send 123 as a hash, faking an upload with that hash, and gain access to my foo.docx.

How would they "gain access"? As I said before, if the file hash exists, then the file simply isn't uploaded. No one is provided access to anything.

Regardless, it's up to you to implement the feature however you want. This is really not a "feature" of Fine Uploader. It won't be baked into the library. My example will follow the plan described a few posts back.

rnicholus avatar Oct 02 '16 21:10 rnicholus

This being client-side tech, there are plenty of ways to spoof the data being sent.

A spoofed hash doesn't harm anyone other than the uploader, as their file simply won't be uploaded. At least, that's how I would implement this feature.

rnicholus avatar Oct 02 '16 21:10 rnicholus

Ah. I totally read that part wrong. I thought this would be part of FineUploader and totally missed that. Sorry about that :)

stayallive avatar Oct 02 '16 21:10 stayallive

The only planned change to Fine Uploader is that which is documented in "plan A" above (see the first "task"). The rest will be an integration which will be demonstrated as described in the same plan.

rnicholus avatar Oct 02 '16 21:10 rnicholus

Well, a spoofed hash cannot be detected by the server without the file, correct? So if I know the hash of a file on the server, I could link it to my user or something.

But this is ONLY the case in a multi-user environment or with shared storage. For a single repository of files this is of course irrelevant, since there are no access controls in place.

But all of this is off-topic for the described changes. Sorry for misreading.

stayallive avatar Oct 02 '16 21:10 stayallive

Well, a spoofed hash cannot be detected by the server without the file, correct? So if I know the hash of a file on the server, I could link it to my user or something.

Sorry, I'm really not following your logic at all. What does "link it to my user" mean? How would you do this?

rnicholus avatar Oct 02 '16 21:10 rnicholus

I suggest that, if you do end up implementing this feature into your own project, you not simply serve up files without checking for appropriate permissions. The file hashing feature described here isn't really relevant to this discussion of security, since it's not meant to be anything more than a hash comparison.

rnicholus avatar Oct 02 '16 21:10 rnicholus

Well, as I said, this is only a concern in a multi-user environment. Consider Dropbox, for example. They use file hashes to deduplicate the files uploaded to them on the filesystem. If those hashes are generated client-side, they could possibly be spoofed by a client claiming "I have a file with hash foo" without actually having the file (since it's all client-side, this could easily be done). The server finds the hash in the database, confirms it exists, completes the upload, and "links the file" on the server to the client's spoofed upload.

I thought this was laying the groundwork for implementing the client side of this mechanism, which could lead to the potentially insecure server implementations mentioned above. So yes, this is not a concern for the changes described above.

stayallive avatar Oct 02 '16 21:10 stayallive

Ah, I see - you had a very specific implementation in mind. But this is not something that any client-side library could ever prevent. The best defense against this is to ensure you never blindly serve sensitive resources. Instead, a server-side permission check of some sort is prudent.

rnicholus avatar Oct 02 '16 21:10 rnicholus