fine-uploader
Calculate and send the hash of the file on the client to avoid uploading something that already exists
Hey,
It would be really great to find a way to calculate and send the hash of the file from the client to avoid uploading something that already exists on the server.
If this is possible for large files, it could save a lot of time for some users. And it would feel amazing.
I guess JS with FileReader would be too slow as of now, but it's worth checking anyway. If you see any mentions of native hashing functions included in the newest version, please let me know.
Cheers, Andrew
Seems possible, but only in browsers that support FileReader. We'll look into this more in a future release.
As far as I know, hash calculation time is proportional to the size of the file to be hashed. There is nothing we can do about this, as far as I can tell.
Another concern is running out of browser memory when reading the file before calculating the hash. This, we can probably deal with by splitting files into small chunks and feeding them into an MD5 calculator, chunk by chunk, until we are done. Most likely, a library such as SparkMD5 would be imported into Fine Uploader to handle hash determination. SparkMD5 adds some useful features (such as the ability to calculate a hash in chunks) onto the fairly well-known md5.js script written by Joseph Myers. Unless I completely misunderstand the MD5 algorithm and the SparkMD5 source, this should allow us to ensure that we do not exhaust browser memory while calculating a file's hash.
It may be useful to calculate the hash in a web worker as well, so we can free up the UI thread.
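For reference, a minimal sketch of that chunk-by-chunk approach, using SparkMD5's incremental `ArrayBuffer` API together with `Blob.slice` and `FileReader` (the 2 MB chunk size and the `hashFile` name are arbitrary choices for illustration, not part of Fine Uploader):

```js
// Sketch: incrementally MD5 a File without reading it all into memory.
// Assumes SparkMD5 is loaded (e.g. via <script src="spark-md5.min.js">).
function hashFile(file, chunkSize) {
  chunkSize = chunkSize || 2 * 1024 * 1024; // 2 MB slices (arbitrary choice)

  return new Promise(function (resolve, reject) {
    var spark = new SparkMD5.ArrayBuffer();
    var reader = new FileReader();
    var offset = 0;

    reader.onload = function (event) {
      spark.append(event.target.result); // feed this slice into the digest
      offset += chunkSize;

      if (offset < file.size) {
        readNextSlice(); // keep going until the whole file has been consumed
      } else {
        resolve(spark.end()); // hex-encoded MD5 of the entire file
      }
    };

    reader.onerror = reject;

    function readNextSlice() {
      reader.readAsArrayBuffer(file.slice(offset, offset + chunkSize));
    }

    readNextSlice();
  });
}
```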
related to #848
+1: I desperately need this feature in my CMS as well. Previously I was using the JumpLoader Java applet for uploads, and it had this feature - it could send the hash of the overall file at the start of the upload, and once the server said "we don't have it, send it", it sent the file in chunks and hashed each chunk as part of every chunk request. I see this issue is about 3 years old - are there any plans to implement this feature, please? Thanks.
@shehi Funny you mention this - I already had to implement file hashing client-side in order to support version 4 signatures in Fine Uploader S3 (see #1336 for details). However, that only hashes small chunks, not an entire file in one go. So, pushing this a bit further and calculating the hash of an entire file client-side is certainly doable, but I think more work is needed for larger files to prevent the UI thread from locking up.
In order to complete this feature, I expect to use V4 content body hashing as a model, but expand upon that code as follows:
- [ ] Hash the file in a web worker wherever possible (see the worker sketch after this list). Where not possible, we'll need to consider a spinner or something similar.
- [ ] Determine if there is a more efficient way to hash an entire file, especially one that is multiple GB.
- [ ] Determine which hashing algorithm to use. I would expect to use the one with least overhead.
- [ ] Set up a new option that includes an endpoint to send the hash to.
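To illustrate the first task above, here is a rough sketch of moving the hashing off the UI thread with a dedicated worker. The `hash-worker.js` file name and the message shape are made up for this example, and SparkMD5 is assumed to be available to the worker via `importScripts`:

```js
// hash-worker.js -- runs off the UI thread (hypothetical file name)
importScripts('spark-md5.min.js');

self.onmessage = function (event) {
  var file = event.data.file;        // File/Blob objects can be posted to workers
  var chunkSize = 2 * 1024 * 1024;
  var spark = new SparkMD5.ArrayBuffer();
  var reader = new FileReaderSync(); // synchronous reads are allowed in workers
  var offset = 0;

  while (offset < file.size) {
    spark.append(reader.readAsArrayBuffer(file.slice(offset, offset + chunkSize)));
    offset += chunkSize;
  }

  self.postMessage({ hash: spark.end() });
};
```

```js
// main thread: ask the worker for the hash without blocking the UI
var worker = new Worker('hash-worker.js');
worker.onmessage = function (event) {
  console.log('MD5:', event.data.hash);
};
worker.postMessage({ file: someFile }); // someFile is a File from an <input>, drag & drop, etc.
```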
I hate this. I need to practice programming in advanced JS and NodeJS :( So out of the loop here regarding the jargon.
I could actually work around this requirement with a less subtle tactic: you said you can hash small chunks. Given that, I could record the hash of the first 2-3 chunks in the database, and the next time someone uploads a file, their chunk hashes could be cross-checked against the database, along with the matching total file size. The same hash for the first 100 kilobytes, for a file with exactly the same size - I think those odds can't be beaten. What do you think? Of course, I also check file magic signatures for their real types, so that parameter is available for cross-checking as well. I use TrID for this purpose.
http://mark0.net/soft-trid-deflist.html
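For what it's worth, a rough sketch of that fingerprint idea (hash of the first few hundred kilobytes plus the total size as the lookup key); the prefix length, the `fingerprintFile` name, and the duplicate-check endpoint are invented for illustration:

```js
// Sketch: fingerprint = MD5 of the first N bytes + total file size.
// Assumes SparkMD5 is loaded; the endpoint name is hypothetical.
function fingerprintFile(file, prefixBytes) {
  prefixBytes = prefixBytes || 300 * 1024; // e.g. first ~300 KB (2-3 chunks)

  return new Promise(function (resolve, reject) {
    var reader = new FileReader();
    reader.onload = function (event) {
      resolve({
        prefixHash: SparkMD5.ArrayBuffer.hash(event.target.result),
        size: file.size
      });
    };
    reader.onerror = reject;
    reader.readAsArrayBuffer(file.slice(0, prefixBytes));
  });
}

// Usage: ask the server whether this (prefixHash, size) pair is already known.
fingerprintFile(file).then(function (fp) {         // file is a File from an <input>, etc.
  return fetch('/uploads/duplicate-check', {       // hypothetical endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(fp)
  });
});
```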
Small chunk hashing is restricted to Fine Uploader S3, and only when generating version 4 signatures. I feel that this feature needs more input from other users before I consider implementing it. Some questions I have include:
- Should we hash the whole file, or only a piece, or n pieces/bytes?
- What hashing algorithm should we use?
These are questions that should be answered based on user input, as I want to create a solution that is generally usable.
Understandable. As I said, partial (as in few chunks) hashing should do the trick. And since it's a small amount of data that needs hashing, who cares what algorithm is used? We are not hashing everything anyway, so it will be quick. Hashing could be based either on a number of chunks or on a certain number of bytes - in either case, the overall amount of data hashed should be limited for obvious performance reasons.
As I said, partial (as in few chunks) hashing should do the trick
I'm not convinced that this is the correct approach, and am apprehensive about codifying this as the solution to this problem in a future version of Fine Uploader.
since it's a small amount of data that needs hashing, who cares what algorithm is used
The server and client must use the same algorithm, otherwise the hashes will never match.
It's either complete hashing, or it isn't. You have already been apprehensive about implementing the former, due to performance concerns along with some security restrictions that certain browsers enforce. I don't think anyone would hesitate to do it otherwise.
The remaining approach is something limited and partial. And in this scenario you have plenty of data to cross-check file identity against:
- File size
- File magic signature (first few bytes of binary files and of certain text files are always the same)
- Hashes
The first two we can check easily, even server side. The latter can be achieved if you hash and record certain predetermined byte ranges of files and check against those records during subsequent uploads. I believe that with these 3 types of data, file identity can be determined accurately.
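For illustration, the magic-signature part of that check could also be done client side by reading the first few bytes into a `Uint8Array`; the two signatures below are just well-known examples, and a real implementation would use a fuller table such as TrID's:

```js
// Sketch: read the first few bytes of a file and compare against known signatures.
function readMagicBytes(file, count) {
  return new Promise(function (resolve, reject) {
    var reader = new FileReader();
    reader.onload = function (event) {
      resolve(new Uint8Array(event.target.result));
    };
    reader.onerror = reject;
    reader.readAsArrayBuffer(file.slice(0, count || 8));
  });
}

// Two well-known signatures, for illustration only.
var SIGNATURES = {
  'image/png': [0x89, 0x50, 0x4e, 0x47],       // \x89PNG
  'application/pdf': [0x25, 0x50, 0x44, 0x46]  // %PDF
};

function matchesSignature(bytes, signature) {
  return signature.every(function (b, i) { return bytes[i] === b; });
}
```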
But there remains one problem: certain files, mostly media files, may have so-called "header information", metadata at the end of the file (please correct me if I am mistaken). Video files and image files with metadata are good examples (I still have to check and ascertain the location of metadata in those files, though). Two different files, even with the same type and magic signature, can also have the same trailing metadata bytes. That makes it hard to rely on this particular method.
No matter what you devise, though, I believe a toggle-able imperfect feature is always better than no feature. You can receive more input from the community if people toggle a half-baked feature on and experiment with it. Your call, of course. But like this, this issue will sit here for more years to come :)
There are two features here that would be valuable to me. The first is just to compute a checksum (MD5 would be fine) for each chunk and send it along with the chunk. This way I can detect corruption during the upload right away and request that that chunk be re-sent. The second is sending the whole-file checksum upon successful completion of a file upload, which would allow me to verify on the server that all the chunks made it into the right places in the file and give an end-to-end verification that everything was done correctly. Using SparkMD5, you could compute the overall checksum one chunk at a time while the file is uploading, so that very little extra time would be spent at the end.
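A rough sketch of that idea, computing a per-chunk digest and folding the same bytes into a running whole-file digest in a single pass with SparkMD5 (the helper names and chunk size are invented for illustration):

```js
// Sketch: per-chunk MD5 plus a running whole-file MD5, computed in one pass.
// Assumes SparkMD5 is loaded. Resolves to { chunkHashes: [...], fileHash: '...' }.
function hashChunks(file, chunkSize) {
  chunkSize = chunkSize || 5 * 1024 * 1024;
  var overall = new SparkMD5.ArrayBuffer();
  var chunkHashes = [];
  var offset = 0;

  function nextChunk() {
    if (offset >= file.size) {
      return Promise.resolve({ chunkHashes: chunkHashes, fileHash: overall.end() });
    }
    return readSlice(file.slice(offset, offset + chunkSize)).then(function (buffer) {
      chunkHashes.push(SparkMD5.ArrayBuffer.hash(buffer)); // digest of this chunk only
      overall.append(buffer);                              // and fold it into the file digest
      offset += chunkSize;
      return nextChunk();
    });
  }

  return nextChunk();
}

function readSlice(blob) {
  return new Promise(function (resolve, reject) {
    var reader = new FileReader();
    reader.onload = function (e) { resolve(e.target.result); };
    reader.onerror = reject;
    reader.readAsArrayBuffer(blob);
  });
}
```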
The per-chunk checksum is already being calculated to support S3 v4 signatures, though it's not being used anywhere else at the moment. If each chunk is hashed, there isn't a good reason to re-hash the entire file as well, since this will add considerable overhead to the process, especially with very large files. As long as you combine the chunks in the order specified, the file is fine.
Is there currently a way to use the per chunk checksum when using a traditional server (not S3)? That would be valuable to me. You are correct, the overall file checksum is not strictly required, just a little extra paranoia. I'd be happy if I could use the per chunk checksum with a traditional server. Thanks.
Is there currently a way to use the per chunk checksum when using a traditional server (not S3)?
No, but it likely wouldn't be very difficult to integrate this into the traditional endpoint uploader.
This is something I'm looking into now. I don't see this being a feature implemented inside the Fine Uploader codebase. Instead, a couple of small changes will be needed so that this can be accomplished with a bit of integration code built on the existing Fine Uploader API and event system. I'll tackle this by making the required changes to Fine Uploader, and then I'll write up a small integration example (probably with updates to an existing server-side example as well) that will make it easy for anyone using Fine Uploader in their project to benefit from this feature. My plan is outlined below.
Duplicate file detection on upload
- Usable with any project that uses Fine Uploader.
- Big win for large files.
- 2 possible ways to implement this, with "plan A" being the ideal.
For both plans, consider the following:
On my MacBook Pro, it takes 5 seconds to hash a 200 MB file in the browser. It will probably take less than a second to ask the server if that hash exists elsewhere. So, about 6 seconds total. In either plan, a successful upload must include the client-side hash, which must be stored in the DB for future duplicate detection. If the 200 MB file is a duplicate and we uploaded it anyway, it would take 7 minutes to needlessly upload that same file on my home internet connection (which is quite fast). So, if the file is a duplicate, this 7-minute upload will be skipped entirely.
Also understand that changes to Fine Uploader will be minimal. The hashing and server communication is something that integrators will take on. I'll provide a simple example implementation as part of this issue.
Plan A
Start uploading the file immediately and start the hashing/duplicate detection process at the same time. Then cancel the upload once the file has been found to be a duplicate. The time to hash and ask the server to run a duplicate check does not adversely affect the upload time in the case where the file is not a duplicate. The hypothesis here is that this is the ideal approach in terms of conserving user time.
Tasks:
- [x] Update Fine Uploader to accept a "reason" for a `cancel` API call. A file canceled with a reason (such as "duplicate") will remain visible in Fine Uploader UI. Ideally the cancel message would be displayed as status in the upload card. The `onCancel` event will include the passed reason in this case, so other UI implementations (such as React Fine Uploader) can also keep the file representation visible, indicating the passed reason. These are the only changes to Fine Uploader.
- [ ] Generate hash of a large file by breaking it into chunks and hashing each chunk: `Blob.slice`, `FileReader`, `ArrayBuffer`, and SparkMD5. You must do this in your own project.
- [ ] Send request to server endpoint w/ calculated hash. If duplicate, cancel upload w/ message to display on file card. You must do this in your own project.
- [ ] After (if) upload completes, send a request to the server w/ the file hash. Server must save this hash with the file record for future duplicate file queries. You must do this in your own project.
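For illustration, a rough sketch of how a "plan A" integration might look, using Fine Uploader's `onSubmitted` callback and the `getFile`, `setParams`, and `cancel` API methods. The `/uploads/duplicate-check` endpoint and the `hashFile` helper (the chunked hashing sketch shown earlier) are hypothetical, and the "reason" argument to `cancel` is the proposed change above, not something that exists today:

```js
// Sketch of "plan A": start the upload immediately, hash in parallel,
// and cancel if the server reports a duplicate.
var uploader = new qq.FineUploader({
  element: document.getElementById('uploader'),
  request: { endpoint: '/uploads' }, // your traditional upload endpoint

  callbacks: {
    onSubmitted: function (id, name) {
      var file = uploader.getFile(id);

      hashFile(file).then(function (hash) {
        // Remember the hash so it can be sent to the server with the upload.
        uploader.setParams({ fileHash: hash }, id);

        return fetch('/uploads/duplicate-check?hash=' + hash) // hypothetical endpoint
          .then(function (response) { return response.json(); })
          .then(function (result) {
            if (result.duplicate) {
              // Once the proposed "reason" parameter exists, it could be passed here
              // so the file card stays visible with a "duplicate" status.
              uploader.cancel(id);
            }
          });
      });
    }
  }
});
```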
Plan B
Check for a duplicate first. Reject if it is a duplicate, otherwise start the upload. Since the upload is delayed until hashing and duplicate detection are complete, this will add 6 seconds to the 7-minute file upload.
Tasks:
- [ ] Update Fine Uploader to optionally display rejected files. A duplicate file will be rejected in `onSubmit`, but we want the user to see the file in the upload UI anyway. Ideally the rejection message would be displayed as status in the upload card. These are the only changes to Fine Uploader.
- [ ] Observe the `onSubmit` callback in Fine Uploader. At this point, check to see if the file is a duplicate. You must do this in your own project.
- [ ] Generate hash of a large file by breaking it into chunks and hashing each chunk: `Blob.slice`, `FileReader`, `ArrayBuffer`, and SparkMD5.
- [ ] Send request to server endpoint w/ calculated hash. If duplicate, reject w/ message to display on file card. You must do this in your own project.
- [ ] If file is not a duplicate, do not reject and include the hash as a parameter with the file. Server must store this hash alongside the file record for future duplicate file queries. You must do this in your own project.
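And a corresponding sketch for "plan B", assuming `onSubmit` can defer the decision by returning a promise (rejecting it keeps the file from being uploaded); as before, the duplicate-check endpoint and the `hashFile` helper are hypothetical:

```js
// Sketch of "plan B": hash and check for a duplicate *before* the upload starts.
// Assumes onSubmit may defer the upload by returning a promise.
var uploader = new qq.FineUploader({
  element: document.getElementById('uploader'),
  request: { endpoint: '/uploads' },

  callbacks: {
    onSubmit: function (id, name) {
      return hashFile(uploader.getFile(id)).then(function (hash) {
        return fetch('/uploads/duplicate-check?hash=' + hash) // hypothetical endpoint
          .then(function (response) { return response.json(); })
          .then(function (result) {
            if (result.duplicate) {
              // Rejecting prevents the upload; a future Fine Uploader change could
              // keep the card visible with a "duplicate" message instead.
              return Promise.reject('duplicate');
            }
            // Not a duplicate: attach the hash so the server can store it.
            uploader.setParams({ fileHash: hash }, id);
          });
      });
    }
  }
});
```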
Indeed, Plan A sounds more reasonable.
Wouldn't this allow anyone to add a file to their own uploads as long as they know the hash of any file on the server? I see how this is probably a bit of a stretch, since if you have the hash of a file you probably already have the file too, but it might be something to take into consideration.
Wouldn't this allow anyone to add a file to their own uploads as long as they know the hash of any file on the server
I'm not sure I follow. This is simply a check to determine if a file exists on the server, given its hash. If it does exist, then the file is not uploaded. Can you explain the issue you are seeing a bit more?
Well, if I upload a file `foo.docx` with hash `123` (both examples, of course), and given there are multiple users, then another user could simply send `123` as a hash, faking an upload with that hash, and gain access to my `foo.docx`.
However, as I mentioned above, this might be an extremely low "risk", since if I know the hash of `foo.docx` I probably already have access to it. Dropbox uses similar techniques to optimize their storage, but generates hashes server side, making them secure against users spoofing hashes.
Yea, Alex has a point. This being client-side tech, there are plenty of ways to spoof the data being sent. Nevertheless, we should have this feature for those who are willing to opt for it.
It's definitely an awesome feature to have. It might be possible to upload a small block of the file (say, max. 1 MB) and also save that hash server side, so that if the file is already on the server the client can prove it has the file by uploading a small portion that can be validated server side. That might add too much complexity, but it makes the feature more usable for multi-user systems. Although I'm not a security expert, so that solution could be as insecure as the original.
another user could simply send `123` as a hash, faking an upload with that hash, and gain access to my `foo.docx`.
How would they "gain access"? As I said before, if the file hash exists, then the file simply isn't uploaded. No one is provided access to anything.
Regardless, it's up to you to implement the feature however you want. This is really not a "feature" of Fine Uploader. It won't be baked into the library. My example will follow the plan described a few posts back.
This being client-side tech, there are plenty of ways to spoof the data being sent.
A spoofed hash doesn't harm anyone other than the uploader, as their file simply won't be uploaded. At least, that's how I would implement this feature.
Ah. I totally read that part wrong. I thought this would be part of FineUploader and totally missed that. Sorry about that :)
The only planned change to Fine Uploader is that which is documented in "plan A" above (see the first "task"). The rest will be an integration which will be demonstrated as described in the same plan.
Well, a spoofed hash cannot be detected by the server without the file, correct? So if I know the hash of a file on the server I could link it to my user or something.
But this is ONLY the case in a multi-user environment or with shared storage. For a single repository of files this is of course irrelevant, since there are no access controls in place.
But all of this is off-topic for the described changes. Sorry for misreading.
Well, a spoofed hash cannot be detected by the server without the file, correct? So if I know the hash of a file on the server I could link it to my user or something.
Sorry, I'm really not following your logic at all. What does "link it to my user" mean? How would you do this?
I suggest that, if you do end up implementing this feature in your own project, you not simply serve up files without checking for appropriate permissions. The file hashing feature described here isn't really relevant to this discussion of security, since it's not meant to be anything more than a hash comparison.
Well, as I said, this is only a concern in a multi-user environment. Consider Dropbox, for example. They use file hashes to deduplicate the files uploaded to them on the filesystem. If those hashes are generated client side, they could possibly be spoofed by a client claiming "I have a file with hash `foo`" without actually having the file (since it's all client side, this could easily be done). The server finds the hash in the database, confirms it exists, completes the upload, and "links the file" on the server to the spoofed upload from the client.
I thought this was laying the groundwork for implementing the client side of a mechanism that could lead to the potentially insecure server implementations mentioned above. So yes, this is not a concern for the changes described above.
Ah, I see, you had a very specific implementation in mind. But this is not something that any client-side library could ever prevent. The best defense against this is to ensure you never blindly serve sensitive resources. Instead, a server-side permission check of some sort is prudent.