fiche
Dedupe?
Seems reasonable to take a sha256 of each submission and keep track of which checksum maps to which upload. That way multiple uploads of the same content get the same URL.
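A minimal sketch of that idea, assuming OpenSSL is linked in for the hashing (fiche does not currently depend on it); sha256_hex is just an illustrative helper name:

    /* Illustrative helper: hex SHA-256 digest of an upload buffer,
     * usable as the key in a checksum -> URL lookup. Assumes OpenSSL. */
    #include <openssl/sha.h>
    #include <stdio.h>

    static void sha256_hex(const unsigned char *data, size_t len, char out[65])
    {
        unsigned char digest[SHA256_DIGEST_LENGTH];
        SHA256(data, len, digest);                     /* one-shot SHA-256 */
        for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
            sprintf(out + 2 * i, "%02x", digest[i]);   /* hex-encode */
        out[64] = '\0';
    }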
That's a cool idea, we definitely should implement this.
I was at first thinking about using a database to implement this, but it adds quite a bit of complexity to a beautifully simple program.
The 'simple' solution is using the hash itself as a file name, which would be quite a bit more confusing than the generated slugs.
Another solution is having a separate file for each sha256sum with the real slug stored inside. This, however, means twice as many files and a ton of block waste (4k for 6 bytes of real data with the default slug length).
Perhaps a simple constant database would be the best bet for handling this kind of mapping.
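For concreteness, a rough sketch of that second option (map_dir and write_mapping are made-up names for illustration):

    /* Sketch of the "one file per checksum" option: a file named after the
     * sha256 hex digest whose only content is the real slug. Each mapping
     * then costs a whole filesystem block (typically 4K) for ~6 bytes. */
    #include <stdio.h>

    static int write_mapping(const char *map_dir, const char *checksum_hex,
                             const char *slug)
    {
        char path[512];
        snprintf(path, sizeof path, "%s/%s", map_dir, checksum_hex);

        FILE *f = fopen(path, "w");
        if (f == NULL)
            return -1;
        fputs(slug, f);     /* e.g. "a1b2c3" with the default slug length */
        fclose(f);
        return 0;
    }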
On 12/18/2016 12:15 AM, Tom the Dragon wrote:
> I was at first thinking about using a database to implement this, but it adds quite a bit of complexity to a beautifully simple program.
> The 'simple' solution is using the hash itself as a file name, which would be quite a bit more confusing than the generated slugs.
Confusing how? It's generally known as CAS (Content Addressable Storage). It's a common way to provide both deduplication and a way to verify what you read is what was written.
> Another solution is having a separate file for each sha256sum with the real slug stored inside. This, however, means twice as many files and a ton of block waste (4k for 6 bytes of real data with the default slug length).
You already create a file and directory for each upload. Tracking a slug -> checksum mapping in sqlite or even just a simple key/value store doesn't seem that much more complicated.
Each upload would (see the sketch after this list):
- stream to a temp file and stream through the checksum
- if the checksum already exists, delete the temp file
- if it doesn't exist, mv the temp file to slug/

Download would: check the DB/key-value store for the slug; if it's there, look up the checksum; if the file exists, read/return it.
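A sketch of that upload path, assuming OpenSSL's SHA256_* streaming API and two hypothetical helpers, lookup_slug() and store_mapping(), standing in for whichever checksum -> slug store (sqlite, cdb, flat files) ends up being used:

    /* Sketch of the dedup upload path described above. Error handling and
     * directory creation are omitted; lookup_slug()/store_mapping() are
     * hypothetical wrappers around whatever mapping store is chosen. */
    #include <openssl/sha.h>
    #include <stdio.h>
    #include <unistd.h>

    extern const char *lookup_slug(const char *checksum_hex);              /* hypothetical */
    extern void store_mapping(const char *checksum_hex, const char *slug); /* hypothetical */

    static const char *handle_upload(int client_fd, const char *output_dir,
                                     const char *new_slug)
    {
        char tmp_path[512];
        snprintf(tmp_path, sizeof tmp_path, "%s/.upload.%d", output_dir, getpid());

        FILE *tmp = fopen(tmp_path, "wb");
        if (tmp == NULL)
            return NULL;

        /* Stream to a temp file and through the checksum at the same time. */
        SHA256_CTX ctx;
        SHA256_Init(&ctx);
        unsigned char buf[4096];
        ssize_t n;
        while ((n = read(client_fd, buf, sizeof buf)) > 0) {
            SHA256_Update(&ctx, buf, (size_t)n);
            fwrite(buf, 1, (size_t)n, tmp);
        }
        fclose(tmp);

        unsigned char digest[SHA256_DIGEST_LENGTH];
        SHA256_Final(digest, &ctx);

        char hex[2 * SHA256_DIGEST_LENGTH + 1];
        for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
            sprintf(hex + 2 * i, "%02x", digest[i]);
        hex[2 * SHA256_DIGEST_LENGTH] = '\0';

        /* Seen this content before: drop the temp file, reuse the old slug. */
        const char *existing = lookup_slug(hex);
        if (existing != NULL) {
            unlink(tmp_path);
            return existing;
        }

        /* New content: move the temp file into the per-slug file fiche
         * already writes (the slug directory is assumed to exist). */
        char final_path[512];
        snprintf(final_path, sizeof final_path, "%s/%s/index.txt", output_dir, new_slug);
        rename(tmp_path, final_path);
        store_mapping(hex, new_slug);
        return new_slug;
    }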
> Perhaps a simple constant database would be the best bet for handling this kind of mapping.
Hard to say if it's worth it. Could you run a sha256 on each termbin upload and see how many of them are unique? Could just run something like: find ./code -type f -exec sha256sum {} \; | awk '{ print $1 }' | wc -l
Then compare to: find ./code -type f -exec sha256sum {} \; | awk '{ print $1 }' | sort | uniq | wc -l
(The first counts every upload, the second counts distinct checksums.)
What I was saying would be confusing (to the user) is using the actual sha256 as what gets inserted into the URL.
> Download would: check the DB/key-value store for the slug; if it's there, look up the checksum; if the file exists, read/return it.
The download is handled by your web server, not fiche. You'd want the file to represent what the user is going to copy/paste, and have all the logic in your upload.
IMHO, dedup is not fiche’s work. The filesystem can handle this. Just choose an appropriate filesystem…
> using the hash itself as a file name
The "best" scheme IMO would be to internally address files by cryptographic hash, but allow requests to use a runtime-configurable subset of these. You'd then maintain an in-memory mapping of prefix
→ hash
pairs, where the last-modified prefix
value wins. Users that want long-lived pastes can just use the full hash instead.
An additional optimization might be to present the prefix and/or hash as a urlsafe_base64-encoded value (in contrast to a hex-encoded one): a 4-character prefix, for example, encodes 16,777,216 values in base64, compared to 65,536 values in base16. This is also more efficient than your current base36 scheme.
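A sketch of what that encoding could look like, with hash_prefix_slug as an illustrative name, assuming prefix lengths are multiples of 4 characters (i.e. 3-byte groups of the digest):

    /* Sketch: encode the leading bytes of a digest as a urlsafe base64 slug
     * (alphabet A-Z a-z 0-9 - _, no padding). Each group of 3 digest bytes
     * becomes 4 slug characters, so a 4-char slug spans 64^4 = 16,777,216
     * values versus 16^4 = 65,536 for a 4-char hex prefix. */
    #include <stddef.h>

    static const char b64url[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    static void hash_prefix_slug(const unsigned char *digest, size_t groups,
                                 char *out)
    {
        for (size_t i = 0; i < groups; i++) {
            unsigned v = ((unsigned)digest[3*i]     << 16) |
                         ((unsigned)digest[3*i + 1] <<  8) |
                          (unsigned)digest[3*i + 2];
            out[4*i]     = b64url[(v >> 18) & 0x3F];
            out[4*i + 1] = b64url[(v >> 12) & 0x3F];
            out[4*i + 2] = b64url[(v >>  6) & 0x3F];
            out[4*i + 3] = b64url[ v        & 0x3F];
        }
        out[4 * groups] = '\0';
    }

Called with groups = 1 on a SHA-256 digest, this yields a 4-character slug; the in-memory prefix → hash map would then resolve it back to the full digest.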
Ref for what @tYYGH mentioned: https://superuser.com/questions/1547993/filesystems-that-handle-data-duplication-efficiently-and-transparently
Not always possible; afaik you can't emulate that, and your environment is not always what you want it to be. A software solution would be great anyway, perhaps one that can be toggled.
@spikebike, you seem to be confident, might want to give it a shot? :)