
Dedupe?

spikebike opened this issue 8 years ago · 7 comments

Seems reasonable to take a sha256 of each submission and keep a checksum -> upload mapping. That way multiple uploads of the same content get the same URL.
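For illustration, a minimal sketch of the idea in Python (fiche itself is written in C; the in-memory table here is a hypothetical stand-in for real persistence):

```python
import hashlib

seen = {}  # checksum -> slug; a real server would persist this

def slug_for_upload(content, new_slug):
    """Reuse the existing slug if these exact bytes were uploaded before."""
    digest = hashlib.sha256(content).hexdigest()
    return seen.setdefault(digest, new_slug)
```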

spikebike commented Apr 23 '16 01:04

That's a cool idea, we definitely should implement this.

solusipse commented Jul 09 '16 15:07

I was at first thinking about using a database to implement this, but it adds quite a bit of complexity to a beautifully simple program.

The 'simple' solution is using the hash itself as a file name, which would be quite a bit more confusing than the generated slugs.

Another solution is having a separate file for each sha256sum with the real slug stored inside. This, however, means twice as many files and a ton of block waste (4k for 6 bytes of real data with the default slug length).

Perhaps a simple constant database would be the best bet for handling this kind of mapping.
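Roughly what that mapping could look like, sketched with Python's stdlib dbm as a stand-in (a true constant database such as cdb is immutable and rebuilt atomically rather than updated in place; the file name and slug below are hypothetical):

```python
import dbm
import hashlib

# One small database file instead of an extra 4k-block file per upload.
with dbm.open("dedupe-map", "c") as db:
    digest = hashlib.sha256(b"paste contents").hexdigest()
    if digest in db:
        slug = db[digest].decode()   # duplicate: reuse the original slug
    else:
        slug = "aX3kQ9"              # hypothetical freshly generated slug
        db[digest] = slug
```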

tomdwaggy commented Dec 18 '16 08:12

On 12/18/2016 12:15 AM, Tom the Dragon wrote:

> I was at first thinking about using a database to implement this, but it adds quite a bit of complexity to a beautifully simple program.
>
> The 'simple' solution is using the hash itself as a file name, which would be quite a bit more confusing than the generated slugs.

Confusing how? It's generally known as CAS (Content Addressable Storage). It's a common way to provide both deduplication and a way to verify what you read is what was written.
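A sketch of the CAS approach (Python for brevity; the storage root is hypothetical): the file name is the hash of the contents, so duplicates land on the same path and every read can be verified against its own name:

```python
import hashlib
import os

STORE = "./code"  # hypothetical storage root

def cas_write(content):
    """Identical uploads hash to the same name, deduplicating for free."""
    name = hashlib.sha256(content).hexdigest()
    path = os.path.join(STORE, name)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(content)
    return name

def cas_read(name):
    """Verify that what we read back is what was written."""
    with open(os.path.join(STORE, name), "rb") as f:
        content = f.read()
    if hashlib.sha256(content).hexdigest() != name:
        raise ValueError("stored content does not match its hash")
    return content
```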

> Another solution is having a separate file for each sha256sum with the real slug stored inside. This, however, means twice as many files and a ton of block waste (4k for 6 bytes of real data with the default slug length).

You already create a file and directory for each upload. Tracking a slug -> checksum mapping in sqlite or even just a simple key/value store doesn't seem that much more complicated.

Each upload would (see the sketch after the download steps below):

- stream to a temp file while streaming through the checksum
- if the checksum already exists, delete the temp file
- if it doesn't, mv the temp file into place and add a slug -> checksum mapping

Download would:

- check the DB/key-value store for existence; if there, look up the checksum
- if it exists, read/return the file
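A minimal sketch of both flows under those assumptions, in Python for brevity (fiche itself is C; the sqlite schema, the code/ layout, and the make_slug helper are all hypothetical):

```python
import hashlib
import os
import sqlite3
import tempfile

db = sqlite3.connect("pastes.db")
db.execute("CREATE TABLE IF NOT EXISTS pastes"
           " (slug TEXT PRIMARY KEY, checksum TEXT UNIQUE)")

def upload(stream, make_slug):
    # Stream to a temp file and through the checksum in one pass.
    # The temp file lives in the storage dir so the final mv is atomic.
    h = hashlib.sha256()
    fd, tmp = tempfile.mkstemp(dir="code")
    with os.fdopen(fd, "wb") as out:
        for chunk in iter(lambda: stream.read(8192), b""):
            h.update(chunk)
            out.write(chunk)
    digest = h.hexdigest()
    row = db.execute("SELECT slug FROM pastes WHERE checksum = ?",
                     (digest,)).fetchone()
    if row:                      # content already exists: drop the temp file
        os.unlink(tmp)
        return row[0]
    slug = make_slug()           # hypothetical slug generator
    os.rename(tmp, os.path.join("code", slug))
    db.execute("INSERT INTO pastes VALUES (?, ?)", (slug, digest))
    db.commit()
    return slug

def download(slug):
    # Check the store for the slug, then read/return the file.
    row = db.execute("SELECT checksum FROM pastes WHERE slug = ?",
                     (slug,)).fetchone()
    if row is None:
        raise FileNotFoundError(slug)
    with open(os.path.join("code", slug), "rb") as f:
        return f.read()
```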

> Perhaps a simple constant database would be the best bet for handling this kind of mapping.

Hard to say if it's worth it. Could you run a sha256 on each termbin upload and see how many of them are unique? Could just run something like:

find ./code -type f -exec sha256sum {} \; | awk '{ print $1 }' | wc -l

Then compare to:

find ./code -type f -exec sha256sum {} \; | awk '{ print $1 }' | sort | uniq | wc -l

spikebike commented Dec 18 '16 16:12

What I was saying would be confusing (to the user) is using the actual sha256 as the value that gets inserted into the URL.

> Download would: check the DB/key-value store for existence; if there, look up the checksum; if it exists, read/return the file.

The download is handled by your web server, not fiche. You'd want the file to represent what the user is going to copy/paste, and have all the logic in your upload.

tomdwaggy commented Dec 19 '16 05:12

IMHO, dedup is not fiche's job. The filesystem can handle this. Just choose an appropriate filesystem…

tYYGH commented Dec 19 '16 09:12

> using the hash itself as a file name

The "best" scheme IMO would be to internally address files by cryptographic hash, but allow requests to use a runtime-configurable subset of these. You'd then maintain an in-memory mapping of prefixhash pairs, where the last-modified prefix value wins. Users that want long-lived pastes can just use the full hash instead.

An additional optimization might be to present the prefix and/or hash as a urlsafe_base64-encoded value (in contrast to a hex-encoded one), where, for example, a 4-character prefix encodes 16777216 values in base64, compared to 65536 values in base16. This is also more efficient than your current base36 scheme.
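For example (a sketch; the function name is made up), each urlsafe-base64 character carries 6 bits versus 4 for hex, hence 64^4 = 16,777,216 values against 16^4 = 65,536 for a 4-character prefix:

```python
import base64
import hashlib

def b64_prefix(content, chars=4):
    # Take the first `chars` characters of the urlsafe-base64 digest;
    # the same-length prefix spans a much larger namespace than hex.
    digest = hashlib.sha256(content).digest()
    return base64.urlsafe_b64encode(digest).decode()[:chars]
```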

buhman commented Oct 11 '17 09:10

Ref for what @tYYGH mentioned: https://superuser.com/questions/1547993/filesystems-that-handle-data-duplication-efficiently-and-transparently

That's not always possible, though; afaik you can't emulate it, and your environment is not always what you want it to be. A software solution would be great anyway, perhaps one that can be toggled.

@spikebike, you seem to be confident, might want to give it a shot? :)

martin-braun commented Jan 18 '24 16:01