
[feature] Optimize for big files/virtual images

Open kdupke opened this issue 4 years ago • 6 comments

Hi,

I have the need to frequently check a bunch of virtual images for duplicates.

See below an example for a small data set.

Now have 324 files in total.
Removed 179 files due to nonunique device and inode.
Now removing files with zero size from list...removed 3 files
Total size is 3205167303976 bytes or 3 TiB
Now sorting on size:removed 30 files due to unique sizes from list.112 files left.
Now eliminating candidates based on first bytes:removed 6 files from list.106 files left.
Now eliminating candidates based on last bytes:removed 2 files from list.104 files left.
Now eliminating candidates based on sha1 checksum:

The problem is that these files are quite large, but the first and last bytes are often equal, so rdfind falls through to the full checksum calculation.

A secondary problem is that most of these files are sparse, so reading them in full means reading a lot of zeros.
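(For illustration only, and Linux-specific: the data extents of a sparse file can be walked with lseek(SEEK_DATA)/lseek(SEEK_HOLE), which shows how few bytes of such an image are actually backed by data. This is just a sketch, not rdfind's code, and the function name is made up.)

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

// Returns the number of non-hole bytes in `path`, or -1 on error.
long long data_bytes(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st {};
    if (fstat(fd, &st) != 0) { close(fd); return -1; }

    long long total = 0;
    off_t pos = 0;
    while (pos < st.st_size) {
        off_t data = lseek(fd, pos, SEEK_DATA);   // next region containing data
        if (data < 0) break;                      // no more data past pos (ENXIO)
        off_t hole = lseek(fd, data, SEEK_HOLE);  // hole that ends this region
        if (hole < 0) { total += st.st_size - data; break; }
        total += hole - data;
        pos = hole;
    }
    close(fd);
    return total;
}

int main(int argc, char** argv) {
    if (argc > 1)
        std::printf("%s: %lld bytes of actual data\n", argv[1], data_bytes(argv[1]));
    return 0;
}
```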

Would it be possible, in addition to or instead of the first/last bytes, to check the first/last X megabytes? Say 1 MB at the beginning would be sufficient, as this covers either a file system or swap space, which is more likely to differ when the images differ.

It would mean sometimes reading the extra X MB, but for my use case it would dramatically reduce the total amount of data read.
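(A minimal sketch of the idea, not rdfind's actual implementation: hash a bounded head/tail region of each candidate so that large files with differing prefixes or suffixes can be dropped without reading them in full. The function names are made up and FNV-1a merely stands in for whatever hash would really be used.)

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// FNV-1a over a buffer, chained through `h`.
static std::uint64_t fnv1a(const std::vector<char>& buf, std::uint64_t h) {
    for (char c : buf) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ull;
    }
    return h;
}

// Digest of the first `head` and last `tail` bytes of `path`.
std::uint64_t head_tail_digest(const std::string& path,
                               std::uint64_t head, std::uint64_t tail) {
    std::ifstream f(path, std::ios::binary);
    f.seekg(0, std::ios::end);
    if (!f) return 0;
    const auto size = static_cast<std::uint64_t>(f.tellg());

    // Clamp so that files smaller than the requested regions still work.
    head = std::min(head, size);
    tail = std::min(tail, size);

    std::uint64_t digest = 1469598103934665603ull;  // FNV offset basis
    std::vector<char> buf;

    if (head > 0) {
        buf.resize(static_cast<std::size_t>(head));
        f.seekg(0);
        f.read(buf.data(), static_cast<std::streamsize>(head));
        digest = fnv1a(buf, digest);
    }
    if (tail > 0) {
        buf.resize(static_cast<std::size_t>(tail));
        f.seekg(-static_cast<std::streamoff>(tail), std::ios::end);
        f.read(buf.data(), static_cast<std::streamsize>(tail));
        digest = fnv1a(buf, digest);
    }
    // Files with different digests cannot be identical and can be dropped;
    // files with equal digests still go to the full sha1 pass.
    return digest;
}
```

In the example output above, hashing say the first and last MiB of each of the 104 remaining candidates would mean reading on the order of a couple of hundred MiB, instead of the terabytes a full checksum pass over those candidates could touch.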

kdupke avatar Dec 10 '19 13:12 kdupke

This sounds like a good idea!

What do you think about something along

rdfind -firstbytesize N -lastbytesize M .....

The inner workings would have to be modified a bit, so that it hashes the first N / last M bytes instead of just copying them.

pauldreik avatar Dec 10 '19 16:12 pauldreik


So your proposed command switch means that

rdfind -firstbytesize 100 -lastbytesize 10 foo bla

reads the first 100 bytes and the last 10 bytes of foo and bla. If these differ, the files are dropped from the list of files to check.

Is this a modifier of the regular behavior, with N and M defaulting to 1? Or is it in addition to the existing tests?

What happens if the file is smaller than N or M? (just saying)
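(One possible answer, purely as a sketch and not something settled in this thread: clamp the requested region to the file size, so that files smaller than N or M are simply read in full for that step. The function name is made up.)

```cpp
#include <algorithm>
#include <cstdint>

// Never try to read more than the file actually holds.
std::uint64_t clamp_region(std::uint64_t requested, std::uint64_t file_size) {
    return std::min(requested, file_size);
}
```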

kdupke avatar Dec 10 '19 21:12 kdupke

Oh. Like it ;-)

kdupke avatar Dec 10 '19 21:12 kdupke

Yes please. Ideally with a switch of the kind -hashsize 10Mb

rhortal avatar May 26 '20 20:05 rhortal

I'm also interested in this feature becoming reality.

PoC-dev avatar Aug 22 '22 22:08 PoC-dev