Allow setting SomeByteSize for first/last bytes checks
I tried rdfind with firmware update files. The problem with these is that the first 1000 bytes are identical, and even the last 64 (the current default in Fileinfo.hh) don't differ much between the files. So rdfind resorts to calculating checksums, which takes a long time in comparison for large files. Could you maybe add a parameter to set the SomeByteSize value to something higher?
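For illustration, here is a minimal sketch of what a configurable first-bytes check could look like. The function name same_first_bytes and the somebytesize parameter are made up for this example and do not reflect rdfind's actual code structure:

```cpp
// Sketch only: compare the first somebytesize bytes of two files.
// same_first_bytes and somebytesize are illustrative names, not rdfind APIs.
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

bool same_first_bytes(const std::string& a, const std::string& b,
                      std::size_t somebytesize) {
    std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
    std::vector<char> bufa(somebytesize), bufb(somebytesize);
    fa.read(bufa.data(), static_cast<std::streamsize>(somebytesize));
    fb.read(bufb.data(), static_cast<std::streamsize>(somebytesize));
    // Files shorter than somebytesize are fine: gcount() says how much was read.
    return fa.gcount() == fb.gcount() &&
           std::equal(bufa.begin(), bufa.begin() + fa.gcount(), bufb.begin());
}
```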
That sounds like a good idea - I think I would also need to hash the first SomeByteSize bytes in case they exceed the hash buffer size, which may slow things down.
It would be interesting to hear about your use case - how many duplicate files are there, and how big are they? I am thinking about how these partial comparisons could be improved further.
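A rough sketch of that hashing idea: once SomeByteSize exceeds the buffer one is willing to keep per file, store a fixed-size digest of the prefix instead of the raw bytes. FNV-1a is used here only to keep the example dependency-free (it is not what rdfind uses), and hash_first_bytes is a made-up name:

```cpp
// Sketch, not rdfind code: keep a fixed-size digest of the first
// somebytesize bytes instead of the bytes themselves.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

std::uint64_t hash_first_bytes(const std::string& path,
                               std::size_t somebytesize) {
    std::ifstream f(path, std::ios::binary);
    std::uint64_t h = 14695981039346656037ull;  // FNV-1a offset basis
    char buf[4096];
    std::size_t remaining = somebytesize;
    while (remaining > 0 && f) {
        f.read(buf, static_cast<std::streamsize>(std::min(remaining, sizeof buf)));
        const auto got = static_cast<std::size_t>(f.gcount());
        if (got == 0) break;
        for (std::size_t i = 0; i < got; ++i) {
            h ^= static_cast<unsigned char>(buf[i]);
            h *= 1099511628211ull;              // FNV-1a prime
        }
        remaining -= got;
    }
    return h;
}
```

Equal prefixes always produce equal digests, so files can still be partitioned the same way; the occasional collision is harmless because the full-checksum stage still runs afterwards.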
Basically I've mirrored http://gawisp.com/perry/ and since one firmware often works for multiple devices, it appears in multiple files that are all the same. E.g. from the fenix_D2_tactix/ directory, the files D2Delta_520.gcd, D2DeltaS_520.gcd and D2DeltaPX_520.gcd are identical. But while these firmwares are ~10 MiB in size, firmwares for other devices can be up to 600 MiB.
The first/last bytes check does not seem very effective, at least for my data set:
Now have 512092 files in total.
Total size is 3869933516438 bytes or 4 TiB
Removed 10321 files due to unique sizes from list.501771 files left.
Now eliminating candidates based on first bytes:removed 1252 files from list.500519 files left.
Now eliminating candidates based on last bytes:removed 332 files from list.500187 files left.
If you add this option, it would be nice if 0 could turn that feature off.
I recently processed a very large collection of small files, most of them a few megabytes. Opening each file in turn carries a great deal of overhead, and on most modern systems the difference between reading a kilobyte and a few megabytes is negligible. It may be useful, as an optimization, to read a file in full on the first pass if it is not large, and to cache the full-file checksum for the later stage if needed. It may also help with large data sets to collect both the header and the footer in the same pass, instead of two separate ones, as in the sketch below.
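A sketch of that combined single-pass idea, with made-up names (scan_once, ScanResult) and an arbitrary 1 MiB smallness threshold; this is not how rdfind is structured today:

```cpp
// Sketch of the suggestion above, not rdfind's implementation: open each
// file once, read it fully when it is small, and keep the contents around
// so the final checksum pass can skip re-reading it.
#include <cstddef>
#include <fstream>
#include <optional>
#include <string>
#include <vector>

struct ScanResult {
    std::vector<char> head, tail;           // first/last bytes for early elimination
    std::optional<std::vector<char>> full;  // cached contents if the file was small
};

ScanResult scan_once(const std::string& path, std::size_t some_bytes,
                     std::size_t small_limit = 1 << 20) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    const auto size = static_cast<std::size_t>(f.tellg());
    ScanResult r;
    if (size <= small_limit) {              // small file: one read covers everything
        r.full.emplace(size);
        f.seekg(0);
        f.read(r.full->data(), static_cast<std::streamsize>(size));
        return r;
    }
    r.head.resize(some_bytes);              // large file: header and footer in the
    f.seekg(0);                             // same open, two seeks instead of two opens
    f.read(r.head.data(), static_cast<std::streamsize>(some_bytes));
    r.tail.resize(some_bytes);
    f.seekg(-static_cast<std::streamoff>(some_bytes), std::ios::end);
    f.read(r.tail.data(), static_cast<std::streamsize>(some_bytes));
    return r;
}
```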
I do agree that 1000 bytes is a small default, given how cheaply a modern system can read much more data. For efficiency, the data could be compared as a checksum instead of as raw bytes, the same as is done for the file contents, as suggested above.