Skip hashing and improve file comparison
I think this issue will get closed for the same reason as issue #131, i.e. jdupes currently seems to rely heavily on the hashing phase to build a search tree, if I understand correctly. But I'd like to propose a byte-by-byte file comparison algorithm that could completely replace the first-block comparison, the hashing and the full-file comparison steps.
If I understand correctly, currently, after some grouping by size and other criteria, all the files are read once completely to hash their contents. When hashes match, the files are read again to check their actual content, which almost always matches. This means that if there are N files among which k are duplicates of some others, there will be N+2k complete file reads. So, best case, N full reads; worst case, almost 3N full reads. Which can be quite a bummer for performance when the files are large, the storage is slow and the duplicates not-so-rare. Which is precisely my case.
It should be theoretically possible to read all the files at most once in any case. I propose to do that with some kind of implicit prefix tree here.
I call PMGroup a group of files whose prefixes match up to some offset. The offset is an attribute of the group. PMGroup stands for Prefix Match or Potential Match or Partial Match, your choice. ^^
We first make a single group for all the files for which the content has to be checked (same length, same permissions, etc). Then:
- For every file in a group G, read the next block of data from each file (starting at the beginning)
- Sort those blocks
- Split the files into separate PMGroups based on having an identical block of data. (Prune groups that would contain a single file.)
- Take any existing PMGroup as G and restart step 1.
We get a match when, in step 1, there's no more data to read: all the files in G then match.
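To make this concrete, here is a minimal C sketch of the recursion. The names, the fixed 4 KiB block size and the absence of error handling are only for illustration, and it assumes the files in the initial group are already known to have the same size (which the size grouping guarantees):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096          /* fixed here; could depend on the group size */

struct entry {
    const char *path;
    FILE *fp;
    unsigned char block[BLOCK_SIZE];
    size_t len;                  /* bytes actually read for the current block */
};

/* Order entries by the content of their current block, so that identical
 * blocks end up adjacent after sorting. */
static int cmp_block(const void *a, const void *b)
{
    const struct entry *ea = a, *eb = b;
    if (ea->len != eb->len) return (ea->len < eb->len) ? -1 : 1;
    return memcmp(ea->block, eb->block, ea->len);
}

/* Process one PMGroup: all entries share an identical prefix up to the
 * current offset and are assumed to have the same total size. */
static void pmgroup(struct entry *e, size_t n)
{
    size_t i, start;

    if (n < 2) return;           /* prune groups containing a single file */

    /* Step 1: read the next block of every file in the group. */
    for (i = 0; i < n; i++)
        e[i].len = fread(e[i].block, 1, BLOCK_SIZE, e[i].fp);

    /* No more data to read: every file in the group is a duplicate
     * (equal sizes mean they all hit EOF on the same block). */
    if (e[0].len == 0) {
        printf("duplicate set:\n");
        for (i = 0; i < n; i++)
            printf("  %s\n", e[i].path);
        return;
    }

    /* Steps 2-4: sort the blocks, then split into runs of identical blocks
     * and recurse into each run. Run boundaries are found before recursing
     * because the recursion overwrites the block buffers. */
    qsort(e, n, sizeof(*e), cmp_block);
    start = 0;
    while (start < n) {
        size_t end = start + 1;
        while (end < n && cmp_block(&e[start], &e[end]) == 0)
            end++;
        pmgroup(e + start, end - start);
        start = end;
    }
}

int main(int argc, char **argv)
{
    /* Assumes the files on the command line already form one group
     * (same size, etc.), as the earlier filtering steps would guarantee. */
    struct entry *e = calloc(argc > 1 ? (size_t)(argc - 1) : 1, sizeof(*e));
    size_t n = 0;

    if (e == NULL) return 1;
    for (int i = 1; i < argc; i++) {
        e[n].path = argv[i];
        e[n].fp = fopen(argv[i], "rb");
        if (e[n].fp != NULL) n++;
    }
    pmgroup(e, n);
    return 0;
}
```

A real implementation would turn the recursion into iteration (the depth grows with file size divided by block size), reuse buffers and avoid keeping every file open at once, but the structure would stay the same: read, sort, split, descend.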
With this algorithm, each file is read at most once, and often not even completely. Which should already be much faster than just the hashing step, at least in theory. The step with the worst complexity is the block sorting, which should be OK in almost every case; it replaces the building of the search tree, which is also an O(n log(n)) operation.
The main tricky point, though, is choosing a block size in step 1 that fits in RAM without destroying the I/O performance. A 1 MB block in a group of 1000 files is already 1 GB. To mitigate this issue the block size could be chosen dynamically based on the group size, so that a constant amount of memory is used. This shouldn't be much of an issue anyway, since I'd expect the group sizes to decrease very quickly as we progress through the files' length.
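For instance, the per-file block size could be derived from the group size under a fixed memory budget, along these lines (the budget and bounds are arbitrary placeholders, not jdupes parameters):

```c
#include <stddef.h>

/* Arbitrary placeholder tuning knobs, not jdupes parameters. */
#define MEM_BUDGET (64UL * 1024 * 1024)  /* RAM budget for one whole group */
#define MIN_BLOCK  (4UL * 1024)          /* never go below one 4 KiB sector */
#define MAX_BLOCK  (1UL * 1024 * 1024)   /* cap so single reads stay sane */

/* Pick a per-file block size so that group_size * block_size stays within
 * the budget, clamped and rounded down to a 4 KiB multiple. */
static size_t pick_block_size(size_t group_size)
{
    size_t bs = MEM_BUDGET / (group_size ? group_size : 1);

    if (bs < MIN_BLOCK) bs = MIN_BLOCK;  /* accept exceeding the budget */
    if (bs > MAX_BLOCK) bs = MAX_BLOCK;
    return bs & ~((size_t)4096 - 1);     /* keep it sector-aligned */
}
```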
So, what do you think?
Partial hashing before full hashing: https://github.com/jbruchon/jdupes/blob/master/jdupes.c#L1349
There is a lot of room for optimizations and the hash tree does get in the way of that, but it's not the only thing in the way. The program really needs a rewrite from scratch.
Indeed, the first-block hashing would catch a lot of cases. Although I understand some people have complained about 4 kB not being enough.
But my concern is about files that pass this check, no matter whether they have a large header or are true matches. They make up the bulk of the time spent in my case: deduplicating video files across backups, using the options -rIL. Although, now that I'm testing on the whole file system, partial hashing does seem to do a pretty good job indeed.
I guess partial hashing could also help in the case where a group has too many files to read a significant block size.
In any case, I understand this kind of improvement won't be done in the near future.
I have run quite a few tests on first 4K blocks and I have found that the vast majority of non-dupe files that I had access to didn't have identical first blocks. It's entirely possible that some files have large headers with identical first blocks plus identical sizes, but my real-world data sets have never included any significant amount of such files. I chose 4K because it's the sector size of all modern hard drives, it is enough to reject the vast majority of non-duplicate pairs in my testing, and it's small enough to not read too far into a bunch of files for potentially zero benefit.
You can recompile the program with a bigger partial hash size if you want to, so it reads more header per file.
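For example, as a sketch only (check the source for the actual constant name and default):

```c
/* Assumption: the partial hash read size is a plain compile-time constant
 * in the jdupes source (4 KiB by default); the exact macro name may differ
 * from the one shown here. */
#define PARTIAL_HASH_SIZE 65536   /* read 64 KiB of header per file instead of 4 KiB */
```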
I ran a test on my data: I do have some distinct files with identical first blocks, but they indeed have different sizes.
jdupes doesn't seem to have timing statistics built in, so I modified it to compute the cumulative time spent hashing the full files and confirming matches, in addition to the total execution time, and ran another test.
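The instrumentation is nothing more than a monotonic-clock accumulator around the two code paths, roughly like this (names are mine, not from jdupes):

```c
#include <stdio.h>
#include <time.h>

static double full_hash_seconds = 0.0;   /* cumulative, printed at the end */

static double now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Wrapped around each full-hash call (and, with a second accumulator,
 * around each byte-compare):
 *
 *     double t0 = now_seconds();
 *     ... full-file hashing happens here ...
 *     full_hash_seconds += now_seconds() - t0;
 *
 * and once at the end of the run:
 *
 *     printf("time spent full-hashing: %f seconds\n", full_hash_seconds);
 */
```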
Here's the command and the result:
# time /home/celelibi/code/jdupes/jdupes -rOHDpIm -X size+=:500M media backup/*/*/home/*/*video
Scanning: 916 files, 688 items (in 9 specified)
156 duplicate files (in 34 sets), occupying 115837 MB
63 partial (+0 small) -> 57 full hash -> 57 full (6 partial elim) (0 hash64 fail)
916 total files, 10986 comparisons, branch L 4700, R 5371, both 10071, max tree depth 237
SMA: allocs 14343, free 26270 (merge 7075, repl 3719), fail 933, reuse 15466, scan 311462, tails 4
I/O chunk size: 16 KiB (dynamically sized)
time spent full-hashing: 2381.172964 seconds
time spent byte-comparing: 1283.311699 seconds
total time: 3665.234196 seconds
/home/celelibi/code/jdupes/jdupes -rOHDpIm -X size+=:500M media 37,47s user 176,24s system 5% cpu 1:01:08,84 total
Basically, more than 99% of the time is spent hashing the full files or comparing the files byte by byte. I guess the byte comparison is approximately twice as fast as the full hashing because usually at least one of the files is still in the kernel's disk cache.
My hypothesis is that if we implemented the algorithm I propose, we would get performance identical to skipping the byte comparison (like with option -Q), but without taking any chances with hash collisions. This should be either a net improvement, or change nothing in the cases where the run time is not dominated by the hashing and confirming steps.
Of course 99% of the time is spent hashing. The existing algorithm is so highly optimized that the program is almost exclusively I/O bound. The only improvements to be made involve I/O. I've looked into performing a parallel hash+compare operation as a performance booster, but for that to actually make a difference, the I/O needs to be threaded too, and doing cross-platform threading without bringing in external dependencies is... unpleasant.
It is indeed I/O bound, but not necessarily on the full hashing or comparison. Depending on the case, it might spend most of its time listing directories and reading metadata. But that's indeed not the case I'm interested in.
I'm not sure what you mean by threaded I/O. If you're thinking of the usual threads, you probably want to avoid going down that route. Threads are very much not required in an I/O-bound program. They would mostly bring non-determinism, hard-to-follow logic when they interact, and non-reproducible bugs. There are other ways to do asynchronous I/O that are less error-prone. But in your case, if you want to do the hashing and the comparison at the same time, why not just perform them in the same loop? Nothing fancy is required for that.
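For the two-file case, this rough sketch is what I mean by doing both in the same loop (the hash function is just a placeholder, not the jdupes hash):

```c
#include <stdio.h>
#include <string.h>

#define CHUNK 65536

/* Placeholder rolling hash; jdupes uses its own hash function. */
static void hash_update(unsigned long *h, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        *h = *h * 31 + buf[i];
}

/* Read both files once: hash file b while comparing it byte-for-byte to
 * file a (whose hash is assumed to be already known). Returns 1 if the
 * files are identical; *hash_b always receives the full hash of b. */
static int hash_and_compare(FILE *a, FILE *b, unsigned long *hash_b)
{
    unsigned char bufa[CHUNK], bufb[CHUNK];
    int identical = 1;

    *hash_b = 0;
    for (;;) {
        size_t ra = fread(bufa, 1, CHUNK, a);
        size_t rb = fread(bufb, 1, CHUNK, b);

        if (ra == 0 && rb == 0) break;
        hash_update(hash_b, bufb, rb);           /* keep hashing b... */
        if (ra != rb || memcmp(bufa, bufb, rb) != 0)
            identical = 0;                       /* ...even after a mismatch */
        if (rb == 0) break;
    }
    return identical;
}
```

Whether that actually saves any I/O is a separate question, though.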
BTW, how would this hashing+comparison work? Would the first file get hashed as usual, and then, when another file needs to be hashed, would the comparison be performed at the same time? Or would the two files get hashed and compared at the same time?
In the first case I doubt it would save much time, since (if I'm not mistaken) in the current code the second file is compared immediately after being hashed, so it has a very high chance of still being in the kernel's disk cache. I guess only very large files that don't fit in the cache would benefit from that. But I'd be happy to be proven wrong.
The second case would seem to have a higher chance of saving some I/O, since it is less likely that both files are still in the cache. But how does it generalize to more than 2 files? If they are all hashed and compared at the same time, then the hashing becomes useless. My proposal here is actually a refinement of this idea.