Cache hash information across invocations
I have a large set of files (300k files, 80G) that is slowly but constantly expanded with new files, while the existing files are never modified. It takes a very long time to scan for duplicates in the root directory each time. A switch for incremental deduplication that stores the previous run's data in ~/.cache/ and takes the user's word that the existing files haven't been modified since the last deduplication would be a great help - the time would then be spent only on checking the recent files.
Is the time being spent on the "scanning" phase (where the directory and file counters go up) or on the duplication checking phase that follows it? There is no way to discover the new files without scanning the specified directory trees completely, so the first phase can't really be sped up.
Could you compile with make DEBUG=1 and use the -D switch on your next scan, then post the debug stats from the end of the output here? It'll be helpful to see what the different parts of the matching algorithm are doing with the data.
Oh, the time is spent on the duplication checking phase - the file discovery finishes very quickly. Sorry for being ambiguous.
Stats after dropping the disk cache with sync && echo 3 >/proc/sys/vm/drop_caches:
rr-@tornado:~/src/ext/jdupes$ time ./jdupes -DrdN ~/img/net/
Scanning: 294231 files, 365 dirs (in 1 specified)
(...)
104538 partial (+1235 small) -> 539 full hash -> 478 full (103147 partial elim) (0 hash64 fail)
294231 total files, 7334371 comparisons, branch L 4199167, R 2840974, both 7040141
Max tree depth: 48; SMA: allocs 882792, free 542, fail 322, reuse 520, scan 19684665, tails 136
I/O chunk size: 16 KiB (dynamically sized)
./jdupes -DrdN ~/img/net/ 2.01s user 5.76s system 0% cpu 1:06:26.83 total
Stats from a consecutive execution without dropping the disk cache are similar:
rr-@tornado:~/src/ext/jdupes$ time ./jdupes -DrdN ~/img/net/
Scanning: 293753 files, 365 dirs (in 1 specified)
103910 partial (+1233 small) -> 59 full hash -> 0 full (103000 partial elim) (0 hash64 fail)
293753 total files, 7322383 comparisons, branch L 4192433, R 2836198, both 7028631
Max tree depth: 48; SMA: allocs 881846, free 526, fail 324, reuse 510, scan 18220825, tails 122
I/O chunk size: 16 KiB (dynamically sized)
./jdupes -DrdN ~/img/net/ 2.12s user 5.85s system 0% cpu 1:00:27.25 total
Only 59 of your files are being fully hashed. Most of the time is spent hashing the first 4K block of over 100K files, which is a ton of seeks. Do the file modify times change between scans? I could probably code up some routines to save/load a text database of files by name, partial_hash, full_hash, mod_time, with the caveat that the working directory would need to be the same between runs. This would prevent recalculation of partial hashes for unmodified files.
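For illustration, one record in such a text database could be a single whitespace-separated line per file; this is only a sketch of the idea (struct db_entry and parse_db_line() are made-up names, and the layout is an assumption, not the format jdupes later adopted):

/* Hypothetical record layout for a per-file hash cache, one line per file:
 *   partial_hash full_hash mod_time size path
 * Illustrative only; not the actual jdupes hash database format. */
#include <stdio.h>
#include <inttypes.h>

struct db_entry {
    uint64_t partial_hash;  /* hash of the first 4K block */
    uint64_t full_hash;     /* hash of the whole file, 0 if never computed */
    int64_t  mod_time;      /* mtime at the moment the hashes were taken */
    int64_t  size;          /* size, for cheap invalidation */
    char     path[4096];    /* path relative to the working directory */
};

/* Parse one line of the cache file; returns 1 on success, 0 on a malformed line */
static int parse_db_line(const char *line, struct db_entry *e)
{
    return sscanf(line, "%" SCNx64 " %" SCNx64 " %" SCNd64 " %" SCNd64 " %4095[^\n]",
                  &e->partial_hash, &e->full_hash, &e->mod_time, &e->size, e->path) == 5;
}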
Do the file modify times change between scans?
No, they're 100% read-only.
with the caveat that the working directory would need to be the same between runs.
Of course I'm fine with it either way, but maybe the CWD could also be saved to that file (and then restored internally)?
This would prevent recalculation of partial hashes for unmodified files.
Great :)
I was thinking of saving the relative path from the root of the current device as well as the inode number for each file and storing that info in the root of the filesystem (or the current directory as a fallback). This would also allow for cache databases to be stored in multiple places with a specified name and "picked up" by the directory scanning code. Inode + size + time are sufficient information to tell if a file has changed.
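A minimal sketch of that invalidation rule, assuming a cached record that carries the inode, size, and mtime from the previous run (the names here are illustrative, not jdupes internals):

#include <sys/stat.h>

/* Values remembered from the previous run for one file (illustrative names) */
struct cached_id {
    ino_t  inode;
    off_t  size;
    time_t mtime;
};

/* Returns 1 if the file must be re-hashed, 0 if the cached hashes can be trusted */
static int file_changed(const char *path, const struct cached_id *c)
{
    struct stat st;

    if (stat(path, &st) != 0) return 1;     /* can't stat it: treat as changed */
    if (st.st_ino != c->inode) return 1;    /* a different file is at this path */
    if (st.st_size != c->size) return 1;    /* size changed */
    if (st.st_mtime != c->mtime) return 1;  /* contents possibly rewritten */
    return 0;
}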
An alternative that would work for me, and might be simpler to implement, would be something like an --against switch that would look like this:
jdupes -rdN somewhere/new-files --against somewhere/old-files
jdupes would then compare the files in the somewhere/new-files directory against themselves and the files in the somewhere/old-files directory, but wouldn't test the files in somewhere/old-files against each other.
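The pairing rule behind such a hypothetical --against mode could be as simple as tagging each file with the set it came from and skipping pairs where both sides are "old"; a rough sketch (these names are assumptions, not existing jdupes code):

/* Each scanned file is tagged with the set it came from (illustrative only) */
enum file_origin { FROM_NEW, FROM_AGAINST };

struct scanned_file {
    enum file_origin origin;
    /* ...size, hashes, path, etc.... */
};

/* A pair is only worth comparing if at least one side is a "new" file;
 * two files that both come from the --against set are never compared. */
static int should_compare(const struct scanned_file *a, const struct scanned_file *b)
{
    return a->origin == FROM_NEW || b->origin == FROM_NEW;
}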
#54 is indeed a duplicate of this request. It would probably be useful if the cache were created against a relative path, in a sense creating "Volumes", and to introduce a notion of a "Volume Prefix", the path leading to that "Volume". Thus fdupes -r --create-index indexFile /path/to/files would create indexFile with paths relative to /path/to/files, and using that index would look something like: fdupes --read-index indexFile:/prefix -r /prefix (in this case the prefix would be /path/to/files).
If those files were moved, it would be easy to define an alternative path: fdupes --read-index indexFile:/alternativePath -r /alternativePath
And to use a sub-path of that volume index: fdupes --read-index indexFile:/alternativePath -r /alternativePath/Sub/
And I guess one could then define multiple --read-index values.
Additionally, if the prefix is stored somewhere in the cache file, it would be easy to introduce a default behavior such as --use-cache, which would read/write from/to .jdupes/cache/ based on the existing caches.
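For the prefix handling itself, a stored volume-relative path would just be joined with whatever prefix is supplied at read time; a tiny hypothetical helper to show the idea (resolve_indexed_path() is not a real jdupes function):

#include <stdio.h>

/* Turn a volume-relative path from the index into an on-disk path using the
 * prefix given on the command line (e.g. the part after "indexFile:").
 * Returns 0 on success, -1 if the result would not fit. Illustrative only. */
static int resolve_indexed_path(const char *prefix, const char *relative,
                                char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "%s/%s", prefix, relative);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}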
I'm using this on a directory that has >2,000,000 files and ~130TB of data. It's a fantastic program and very useful in my case where ~10% of files are duplicated. The scan portion understandably takes a while to complete, but the hashing takes far, far, longer (as expected). This is the progress after days:
root@nas:/mnt/data # jdupes -LrS .
Scanning: 2027057 files, 132262 items (in 1 specified)
Progress [36502/2027057, 5742 pairs matched] 1% (hashing: 56%)
I completely understand and am fine with the initial run taking weeks on this data, but I'd really love a way to cache whatever data can improve the speed of future runs. This data is never modified, so it wouldn't change. Path, last-modified (for determining which files need to be re-hashed?), hash. Maybe an sqlite db?
Read the article on your journey creating this program. Interesting and great work :)
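To make the sqlite idea concrete, here is a minimal sketch of what such a cache table might look like, using the public sqlite3 C API (purely illustrative; the schema and the file name jdupes_cache.db are assumptions, and jdupes does not actually use sqlite):

/* Build with: cc cache_demo.c -lsqlite3 */
#include <stdio.h>
#include <sqlite3.h>

/* Create an illustrative cache table: one row per file, keyed by path, with
 * the stored mtime/size used to decide whether the saved hash is still valid. */
int main(void)
{
    sqlite3 *db;
    char *err = NULL;
    const char *schema =
        "CREATE TABLE IF NOT EXISTS file_hash ("
        "  path  TEXT PRIMARY KEY,"
        "  mtime INTEGER NOT NULL,"
        "  size  INTEGER NOT NULL,"
        "  hash  BLOB NOT NULL"
        ");";

    if (sqlite3_open("jdupes_cache.db", &db) != SQLITE_OK) return 1;
    if (sqlite3_exec(db, schema, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "schema error: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}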
You might want to have a look at the 'fclones' tool.
Thanks. I see it has a persistent hash cache option. This will help with future scans.
I was thinking of saving the relative path from the root of the current device as well as the inode number for each file and storing that info in the root of the filesystem (or the current directory as a fallback).
@jbruchon I'm wondering if this ever made it into the code? Thank you!
It has not.
Hi. I think there is good value in this request. My use case includes read-only btrfs snapshots. They are by design immutable, so it would be great if the hashing progress could be saved between runs, or so that a run can be paused and resumed.
I am looking to see a use case like this:
jdupes --dedupe --cache /root/.jdupes_cache $file [$file2 ...]
This would load the jdupes hashes from the cache, scan/dedupe only the provided files against those hashes, and then update/save the cache.
Or perhaps even more efficient would be:
jdupes --dedupe --cache /root/.jdupes_cache --stdin
Where jdupes could take the files from stdin.
In order to build/rebuild the cache, one could:
jdupes --dedupe --cache /root/.jdupes_cache /root
Then one could run some file system monitor in the background to invoke jdupes for incremental updates:
fswatch /root ... | jdupes --dedupe --cache /root/.jdupes_cache --stdin &
(Wonderful tool, BTW - saves a ton of space.)
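For what it's worth, a hypothetical --stdin mode would only need to read newline-separated paths; a minimal standalone sketch (process_path() is a placeholder, not a jdupes function):

#include <stdio.h>
#include <string.h>

/* process_path() stands in for whatever the real program would do with one
 * file; here it just prints the path it was handed. */
static void process_path(const char *path)
{
    printf("would check: %s\n", path);
}

int main(void)
{
    char line[4096];

    /* Read newline-separated paths, e.g. piped in from fswatch */
    while (fgets(line, sizeof line, stdin) != NULL) {
        line[strcspn(line, "\n")] = '\0';  /* strip the trailing newline */
        if (line[0] != '\0') process_path(line);
    }
    return 0;
}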
I am working on implementing basic hash database functionality right now. I'll post here again once it's in a functional, testable state. This will be bare-minimum capability, so don't expect much; the first hashdb capability will be sensitive to the current working directory from which jdupes is run and will not do any sort of clever path conversion/translation. It will only store the hashes collected during the current run, plus what was already in the database, minus any invalidated (changed) files. I just finished the hashdb core functions today, but I still need to integrate them into the main program flow. Stay tuned.
Basic functionality is working. Anyone who wants this feature needs to pull the hashdb branch and try it out. Right now the database only works with the exact paths you specify; the smarts to do things like canonicalizing and remapping paths don't exist yet. Try this on a large data set:
jdupes -rqmy . dir
time jdupes -rqm dir
time jdupes -rqmy . dir
The first command preloads the disk caches and generates a database, jdupes_hashdb.txt, in the current directory; the other two then benchmark the time it takes to run to completion. Post results!
The hash database performance problem has been fixed, as well as a bug that caused full hashes to not be added. There is still a segmentation fault on DB write for huge data sets that will be harder to track down, but in my testing on a data set of 300K random image/video files, the hash database let the file comparison phase (after "scanning") skip the 150K files that were already in the database from a previous run I had aborted at the 150K mark. It was so fast that the progress indicator went straight from "0/300000" to "150000/300000".
Try it out, please! https://github.com/jbruchon/jdupes/commit/79cad07c1523928f3e41cea68a7e3b0fd60bd419
Hash database functionality is working properly. The latest release has it. Six years and the biggest request ever is finally fulfilled! https://github.com/jbruchon/jdupes/releases/tag/v1.27.0