
Cache hash information across invocations

rr- opened this issue 8 years ago

I have a large set (300k, 80G) of files that is slowly but constantly expanded with new files, whereas the existing files are never modified. It takes a very long time to scan for duplicates in the root directory each time. Having a switch for incremental deduplication that stores the previous run's data in ~/.cache/ and takes the user's word that the existing files weren't modified since the last deduplication would be a great help - the time would be spent only on checking the recent files.

rr- avatar Mar 04 '17 18:03 rr-

Is the time being spent on the "scanning" phase (where the directory and file counters go up) or on the duplication checking phase that follows it? There is no way to discover the new files without scanning the specified directory trees completely, so the first phase can't really be sped up.

Could you compile with make DEBUG=1, use the -D switch on your next scan, and post the debug stats from the end of the output here? It'll be helpful to see what the different parts of the matching algorithm are doing with the data.

jbruchon avatar Mar 04 '17 18:03 jbruchon

Oh, the time is spent on the duplication checking phase - the file discovery finishes very quickly. Sorry for being ambiguous.

Stats after dropping disk cache with sync && echo 3 >/proc/sys/vm/drop_caches:

rr-@tornado:~/src/ext/jdupes$ time ./jdupes -DrdN ~/img/net/
Scanning: 294231 files, 365 dirs (in 1 specified)
(...)
104538 partial (+1235 small) -> 539 full hash -> 478 full (103147 partial elim) (0 hash64 fail)
294231 total files, 7334371 comparisons, branch L 4199167, R 2840974, both 7040141
Max tree depth: 48; SMA: allocs 882792, free 542, fail 322, reuse 520, scan 19684665, tails 136
I/O chunk size: 16 KiB (dynamically sized)
./jdupes -DrdN ~/img/net/  2.01s user 5.76s system 0% cpu 1:06:26.83 total

Stats from a consecutive run without dropping the disk cache are similar:

rr-@tornado:~/src/ext/jdupes$ time ./jdupes -DrdN ~/img/net/
Scanning: 293753 files, 365 dirs (in 1 specified)

103910 partial (+1233 small) -> 59 full hash -> 0 full (103000 partial elim) (0 hash64 fail)
293753 total files, 7322383 comparisons, branch L 4192433, R 2836198, both 7028631
Max tree depth: 48; SMA: allocs 881846, free 526, fail 324, reuse 510, scan 18220825, tails 122
I/O chunk size: 16 KiB (dynamically sized)
./jdupes -DrdN ~/img/net/  2.12s user 5.85s system 0% cpu 1:00:27.25 total

rr- avatar Mar 04 '17 20:03 rr-

Only 59 of your files are being fully hashed. Most of the time is being spent hashing the first 4K block of over 100K files, which is a ton of seeks. Do the file modify times change between scans? I could probably code up some routines that would allow saving/loading a text database of files by name, partial_hash, full_hash, mod_time, with the caveat that the working directory would need to be the same between runs. This would prevent recalculation of partial hashes for unmodified files.
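
For illustration only, here is a rough C sketch of what one record in such a text database could look like and how it might be parsed. The field order, tab separator, and file name are invented for this example; this is not jdupes code.

#include <stdio.h>
#include <inttypes.h>

struct cache_entry {
	char name[4096];        /* path as given on the command line */
	uint64_t partial_hash;  /* hash of the first 4K block */
	uint64_t full_hash;     /* hash of the whole file */
	int64_t mtime;          /* modification time when the hashes were taken */
};

/* Parse one tab-separated line of the form:
   <partial_hash> <full_hash> <mtime> <name>   (hex, hex, decimal, path) */
static int parse_cache_line(const char *line, struct cache_entry *e)
{
	return sscanf(line, "%" SCNx64 "\t%" SCNx64 "\t%" SCNd64 "\t%4095[^\n]",
	              &e->partial_hash, &e->full_hash, &e->mtime, e->name) == 4;
}

int main(void)
{
	struct cache_entry e;
	const char *line = "1a2b3c4d\tdeadbeefcafef00d\t1488650000\timg/net/example.jpg";

	if (parse_cache_line(line, &e))
		printf("%s: partial=%016" PRIx64 " full=%016" PRIx64 " mtime=%" PRId64 "\n",
		       e.name, e.partial_hash, e.full_hash, e.mtime);
	return 0;
}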

jbruchon avatar Mar 04 '17 21:03 jbruchon

Do the file modify times change between scans?

No, they're 100% read-only.

with the caveat that the working directory would need to be the same between runs.

Of course I'm fine with it either way, but maybe the CWD could also be saved to that file (and then restored internally)?

This would prevent recalculation of partial hashes for unmodified files.

Great :)

rr- avatar Mar 04 '17 21:03 rr-

I was thinking of saving the relative path from the root of the current device as well as the inode number for each file and storing that info in the root of the filesystem (or the current directory as a fallback). This would also allow for cache databases to be stored in multiple places with a specified name and "picked up" by the directory scanning code. Inode + size + time are sufficient information to tell if a file has changed.
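
As a sketch of that change test (the struct name and layout are hypothetical, not jdupes internals), the check could look roughly like this:

#include <stdio.h>
#include <sys/stat.h>

/* Cached metadata recorded when a file's hashes were stored. */
struct cached_stat {
	ino_t inode;
	off_t size;
	time_t mtime;
};

/* Returns 1 if the file on disk still matches the cached inode, size and
   mtime, meaning its stored hashes can be reused without re-reading it. */
static int cache_entry_still_valid(const char *path, const struct cached_stat *c)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return 0; /* vanished or unreadable: treat as changed */
	return st.st_ino == c->inode &&
	       st.st_size == c->size &&
	       st.st_mtime == c->mtime;
}

int main(void)
{
	struct cached_stat c = { 0, 0, 0 }; /* placeholder cached values */
	printf("still valid: %d\n", cache_entry_still_valid("somefile", &c));
	return 0;
}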

jbruchon avatar Mar 04 '17 23:03 jbruchon

An alternative that would work for me, and might be simpler to implement, would be something like an --against switch that would look like this:

jdupes -rdN somewhere/new-files --against somewhere/old-files

jdupes would then compare the files in the somewhere/new-files directory against themselves and the files in the somewhere/old-files directory, but wouldn't test the files in somewhere/old-files against each other.
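
A minimal sketch of those proposed semantics, with hypothetical names and not the way jdupes actually structures its file list: a pair is compared unless both members come from the --against set.

#include <stdbool.h>
#include <stdio.h>

struct file_entry {
	const char *path;
	bool from_against_set; /* true if the file was found under --against */
};

static bool should_compare(const struct file_entry *a, const struct file_entry *b)
{
	/* Skip pairs where both files are reference-only; everything else
	   (new vs. new, new vs. old) is still checked for duplicates. */
	return !(a->from_against_set && b->from_against_set);
}

int main(void)
{
	struct file_entry n  = { "somewhere/new-files/a.jpg", false };
	struct file_entry o1 = { "somewhere/old-files/b.jpg", true };
	struct file_entry o2 = { "somewhere/old-files/c.jpg", true };

	printf("new vs old: %d\n", should_compare(&n, &o1));  /* 1: compared */
	printf("old vs old: %d\n", should_compare(&o1, &o2)); /* 0: skipped */
	return 0;
}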

rr- avatar Apr 24 '17 17:04 rr-

#54 is indeed a duplicate of this request. It would probably be useful if the cache were created against a relative path, in a sense creating "Volumes", and a notion of a "Volume Prefix" were introduced, which is the path leading to that "Volume". Thus fdupes -r --create-index indexFile /path/to/files would create indexFile with paths relative to /path/to/files, and using that index would look something like fdupes --read-index IndexFile:/prefix -r /prefix (in this case the prefix would be /path/to/files).

If those files were moved, it would be easy to define an alternative path: fdupes --read-index IndexFile:/alternativePath -r /alternative/Path

And to use a sub-path of that volume index: fdupes --read-index IndexFile:/alternativePath -r /alternativePath/Sub/

And I guess one could then define multiple --read-index values.

Additionally, if the prefix is stored somewhere in the cache file, it would be easy to introduce a default behavior, such as --use-cache, which would read/write from/to .jdupes/cache/ based on existing caches.

@0x2620
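
As a rough illustration of the prefix idea (all names here are made up): an index entry stored relative to the volume root is joined with whatever prefix is supplied when the index is read, so the same entry resolves under the original or an alternative location.

#include <stdio.h>

/* Join the prefix supplied at --read-index time with a path stored relative
   to the volume root. Function and file names are hypothetical. */
static void resolve_index_path(char *out, size_t outlen,
                               const char *prefix, const char *relative)
{
	snprintf(out, outlen, "%s/%s", prefix, relative);
}

int main(void)
{
	char path[4096];

	/* The same index entry resolves under the original and an alternative prefix. */
	resolve_index_path(path, sizeof(path), "/path/to/files", "Sub/photo.jpg");
	printf("%s\n", path);
	resolve_index_path(path, sizeof(path), "/alternativePath", "Sub/photo.jpg");
	printf("%s\n", path);
	return 0;
}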

maysara avatar Jul 15 '17 12:07 maysara

I'm using this on a directory that has >2,000,000 files and ~130TB of data. It's a fantastic program and very useful in my case, where ~10% of files are duplicated. The scan portion understandably takes a while to complete, but the hashing takes far, far longer (as expected). This is the progress after days:

root@nas:/mnt/data # jdupes -LrS .
Scanning: 2027057 files, 132262 items (in 1 specified)
Progress [36502/2027057, 5742 pairs matched] 1% (hashing: 56%)

I completely understand and am fine with the initial run taking weeks on this data, but I'd really love a way to cache whatever data can improve the speed of future runs. This data is never modified, so it wouldn't change. Path, last-modified (for determining which files need to be re-hashed?), hash. Maybe an sqlite db?

I read the article on your journey creating this program. Interesting and great work :)
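
For what it's worth, a minimal sketch of what such an sqlite cache schema might look like, using the public sqlite3 C API; the table layout and database file name are assumptions, not anything jdupes defines.

/* build with: cc cache_schema.c -lsqlite3 */
#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
	sqlite3 *db;
	char *err = NULL;

	if (sqlite3_open("hash_cache.db", &db) != SQLITE_OK)
		return 1;
	/* One row per file: path, mtime (to decide what must be re-hashed), hash. */
	if (sqlite3_exec(db,
	                 "CREATE TABLE IF NOT EXISTS file_hashes ("
	                 "  path  TEXT PRIMARY KEY,"
	                 "  mtime INTEGER,"
	                 "  hash  TEXT)",
	                 NULL, NULL, &err) != SQLITE_OK) {
		fprintf(stderr, "sqlite error: %s\n", err);
		sqlite3_free(err);
	}
	sqlite3_close(db);
	return 0;
}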

Jorsher avatar Jan 08 '23 22:01 Jorsher

You might want to have a look at the 'fclones' tool.

EsEnZeT avatar Jan 09 '23 12:01 EsEnZeT

You might want to have a look at the 'fclones' tool.

Thanks. I see it has a persistent hash cache option. This will help with future scans.

Jorsher avatar Jan 09 '23 14:01 Jorsher

I was thinking of saving the relative path from the root of the current device as well as the inode number for each file and storing that info in the root of the filesystem (or the current directory as a fallback).

@jbruchon I'm wondering if this ever made it into the code? Thank you!

patrickwolf avatar Mar 17 '23 21:03 patrickwolf

It has not.

jbruchon avatar Mar 17 '23 21:03 jbruchon

Hi. I think there is good value in this request. My use case includes read-only btrfs snapshots. They are immutable by design, so it would be great if the hashing progress could be saved between runs, or if a run could be paused and resumed.

Forza-tng avatar May 08 '23 09:05 Forza-tng

I would like to see a use case like this:

jdupes --dedupe --cache /root/.jdupes_cache $file [$file2 ...]

This would load jdupes' hashes from the cache, scan/dedupe only the provided files against those hashes, and then update/save the cache.

Or perhaps even more efficient would be:

jdupes --dedupe --cache /root/.jdupes_cache --stdin

Where jdupes could take the files from stdin.

In order to build or rebuild the cache, one could:

jdupes --dedupe --cache /root/.jdupes_cache /root

Then one could run some file system monitor in the background to invoke jdupes for incremental updates:

fswatch /root ... | jdupes --dedupe --cache /root/.jdupes_cache --stdin &

(wonderful tool, BTW - saves a ton of space)

eriede avatar Jun 12 '23 20:06 eriede

I am working on implementing basic hash database functionality right now. I'll post here again once it's in a functional, testable state. This will be bare-minimum capability, so don't expect much; the first hashdb capability will be sensitive to the current working directory from which jdupes is run and will not do any sort of clever path conversion/translation. It will only store the hashes collected during the current run, plus what was already in the database, minus any invalidated (changed) files. I just finished the hashdb core functions today, but I still need to integrate them into the main program flow. Stay tuned.
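
A simplified sketch of that update rule, assuming a hypothetical record layout (this is not the actual hashdb code): unchanged old entries and entries collected this run survive the write, and invalidated ones are dropped.

#include <stdbool.h>
#include <stdio.h>

struct db_entry {
	char path[4096];
	unsigned long long partial_hash, full_hash;
	long long mtime;
	bool invalidated; /* set when the scan finds the file changed or gone */
};

/* Keep unchanged old entries and entries added/refreshed this run; drop the
   invalidated ones. Returns the number of entries that survive the write. */
static size_t compact_db(struct db_entry *db, size_t count)
{
	size_t kept = 0;

	for (size_t i = 0; i < count; i++)
		if (!db[i].invalidated)
			db[kept++] = db[i];
	return kept;
}

int main(void)
{
	static struct db_entry db[3]; /* zero-initialized example entries */

	db[1].invalidated = true; /* pretend the second file was modified */
	printf("kept %zu of 3 entries\n", compact_db(db, 3));
	return 0;
}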

jbruchon avatar Aug 21 '23 23:08 jbruchon

Basic functionality is working. Anyone who wants this feature needs to pull the hashdb branch and try it out. Right now the database works only with the exact paths you specify; the smarts to do things like canonicalizing paths and remapping them don't exist yet. Try this on a large data set:

jdupes -rqmy . dir
time jdupes -rqm dir
time jdupes -rqmy . dir

The first command preloads the disk caches and generates a database, jdupes_hashdb.txt, in the current directory; the other two benchmark the time it takes to run to completion. Post results!

jbruchon avatar Aug 22 '23 19:08 jbruchon

The hash database performance problem has been fixed, as well as a bug that caused full hashes to not be added. There is still a segmentation fault crash on DB write for huge data sets that will be harder to track down, but in my testing on a data set of 300K random image/video files, the hash database let the file comparison phase (after "scanning") skip over the 150K files that were already in the database from a previous run I had aborted at the 150K mark. It was so fast that the progress indicator went straight from "0/300000" to "150000/300000".

Try it out, please! https://github.com/jbruchon/jdupes/commit/79cad07c1523928f3e41cea68a7e3b0fd60bd419

jbruchon avatar Aug 24 '23 19:08 jbruchon

Hash database functionality is working properly. The latest release has it. Six years and the biggest request ever is finally fulfilled! https://github.com/jbruchon/jdupes/releases/tag/v1.27.0

jbruchon avatar Aug 25 '23 07:08 jbruchon