fdupes
Cache hashes (based on path and modification time)
Maybe it would be useful to have some kind of hash cache in ~/.cache/fdupes that would allow fdupes to SAFELY reuse hashes if a file's modification time has not changed. Both rsync and make do this all the time (they do not cache, but they compare files based on modification time).
You cannot safely detect modification by looking at the modification time. It is easily reset or modified ("touched") in the filesystem.
So such a "feature" is contrary to the aim of fdupes: detecting real duplicates securely and fast.
It's not really going to be a problem, because a typical "touch" emulates a file modification, so it would invalidate any cached hashes. To really fool fdupes into thinking that a file has not changed when it has, you'd have to modify the file and then touch it with a parameter that sets the modification time back to exactly the value it had before, the one that was cached. If the file has a modification time earlier than the cached time (which may happen when there's a problem with the system clock or an NTP failure), the cache entry can be treated as invalid and fdupes can proceed as usual.
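To make the rule above concrete, here is a minimal sketch in Python. It is not fdupes code; the function name `cached_hash_is_usable` and the idea of storing the mtime in nanoseconds are assumptions made for illustration. A cached hash is reused only when the file's current modification time is exactly the one recorded with the hash, so a plain touch (newer mtime) and a clock glitch (older mtime) both invalidate it.

```python
import os


def cached_hash_is_usable(path, cached_mtime_ns):
    """Return True only if the file's mtime exactly matches the cached one.

    Hypothetical helper: any difference, newer or older, means the
    cached hash must be thrown away and the file hashed again.
    """
    return os.stat(path).st_mtime_ns == cached_mtime_ns
```

The only way past this check is the case described above: changing the content and then setting the mtime back to exactly the cached value.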
Doing something like this can safely be assumed to be stupid, and it's something that nobody who cares about their data will do, because it may lead to data loss with rsync and other backup tools, which are used far more often than fdupes. Also, rsync is still perceived as a popular and secure way to sync data even though it uses this strategy.
Anyway... This can be implemented as an optional feature activated by some command-line flag, and time will show us how popular it gets. I am using fdupes on storage where people don't have direct access to the filesystem, so nobody can mess with timestamps by touching files into the past, which is why this makes a lot of sense for my use case. Also, the storage holds mostly a huge amount of media files that are moved and copied between directories from time to time, but their content is never really modified, so that's not a concern.
Concur with @Harvie. If it's an optional flag, then the sysadmin gets to choose the tradeoff. When it's a system that I maintain privately, I would love to be able to cache the results, especially for large files, and tell fdupes "it's OK, assume no one is monkeying with timestamps" or perhaps also "check the first and last X bytes as a crosscheck to see if the cached checksum needs to be refreshed."
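The "first and last X bytes" crosscheck could look roughly like the sketch below. This is only an illustration in Python, not an existing fdupes option: the name `edge_fingerprint`, the default of 4096 bytes, the use of SHA-256, and the choice to mix in the file size are all assumptions.

```python
import hashlib
import os


def edge_fingerprint(path, edge_bytes=4096):
    """Hash the first and last `edge_bytes` of a file, plus its size.

    Hypothetical cheap crosscheck: if this value differs from the one
    stored next to a cached checksum, the checksum should be refreshed.
    """
    size = os.path.getsize(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(edge_bytes))  # head of the file
        if size > edge_bytes:
            # Read the tail without re-reading bytes already in the head.
            f.seek(max(size - edge_bytes, edge_bytes))
            h.update(f.read(edge_bytes))
    h.update(str(size).encode())
    return h.hexdigest()
```

A check like this is cheap even for very large files, but it cannot notice a change confined to the middle of a file that keeps the same size, so it is a crosscheck rather than a guarantee.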
My workflow when using fdupes would be massively improved if I could run it much more quickly (on subsequent runs). It would enable me to run it much more often.
I've finally implemented a version of this using SQLite. The code looks at file creation time, modification time, size, and inode to determine whether a file's contents are likely to have changed. If none of these values has changed, the code assumes the hash remains the same, which could lead to missed duplicates (the old hash doesn't match any files but the new one would) or false matches (the old hash matches but the new one wouldn't). Neither condition should result in lost data, however. In the latter case that is because fdupes will check for an actual match by doing a byte-for-byte comparison before reporting a match.
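For readers who want a feel for the idea, here is a rough sketch in Python of a hash cache keyed on those metadata values. It is not the code from the commit: the table layout, the function names, and the use of SHA-256 are assumptions chosen for illustration, and the timestamp stored is ctime as reported by `os.stat`.

```python
import hashlib
import os
import sqlite3


def open_cache(db_path):
    """Open (or create) the hypothetical cache database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hashes ("
        " path TEXT PRIMARY KEY, inode TEXT, size INTEGER,"
        " mtime_ns INTEGER, ctime_ns INTEGER, digest TEXT)"
    )
    return conn


def file_signature(path):
    """Metadata used to guess whether the contents have changed."""
    st = os.stat(path)
    # The inode is stored as text so very large values fit in SQLite.
    return (str(st.st_ino), st.st_size, st.st_mtime_ns, st.st_ctime_ns)


def hash_file(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_hash(conn, path):
    """Return (digest, was_cache_hit) for `path`."""
    path = os.path.abspath(path)
    sig = file_signature(path)
    row = conn.execute(
        "SELECT inode, size, mtime_ns, ctime_ns, digest"
        " FROM hashes WHERE path = ?",
        (path,),
    ).fetchone()
    if row is not None and tuple(row[:4]) == sig:
        return row[4], True  # metadata unchanged: reuse the stored hash
    digest = hash_file(path)  # new or changed file: hash and remember it
    conn.execute(
        "INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?, ?, ?)",
        (path, *sig, digest),
    )
    conn.commit()
    return digest, False
```

A stale entry here can only produce a wrong candidate, not a wrong result, as long as the caller still confirms every match byte for byte before reporting it.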
Solved by ab5ef95e2b2633d0ca1a5ae0b8ac41abde160100.
That sounds like a great approach!
Correction: the code doesn't look at file creation time but at the inode change time (ctime). The advantage of this is that ctime, unlike modification time, is very hard to fake. You'd have to either change the system clock and time your file changes just right or edit the filesystem metadata directly. In that case it would be easier for a malicious user to just edit the database instead.
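The difference between the two timestamps can be seen with a small Python experiment. This is only a sketch, and the helper name `edit_and_restore_mtime` is made up for illustration. It rewrites a file and then puts the old modification time back, which is the trick discussed earlier in this thread.

```python
import os


def edit_and_restore_mtime(path, new_data):
    """Overwrite `path`, then reset atime/mtime to their old values.

    Returns the os.stat() results from before and after the edit.
    """
    before = os.stat(path)
    with open(path, "wb") as f:
        f.write(new_data)
    # Put atime/mtime back to the values recorded before the edit.
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
    after = os.stat(path)
    return before, after
```

On POSIX systems, `after.st_mtime_ns` equals `before.st_mtime_ns`, but `after.st_ctime_ns` is set by the kernel at the time of the `os.utime` call and cannot be passed in the same way. Note that on Windows `st_ctime` has historically meant creation time, so this comparison applies to Unix-like systems.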