
[Improvement] Feature to check for files [on an external disk] which are *not* present somewhere on the [backup] disk

Open · Wikinaut opened this issue 4 years ago · 3 comments

I wish to have a feature which makes intelligent use of the checksums/hashes of the huge "backup" drive X so that, when I connect a smaller drive Z to my computer, I can quickly list all those files which are

  • present on drive Z; and/but
  • not present on drive X

This is a "one-way" check. I don't want to have the huge list of differences. I only want to know those files from Z which for one reason or another have not been copied (or later moved) to drive X, on any directory there. So basically, it's a checksum/hash issue.

Wikinaut avatar Aug 05 '19 12:08 Wikinaut

Hello, can we talk about such a new feature? If you wish, I can explain again why rsync is not a solution.

It's something like https://askubuntu.com/a/767988

fdupes is an excellent program to find the duplicate files but it does not list the non-duplicate files, which is what you are looking for. However, we can list the files that are not in the fdupes output using a combination of find and grep.

Wikinaut avatar Sep 13 '19 07:09 Wikinaut

OK, an rsync solution should work if the structure in the dest were similar to that in the source, i.e. something like rsync -rl --dry-run --out-format="%f" --checksum Z/ X/

So I presume the structure of your source Z is different from that in dest X, i.e. you want to list files not backed up, no matter where they are in Z, so that you can copy them to the appropriate location in X, etc.

So you want the equivalent of the following, but with more efficient handling of unique file sizes etc:

    $ SRC=Z/; DST=X/
    $ # checksum both trees; the sed prints $DST lines a second time so that
    $ # "uniq -u" below drops any checksum which also occurs under $DST
    $ find $SRC $DST -type f | xargs md5sum | sed "\|  $DST|p" |
      sort | uniq -w32 -u | cut -d' ' -f3

One could avoid the overhead of scanning and checksumming $DST if it was not updated between fslint dedupe runs. In that case fslint could write an index of size,checksum,name which could be used directly in the process above.
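This is not an existing fslint feature, just a sketch of the shape of that workflow: assuming a previous dedupe run had written an index file, say dst.idx, with comma-separated size,checksum,name lines for everything under X/ (both the file name and the format are assumptions), the $DST scan above could be replaced by reading that file:

    $ # hash only the source tree; dst.idx already holds size,checksum,name for X/
    $ find Z/ -type f | xargs md5sum > src.sums
    $ # keep the source lines whose checksum does not appear in the saved index
    $ grep -vFf <(cut -d, -f2 dst.idx) src.sums | cut -d' ' -f3-

The usual caveats about filenames containing spaces or commas apply; the point is only that the expensive checksumming of X/ happens once and is reused.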

pixelb avatar Sep 13 '19 09:09 pixelb

Yes, the structure is different, or may be different, so we have to "search" for the file hash.

I also found this proposal for "fdupes" https://github.com/adrianlopezroche/fdupes/issues/19

It would be good to save the hash/parse/analyze information of a specific fdupes run, in order to later compare this "virtual" file tree with a real file tree.


Currently I run the sequence suggested in https://askubuntu.com/a/767988 (see above) to list the files which are unique to backup (Z in my example), i.e. which are in backup but not in documents. [My use case is vice versa: to look for files which are not yet somewhere in the "backup".]

    $ fdupes -r backup/ documents/ > dup.txt
    $ find backup/ -type f | grep -Fxvf dup.txt
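A minimal sketch for the opposite direction (files under documents/ for which fdupes found no duplicate anywhere) just points find at the other tree and reuses the same dup.txt; note that a file whose only duplicates live inside documents/ itself is also filtered out, since fdupes lists those too:

    $ find documents/ -type f | grep -Fxvf dup.txt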

Wikinaut avatar Sep 13 '19 09:09 Wikinaut