[Feature Request]: Single hash output for folders
As the title says, would it be possible to add an option to hash a folder (with all of its contents) and get a single hash as output? Something like dirhash.
I can't find the issue where I detailed why this is not really possible, but after some thinking, I think we can expose a button for something like "Create a virtual sumfile, then hash that", which may work for some limited use cases. The caveat is that it wouldn't be possible to guarantee no Type II errors (i.e. false negatives). We could guarantee that if two hashes match, the contents match, but not the converse: a hash mismatch would not prove that the contents are different.
This is just an idea but let's keep it simple: let's say that for folder hash we choose blake3. We could:
- compute the b3sum of each item in the folder
- concatenate the blake3 strings of all files (taken in alphabetical order), so we have one big string of concatenated checksums
- calculate a final blake3 hash of the concatenated string
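A minimal Python sketch of these three steps, for illustration only. BLAKE3 is not in Python's standard library, so `hashlib.blake2b` is used here as a stand-in; the function names `file_digest` and `folder_digest` are hypothetical. Note that Python's `sorted` orders strings by code point, which is only one possible reading of "alphabetical order".

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Hex digest of one file, read in chunks so large files are not loaded whole."""
    h = hashlib.blake2b()  # assumption: stand-in for BLAKE3, which is not in the stdlib
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def folder_digest(folder: Path) -> str:
    """Hash the concatenation of per-file digests, ordered by relative path."""
    files = sorted(
        (p for p in folder.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(folder).as_posix(),
    )
    # File names only decide the order; they are not part of the hashed data.
    joined = "".join(file_digest(p) for p in files)
    return hashlib.blake2b(joined.encode("ascii")).hexdigest()
```

Because names only influence the ordering, two folders with different file names but the same contents in the same sorted positions produce the same digest.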
Your method does not include the filenames, so any two folders whose files, taken in alphabetical order, have the same contents will get the same hash. Additionally, alphabetical order depends on your current locale. I’d suggest re-encoding the names to UTF-8 (or even better, WTF-8) and just sorting them as byte strings.
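A rough Python sketch of that suggestion, under a few assumptions: `named_folder_digest` is a hypothetical name, `hashlib.blake2b` stands in for BLAKE3, and `os.fsencode` is used to obtain each relative path as bytes (the exact encoding it applies depends on the platform, so this is not a portable WTF-8 implementation).

```python
import hashlib
import os
from pathlib import Path


def named_folder_digest(folder: Path) -> str:
    """Digest over (name, content digest) pairs, sorted by name as byte strings."""
    entries = []
    for p in folder.rglob("*"):
        if p.is_file():
            name = os.fsencode(p.relative_to(folder).as_posix())
            # read_bytes() loads the whole file; fine for a sketch, not for huge files
            entries.append((name, hashlib.blake2b(p.read_bytes()).digest()))
    entries.sort()  # bytes compare by value, so the order does not depend on the locale
    outer = hashlib.blake2b()
    for name, digest in entries:
        # Length-prefix the name so that different (name, digest) splits cannot collide.
        outer.update(len(name).to_bytes(8, "big"))
        outer.update(name)
        outer.update(digest)
    return outer.hexdigest()
```

With the names included, renaming a file changes the folder digest even when every file's content is unchanged.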
Additionally, if your goal is to define the semantics of when two folders are the same, you’ll hit issues like case sensitivity. A.txt and a.txt are the same by default. However, you can enable case sensitivity per folder on NTFS, which would make them nonequivalent. As a bonus factoid: case conversion on NTFS is defined in a file. Therefore, you can make an NTFS filesystem where the uppercase of a is B.
This is why false negatives are not preventable.
Actually, the goal is a fast way to check whether two folders are binary identical even when the filenames differ, so I think filenames shouldn't matter with this goal in mind. So in this scenario:
Folder 1:
A.txt
B.txt
C.txt
Folder 2:
1.jpg
2.tmp
3.cmd
and
A.txt == 1.jpg (same checksum)
B.txt == 2.tmp (same checksum)
C.txt == 3.cmd (same checksum)
then Folder 1 hash == Folder 2 hash
Reorganizing/renaming identical items according to their hash in order to create a folder mirror is another use case, I think...
Well, that’s just your goal. Others might want something else. That’s why I don’t really like this idea: it all depends on each person's hyper-specific use case.
For you, the metadata doesn’t matter except specifically the lexical ordering. Maybe for someone else not even that matters, and they just want to know if two folders contain the same set of files… And another person may care about the modification date…
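To illustrate how much the definition matters, here is a hedged sketch of the "same set of files" variant, read here as "the same multiset of file contents, ignoring names and order". That reading is an assumption, `content_only_digest` is a hypothetical name, and `hashlib.blake2b` again stands in for BLAKE3.

```python
import hashlib
from pathlib import Path


def content_only_digest(folder: Path) -> str:
    """Digest that depends only on file contents, not on names or their order."""
    digests = sorted(
        hashlib.blake2b(p.read_bytes()).digest()
        for p in folder.rglob("*")
        if p.is_file()
    )
    # Sorting the fixed-length digests removes any dependence on file names;
    # duplicates are kept, so two copies of a file differ from one copy.
    return hashlib.blake2b(b"".join(digests)).hexdigest()
```

Each of these variants answers a different question, which is the point made above: no single folder hash fits every use case.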