
[Feature Request]: Single hash output for folders

Open graphixillusion opened this issue 8 months ago • 5 comments

As the title says, is it possible to have the option to hash a folder (with all the relative contents) with an output of a single hash? Something like dirhash

graphixillusion avatar Jul 04 '25 15:07 graphixillusion

I can't find the issue where I detailed why this is not really possible, but after some thinking, I think we can expose a button for something like "Create a virtual sumfile, then hash that", which may work for some limited use cases. The caveat is that it wouldn't be possible to guarantee no Type II errors (i.e. false negatives). We could guarantee that if two hashes match, the contents match, but not the opposite (that if the hashes mismatch, the contents are different).

namazso avatar Jul 04 '25 16:07 namazso

This is just an idea but let's keep it simple: let's say that for folder hash we choose blake3. We could:

  1. b3sum of each single item of the folder
  2. concatenate each file's blake3 string (taken in alphabetical order) so we have one big string of concatenated checksums
  3. calculate a final blake3 hash of the concatenated string
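
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `hashlib.blake2b` stands in for BLAKE3 (which is not in the Python standard library), "alphabetical order" is approximated by Python's default path ordering, and the names `file_digest` and `folder_digest` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Step 1: hex digest of one file, read in chunks to bound memory use."""
    h = hashlib.blake2b()  # assumption: stand-in for BLAKE3
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def folder_digest(folder: Path) -> str:
    """Steps 2-3: concatenate per-file digests in path order, then hash that string."""
    # Assumption: default Path ordering approximates "alphabetical order".
    files = sorted(p for p in folder.rglob("*") if p.is_file())
    concatenated = "".join(file_digest(p) for p in files)
    return hashlib.blake2b(concatenated.encode("ascii")).hexdigest()
```

Note that file names only influence the ordering here; they are never fed into the final hash.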

graphixillusion avatar Jul 04 '25 18:07 graphixillusion

Your method does not include the filename, so any two folders whose files have the same contents when taken in alphabetical order will get the same hash. Additionally, alphabetical order depends on your current locale. I’d suggest re-encoding the names to UTF-8 (or even better, WTF-8) and just sorting them as byte strings.
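
A minimal sketch of that suggestion, assuming `hashlib.blake2b` as a stand-in for BLAKE3 and a hypothetical function name `named_folder_digest`. Python's `surrogatepass` error handler is used here as a rough approximation of WTF-8, since it lets unpaired surrogates in a name be encoded instead of raising an error.

```python
import hashlib
from pathlib import Path


def named_folder_digest(folder: Path) -> str:
    """Hash (relative name, content digest) pairs, ordered by the name's bytes."""
    entries = []
    for p in folder.rglob("*"):
        if p.is_file():
            # Encode the relative path to bytes so ordering does not depend on locale.
            rel = p.relative_to(folder).as_posix()
            entries.append((rel.encode("utf-8", "surrogatepass"), p))
    entries.sort(key=lambda e: e[0])  # plain byte-string comparison

    outer = hashlib.blake2b()  # assumption: stand-in for BLAKE3
    for name, p in entries:
        # Reads the whole file at once; fine for a sketch, not for huge files.
        content = hashlib.blake2b(p.read_bytes()).digest()
        # Length-prefix the name so name/digest boundaries are unambiguous.
        outer.update(len(name).to_bytes(8, "little"))
        outer.update(name)
        outer.update(content)
    return outer.hexdigest()
```

With names included, two folders holding identical bytes under different file names no longer collide. Case-sensitivity differences are deliberately not handled.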

Additionally, if your goal is to define the semantics of when two folders are the same, you’ll hit issues like case sensitivity. A.txt and a.txt are the same by default. However, you can enable case sensitivity per folder on NTFS, which would make them nonequivalent. As a bonus factoid: case conversion on NTFS is defined in a file. Therefore, you can make an NTFS filesystem where the uppercase of a is B.

This is why false negatives are not preventable.

namazso avatar Jul 04 '25 18:07 namazso

Actually, the goal is a fast way to check whether two folders are binary identical even when the filenames differ, so with this goal in mind I think the filenames shouldn't matter. So in this scenario:

Folder 1: A.txt B.txt C.txt

Folder 2: 1.jpg 2.tmp 3.cmd

and A.txt == 1.jpg (same checksum), B.txt == 2.tmp (same checksum), C.txt == 3.cmd (same checksum)

then Folder 1 hash == Folder 2 hash

Reorganizing or renaming identical items according to their hashes to create a folder mirror is another use case, I think...

graphixillusion avatar Jul 04 '25 19:07 graphixillusion

Well, that’s just your goal. Others might want something else. That’s why I don’t really like this idea: it all comes down to each person's hyper-specific use case.

For you the metadata doesn’t matter except specifically the lexical ordering. Maybe for someone not even that matters and just wants to know if two folders contain the same set of files… And another person may care about modify date…
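
To illustrate how much the definition changes the result, here is a sketch of the "same set of files" variant, where neither names nor ordering matter. It assumes `hashlib.blake2b` as a stand-in for BLAKE3, and `content_set_digest` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def content_set_digest(folder: Path) -> str:
    """Digest that depends only on the file contents, not on names or order."""
    # Sorting the fixed-length per-file digests removes any dependence on names.
    digests = sorted(
        hashlib.blake2b(p.read_bytes()).digest()  # assumption: stand-in for BLAKE3
        for p in folder.rglob("*")
        if p.is_file()
    )
    return hashlib.blake2b(b"".join(digests)).hexdigest()
```

A variant that cares about modification dates would need to feed yet another field into the hash, which is the point: each use case yields a different, incompatible folder hash.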

namazso avatar Jul 04 '25 19:07 namazso