XXH3 output format, non-BSD style checksum lines

Open therube opened this issue 3 years ago • 5 comments

From reading other posts about XXH3, it seems you want XXH3 output to be "different", so I assume that is why the -H3 CLI switch specifically produces BSD-style checksum lines.

It would be nice, though, to also have the option of the "default" output style: hash on the left, name on the right. (That format is much more human-readable, at least to me, for my use case.)

(Not a big deal... But given there are performance improvements over XXH64, and I don't specifically need the longer hash of -H128, XXH3 in a readable format would fit the bill nicely.

fwiw, i5-3570K 3.40 GHz, 16 GB RAM, Win7 x64:

C:\BIN>xxh -b
xxh 0.8.0 by Yann Collet
compiled as 64-bit x86_64 autoVec little endian with Clang 10.0.0 (https://github.com/msys2/MINGW-packages.git 7211ffb882cc3b7e7583c518aad45a22b278bc81)
Sample of 100 KB...

  • 1#XXH32 : 102400 -> 72110 it/s ( 7042.0 MB/s)
  • 3#XXH64 : 102400 -> 143421 it/s (14006.0 MB/s)
  • 5#XXH3_64b : 102400 -> 192579 it/s (18806.5 MB/s)
  • 11#XXH128 : 102400 -> 192002 it/s (18750.2 MB/s)

C:\BIN>xxh -b
xxh 0.8.1 by Yann Collet
compiled as 64-bit x86_64 autoVec little endian with GCC 11.2.0
Sample of 100 KB...

  • 1#XXH32 : 102400 -> 72149 it/s ( 7045.8 MB/s)
  • 3#XXH64 : 102400 -> 144270 it/s (14088.9 MB/s)
  • 5#XXH3_64b : 102400 -> 186771 it/s (18239.4 MB/s)
  • 11#XXH128 : 102400 -> 186414 it/s (18204.5 MB/s)

fwiw2, i7-3770S 3.10 GHz, 8 GB RAM, Win7 x64:

xxh 0.8.0 by Yann Collet
compiled as 64-bit x86_64 autoVec little endian with Clang 10.0.0 (https://github.com/msys2/MINGW-packages.git 7211ffb882cc3b7e7583c518aad45a22b278bc81)
Sample of 100 KB...

  • 1#XXH32 : 102400 -> 68098 it/s ( 6650.2 MB/s)
  • 3#XXH64 : 102400 -> 135702 it/s (13252.1 MB/s)
  • 5#XXH3_64b : 102400 -> 181519 it/s (17726.5 MB/s)
  • 11#XXH128 : 102400 -> 180127 it/s (17590.5 MB/s)

xxh 0.8.1 by Yann Collet
compiled as 64-bit x86_64 autoVec little endian with GCC 11.2.0
Sample of 100 KB...

  • 1#XXH32 : 102400 -> 67974 it/s ( 6638.1 MB/s)
  • 3#XXH64 : 102400 -> 136765 it/s (13356.0 MB/s)
  • 5#XXH3_64b : 102400 -> 173442 it/s (16937.7 MB/s)
  • 11#XXH128 : 102400 -> 173938 it/s (16986.2 MB/s) )
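
As a sanity check on the figures above, the MB/s column can be reproduced from the it/s column and the 102400-byte sample. The sketch below assumes that "MB" here means 2**20 bytes; that is inferred from the numbers, not stated in the benchmark output, and the function name is made up for illustration.

```python
SAMPLE_BYTES = 102400  # "Sample of 100 KB" as printed by the benchmark


def mb_per_s(iterations_per_s):
    """Convert it/s on the 100 KB sample to MB/s, assuming 1 MB = 2**20 bytes."""
    return iterations_per_s * SAMPLE_BYTES / 2**20


# i5-3570K, xxh 0.8.0 figures quoted in this comment
xxh64 = mb_per_s(143421)
xxh3 = mb_per_s(192579)

print(f"XXH64:    {xxh64:.1f} MB/s")
print(f"XXH3_64b: {xxh3:.1f} MB/s")
print(f"XXH3_64b / XXH64: {xxh3 / xxh64:.2f}x")
```

On these numbers XXH3_64b comes out at about 1.34 times the XXH64 throughput, which is the performance improvement referred to above.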

Thank you.

therube avatar Dec 03 '21 20:12 therube

The main issue here is to ensure there is no confusion with XXH64. A format with "just" 16 hex characters on the left and the name on the right doesn't "tell" whether it's XXH64 or XXH3; they both look the same. This is a recipe for confusion, not just for xxhsum, but also for any downstream program using xxhsum.

Cyan4973 avatar Dec 03 '21 21:12 Cyan4973

Some option to produce the "traditional" sum format would be fine. If someone deliberately specifies it, they know what they are doing. I always give hash sum files a descriptive extension so I know in the future what they are.

$ xyzsum file > file.xyz
# for example:
$ sha1sum file > file.sha1
$ xxhash -H1 file > file.xxh64

Maybe more importantly, a sum file with the hash first, then the file name, is easier to parse. What if the file name contains (, ) or =? You need a specialised parser to deal with this particular format.
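
To make the parsing concern concrete, here is a minimal Python sketch of both line styles. The function names are hypothetical and none of this is xxhsum's own code; it only illustrates that the hash-first line needs a single split, while the BSD-style line has to be anchored on the final `) = ` because the file name may itself contain `(`, `)` or `=`.

```python
import re


def parse_gnu_line(line):
    """Parse 'HASH  NAME' (hash first): a single split on the first space."""
    digest, name = line.rstrip("\n").split(" ", 1)
    # Coreutils-style tools put a second space or '*' (binary mode) before the name.
    if name[:1] in (" ", "*"):
        name = name[1:]
    return digest, name


def parse_bsd_line(line):
    """Parse 'ALGO (NAME) = HASH'.

    The greedy '.*' makes the match end at the LAST ') = ' on the line,
    so parentheses or '=' inside NAME do not cut the name short.
    """
    m = re.fullmatch(r"(\w+) \((.*)\) = ([0-9a-fA-F]+)", line.rstrip("\n"))
    if m is None:
        raise ValueError(f"not a BSD-style checksum line: {line!r}")
    algo, name, digest = m.groups()
    return algo, name, digest


tricky = "file (1) = copy.txt"
print(parse_gnu_line(f"2d06800538d394c2  {tricky}"))
print(parse_bsd_line(f"XXH3 ({tricky}) = 2d06800538d394c2"))
```

Neither function handles file names containing newlines, which real tools usually escape.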

tansy avatar Dec 28 '21 17:12 tansy

Reflecting back on this topic, I was wondering if a format like : XXH3_1234567890123456 filename would be suitable.

It would respect the left hash / right name format convention, and output would be impossible to confuse with XXH64 output. Having a XXH3_ prefix as part of the output is unusual (not present in md5sum for example), but seems to solve both requirements.

A parser reading the file would still need to recognize and discard the prefix, but that seems trivial to achieve.
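
As a rough illustration of how little work the prefix would add for a reader, here is a Python sketch. It assumes the line layout shown above (prefixed hash, separator, file name); the function name and the choice to return None for unprefixed hashes are illustrative assumptions, not xxhsum behaviour.

```python
XXH3_PREFIX = "XXH3_"


def parse_prefixed_line(line):
    """Split 'XXH3_<hex>  NAME' into (algorithm, digest, name).

    Returns algorithm None when there is no prefix, leaving it to the
    caller to decide what an untagged hash means.
    """
    field, _, name = line.rstrip("\n").partition(" ")
    # Tolerate the usual second space or '*' marker before the name.
    if name[:1] in (" ", "*"):
        name = name[1:]
    if field.startswith(XXH3_PREFIX):
        return "XXH3", field[len(XXH3_PREFIX):], name
    return None, field, name


print(parse_prefixed_line("XXH3_1234567890123456  filename"))
print(parse_prefixed_line("1234567890123456  filename"))
```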

Cyan4973 avatar May 06 '22 21:05 Cyan4973

In my opinion there should be separate sum programs for xxh and xxh3. They are different hashes after all. For example:

$ xxhsum <file> > file.xxh
$ xxh3sum <file> > file.xxh3

The sums would be as simple as every other hash sum and just as simple to understand and process: first the hexadecimal sum, then a separator (space), then the file name. Adding additional "obfuscation" does nothing good. I would follow the Unix philosophy here:

  1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".
  2. Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats.

tansy avatar May 11 '22 02:05 tansy

Please consider the old VirusTotal/TLSH solution: TLSH adds an optional "T1" prefix to a hash. The "T" is needed because it is not a hex digit, which avoids confusion; the "1" is the version.

Done likewise, an "X3" tag would look better than the needlessly longer "XXH3_" tag. In my view it wouldn't make sense to add it with the BSD option. Of course hashes without the tag should be accepted too, but maybe only with explicit options.

Sample:

XXH3 (filename) = 2d06800538d394c2
x32d06800538d394c2  filename
X32D06800538D394C2  filename

The utility must be able to optionally omit the tag ("not recommended") and to accept hashes both with and without "X3" when explicitly asked to. That said, Tansy's proposal is probably the better, cleaner way; if it is possible to do that instead, it would be great.
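
A small Python sketch of the acceptance rule described here, under the assumption that the tag is exactly "X3" (case-insensitive) in front of 16 hex digits. The function name and the `allow_untagged` switch are hypothetical. Because "x" is not a hex digit, a tagged field can never be mistaken for an untagged one.

```python
HEX_DIGITS = set("0123456789abcdef")


def strip_x3_tag(field, allow_untagged=False):
    """Return the lowercase 16-digit digest from 'X3<hex>'.

    Untagged digests are rejected unless the caller opts in explicitly.
    """
    text = field.lower()
    if text.startswith("x3"):
        digest = text[2:]
    elif allow_untagged:
        digest = text
    else:
        raise ValueError("untagged hash; pass allow_untagged=True to accept it")
    if len(digest) != 16 or not set(digest) <= HEX_DIGITS:
        raise ValueError(f"not a 64-bit hex digest: {field!r}")
    return digest


print(strip_x3_tag("x32d06800538d394c2"))
print(strip_x3_tag("X32D06800538D394C2"))
print(strip_x3_tag("2d06800538d394c2", allow_untagged=True))
```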

ghost avatar Jun 16 '22 06:06 ghost

@Cyan4973 one option is to use corz .hash format, which just adds a comment before a hash. Here's a sample:

#md5#xxhsum.exe#[email protected]:28
2382c789ff66f780a54a0bd1025d0adb *xxhsum.exe
#sha1#xxhsum.exe#[email protected]:28
b51b24411585cc4cb2a95f9fa7ce1792224fa0a6 *xxhsum.exe
#xxh3#xxhsum.exe#[email protected]:28
bbca567006781b14 *xxhsum.exe
#xxh128#xxhsum.exe#[email protected]:28
917defb595e103a9bbca567006781b14 *xxhsum.exe
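
For illustration, a reader for this layout could look like the Python sketch below. It is based only on the sample above (a `#algo#name#timestamp` comment followed by a `hash *name` line), not on any corz specification, and the function name and the `TIMESTAMP` placeholder are made up.

```python
def parse_commented_hashes(text):
    """Return (algorithm, digest, name) tuples from comment-tagged hash lines."""
    entries = []
    algo = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # '#xxh3#xxhsum.exe#TIMESTAMP' splits into ['', 'xxh3', 'xxhsum.exe', 'TIMESTAMP']
            fields = line.split("#")
            algo = fields[1] if len(fields) > 1 else None
            continue
        digest, name = line.split(" ", 1)
        if name.startswith("*"):  # binary-mode marker
            name = name[1:]
        entries.append((algo, digest, name))
        algo = None  # each comment applies to the next hash line only
    return entries


sample = (
    "#xxh3#xxhsum.exe#TIMESTAMP\n"
    "bbca567006781b14 *xxhsum.exe\n"
    "#xxh128#xxhsum.exe#TIMESTAMP\n"
    "917defb595e103a9bbca567006781b14 *xxhsum.exe\n"
)
for entry in parse_commented_hashes(sample):
    print(entry)
```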

malvarenga123 avatar Feb 09 '23 04:02 malvarenga123

My 2 cents on this, as a mere (but interested) user:


> The main issue here is to ensure there is no confusion with XXH64. A format with "just" 16 hex characters on the left and the name on the right doesn't "tell" whether it's XXH64 or XXH3; they both look the same. This is a recipe for confusion, not just for xxhsum, but also for any downstream program using xxhsum.

This should not be a major concern: the same "confusion" also happens between md5sum and xxh128sum and no one cares, as the user is expected to know which program (and parameters) were used to generate a given output (using the file extension, for example, as mentioned by @tansy). Of course, if the ambiguity is easy to avoid then great; otherwise don't bother.

The "hash path" format is much better and preferable, if only for this:

> Maybe more importantly, a sum file with the hash first, then the file name, is easier to parse. What if the file name contains (, ) or =? You need a specialised parser to deal with this particular format.

It should be the default format, but I realize it might be too late for such an API-breaking change.


I liked the "hash prefix" idea, if that's handled automatically by xxhsum's parser. That's the approach taken by the Argon2 crypto suite, and possibly many others:

  • Given the hash 45d7ac72e76f242b20b77b9bf9bf9d5915894e669a24e6c6,
  • It gets encoded as $argon2i$v=19$m=65536,t=2,p=4$c29tZXNhbHQ$RdescudvJCsgt3ub+b+dWRWJTmaaJObG

So the encoded form contains the algo, its parameters, salt, and (base64'ed) hash itself in a single string. I personally dislike $ as a separator, but the idea of outputting everything --verify needs to know is great.
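
To show what "everything in a single string" means in practice, here is a Python sketch that takes the encoded example above apart again. The function name is hypothetical and the field layout is read off the example string rather than taken from a specification; the base64 fields are unpadded, so padding is restored before decoding.

```python
import base64


def decode_argon2(encoded):
    """Split '$argon2i$v=19$m=...,t=...,p=...$<salt>$<hash>' into its parts."""
    # The leading '$' produces an empty first field.
    _, variant, version, params, salt_b64, hash_b64 = encoded.split("$")

    def b64decode_unpadded(s):
        return base64.b64decode(s + "=" * (-len(s) % 4))

    return {
        "variant": variant,
        "version": int(version.split("=")[1]),
        "params": {k: int(v) for k, v in (p.split("=") for p in params.split(","))},
        "salt": b64decode_unpadded(salt_b64),
        "hash_hex": b64decode_unpadded(hash_b64).hex(),
    }


info = decode_argon2(
    "$argon2i$v=19$m=65536,t=2,p=4$c29tZXNhbHQ$RdescudvJCsgt3ub+b+dWRWJTmaaJObG"
)
print(info)
```

The decoded `hash_hex` is the 45d7ac72... value from the first bullet, so a verifier needs nothing beyond the string itself.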

Suggestion: use : as the separator, and prefix the line with it, so the parser can test for : instead of X (as Argon2 does with $, unlike TLSH)


In the case of the corz .hash approach mentioned above, also consider a single per-file (or per-block) comment instead of a per-hash one, so the parser switches to that "mode" until the next marker. Anyway, I'm on the fence about hijacking comments for syntax-relevant stuff.

MestreLion avatar Mar 05 '23 02:03 MestreLion

The main problem is that a single program handles multiple hash algorithms. Users already "share" hash algorithms in some way. Maybe what we should provide is a split xxhsum-xyz family with the same API?

ghost avatar Jun 07 '23 11:06 ghost

Coming back to this topic: if a non-BSD output format for 64-bit XXH3 is still considered a useful feature to offer, I believe that the proposal to add an XXH3_ prefix to the hash value would solve all the problems listed in this thread. It would also be compatible with other properties requested in this thread, such as a dedicated xxh3sum symlink featuring an exclusive --check mode, and of course there would be no risk of misinterpreting the values as XXH64 ones.
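
The dedicated-symlink idea amounts to picking the default algorithm from the name the program was invoked under. A minimal Python sketch of that dispatch follows; the table and function name are illustrative assumptions, not a description of how xxhsum is actually implemented.

```python
import os

# Hypothetical mapping from invocation name to default algorithm.
DEFAULT_ALGO_BY_NAME = {
    "xxhsum": "XXH64",
    "xxh32sum": "XXH32",
    "xxh64sum": "XXH64",
    "xxh128sum": "XXH128",
    "xxh3sum": "XXH3",
}


def default_algo(argv0):
    """Pick the default algorithm from the program name (argv[0])."""
    name = os.path.basename(argv0)
    if name.lower().endswith(".exe"):
        name = name[:-4]
    return DEFAULT_ALGO_BY_NAME.get(name, "XXH64")


print(default_algo("/usr/bin/xxh3sum"))
print(default_algo("xxh128sum"))
```

With such a table, a file written by one name would be checked under the same name, which is how the other per-algorithm sum tools avoid tagging the hash at all.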

The feature would be relatively straightforward to implement and maintain, and could be ready in time for the v0.8.2 release.

Cyan4973 avatar Jul 15 '23 20:07 Cyan4973

In the interest of ensuring a prompt v0.8.2 release, I'd rather push this item to v0.8.3. It won't be any more difficult to do later, there is no urgency, and it will give time to collect comments, if need be.

Cyan4973 avatar Jul 16 '23 10:07 Cyan4973

I publish files and, based on feedback, I believe YAML syntax is missing from the table. It has a clear, extensible structure, uncluttered by JSON's brackets, and suits both eye scanning and machine parsing. Here is a sketch; let people familiar with the nuances bring it to fruition.

$ type checksums.yaml

---
# Created by <appname> <appversion> on <timestamp>.
files:
  - name: 1984-02-23 son.jxl
    size: 1768619
    xxh3: 73261bda0b305cd9
  - name: archive\1986-08-17 daughter.jxl
    size: 1515416
    xxh3: 2e59a6ffad11201d
...
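
One of the nuances is quoting. The Python sketch below emits the layout above from already-computed values, with a hypothetical function name. It writes the name and hash as double-quoted strings via `json.dumps`, on the assumption that JSON string syntax is acceptable to YAML parsers: that escapes the backslash in a Windows path and stops an all-digit hash from being read as a number.

```python
import json


def render_checksums_yaml(entries):
    """Render a list of {'name', 'size', 'xxh3'} dicts in the layout sketched above."""
    lines = ["---", "files:"]
    for entry in entries:
        # json.dumps adds double quotes and escapes backslashes and control characters.
        lines.append(f"  - name: {json.dumps(entry['name'])}")
        lines.append(f"    size: {entry['size']}")
        lines.append(f"    xxh3: {json.dumps(entry['xxh3'])}")
    lines.append("...")
    return "\n".join(lines) + "\n"


print(render_checksums_yaml([
    {"name": "1984-02-23 son.jxl", "size": 1768619, "xxh3": "73261bda0b305cd9"},
    {"name": "archive\\1986-08-17 daughter.jxl", "size": 1515416, "xxh3": "2e59a6ffad11201d"},
]))
```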

sergeevabc avatar Jan 14 '24 02:01 sergeevabc