xxHash
XXH3 output format, non-BSD style checksum lines
From reading other posts relating to XXH3, it seems you want XXH3 to be "different", so I am assuming that this is why the -H3 CLI switch specifically outputs BSD-style checksum lines.
It would be nice, though, to also have the option to use the "default" output style: hash on the left, name on the right. (That format is much more human-readable, at least to me, for my use case.)
(Not a big deal... But given there are performance improvements over XXH64, and I don't specifically need the longer hash of -H128, XXH3 in a readable format would fit the bill nicely.
fwiw, i5-3570K 3.40 GHz, 16 GB RAM, Win7 x64:
C:\BIN>xxh -b
xxh 0.8.0 by Yann Collet
compiled as 64-bit x86_64 autoVec little endian with Clang 10.0.0 (https://github.com/msys2/MINGW-packages.git 7211ffb882cc3b7e7583c518aad45a22b278bc81)
Sample of 100 KB...
- 1#XXH32 : 102400 -> 72110 it/s ( 7042.0 MB/s)
- 3#XXH64 : 102400 -> 143421 it/s (14006.0 MB/s)
- 5#XXH3_64b : 102400 -> 192579 it/s (18806.5 MB/s)
- 11#XXH128 : 102400 -> 192002 it/s (18750.2 MB/s)
C:\BIN>xxh -b
xxh 0.8.1 by Yann Collet
compiled as 64-bit x86_64 autoVec little endian with GCC 11.2.0
Sample of 100 KB...
- 1#XXH32 : 102400 -> 72149 it/s ( 7045.8 MB/s)
- 3#XXH64 : 102400 -> 144270 it/s (14088.9 MB/s)
- 5#XXH3_64b : 102400 -> 186771 it/s (18239.4 MB/s)
- 11#XXH128 : 102400 -> 186414 it/s (18204.5 MB/s)
fwiw2, i7-3770S 3.10 GHz, 8 GB RAM, Win7 x64:
xxh 0.8.0 by Yann Collet
compiled as 64-bit x86_64 autoVec little endian with Clang 10.0.0 (https://github.com/msys2/MINGW-packages.git 7211ffb882cc3b7e7583c518aad45a22b278bc81)
Sample of 100 KB...
- 1#XXH32 : 102400 -> 68098 it/s ( 6650.2 MB/s)
- 3#XXH64 : 102400 -> 135702 it/s (13252.1 MB/s)
- 5#XXH3_64b : 102400 -> 181519 it/s (17726.5 MB/s)
- 11#XXH128 : 102400 -> 180127 it/s (17590.5 MB/s)
xxh 0.8.1 by Yann Collet
compiled as 64-bit x86_64 autoVec little endian with GCC 11.2.0
Sample of 100 KB...
- 1#XXH32 : 102400 -> 67974 it/s ( 6638.1 MB/s)
- 3#XXH64 : 102400 -> 136765 it/s (13356.0 MB/s)
- 5#XXH3_64b : 102400 -> 173442 it/s (16937.7 MB/s)
- 11#XXH128 : 102400 -> 173938 it/s (16986.2 MB/s) )
Thank you.
The main issue here is to ensure there is no confusion with XXH64.
A format with "just" 16 hex characters on the left and the name on the right doesn't "tell" whether it's XXH64 or XXH3; they both look the same. This is a recipe for confusion, not just for xxhsum, but also for any downstream program using xxhsum.
Some option to ensure the "traditional" sum format would be fine. If someone deliberately specifies it, they know what they are doing. I always give hash sum files a descriptive extension so that I know in the future what they are.
$ xyzsum file > file.xyz
# for example:
$ sha1sum file > file.sha1
$ xxhash -H1 file > file.xxh64
Maybe more importantly, a sum file with the hash first and then the file name is easier to parse. What if the file name contains (, ) or =? You need a specialised parser to deal with this particular format.
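To make the parsing argument concrete, here is a minimal sketch of both parsers. The function names are hypothetical and this is not how xxhsum itself parses its input; it only assumes the two line shapes shown in this thread ("hash, separator, name" and "ALGO (name) = hash").

```python
def parse_hash_first(line):
    """Parse '<hex> <name>': everything after the first space is the name."""
    digest, _, name = line.rstrip("\n").partition(" ")
    # Assumption: a second space or a '*' (binary-mode marker) may precede the name.
    if name[:1] in (" ", "*"):
        name = name[1:]
    return digest, name


def parse_bsd_style(line):
    """Parse 'ALGO (<name>) = <hex>' even if the name contains '(', ')' or '='."""
    algo, rest = line.rstrip("\n").split(" (", 1)
    # Split from the right: the digest is plain hex, so the last ') = ' is the real one.
    name, digest = rest.rsplit(") = ", 1)
    return algo, name, digest
```

The BSD-style parser only works because it splits from the right; a naive left-to-right split on ")" or "=" would break on such file names, which is the point being made above.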
Reflecting back on this topic, I was wondering if a format like:
XXH3_1234567890123456 filename
would be suitable.
It would respect the left hash / right name format convention, and the output would be impossible to confuse with XXH64 output.
Having an XXH3_ prefix as part of the output is unusual (not present in md5sum, for example), but it seems to solve both requirements.
A parser reading the file would still need to recognize and discard the prefix, but that seems trivial to achieve.
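As a rough illustration of how little the prefix costs a parser, here is a sketch that accepts both the proposed prefixed line and a plain 16-digit line. The regular expression, the function name, and the decision to label unprefixed lines as XXH64 are assumptions made for this example, not xxhsum behaviour.

```python
import re

# Assumed line shape: optional "XXH3_" prefix, 16 hex digits, spaces, file name.
LINE_RE = re.compile(r"^(XXH3_)?([0-9a-fA-F]{16}) +(.+)$")


def parse_sum_line(line):
    """Return (algorithm, digest, file name) for one checksum line."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("unrecognized checksum line: %r" % line)
    prefix, digest, name = match.groups()
    # No prefix means the line cannot be XXH3 under this proposal.
    algo = "XXH3" if prefix else "XXH64"
    return algo, digest.lower(), name
```

With this shape, `parse_sum_line("XXH3_1234567890123456 filename")` yields an XXH3 entry, while the same line without the prefix is treated as XXH64.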
In my opinion there should be separate sum programs for xxh and xxh3. They are different hashes after all.
For example:
xxhsum <file> > file.xxh
xxh3sum <file> > file.xxh3
The sum files would be as simple as any other hash sums, and just as simple to understand and process: the hexadecimal sum first, then a separator (space), then the file name.
Adding additional "obfuscation" does nothing good.
I would follow unix philosophy here:
- Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".
- Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats.
Please consider the older VirusTotal/TLSH solution: TLSH adds an optional "T1" prefix to a hash (example). The "T" avoids confusion precisely because it is not a hex digit; the "1" is the version.
A similar "X3" tag would look better than the needlessly longer "XXH3_" tag. In my view it wouldn't make sense to add it with the BSD option. Of course, hashes without the tag should be accepted too, but that could require explicit options.
Sample:
XXH3 (filename) = 2d06800538d394c2
x32d06800538d394c2 filename
X32D06800538D394C2 filename
The utility must be able to optionally omit the tag ("not recommended") and, when called explicitly, accept hashes both with and without "X3". That said, Tansy's proposal is probably the better, cleaner way; if it is possible to do that instead, it would be great.
@Cyan4973 one option is to use corz .hash format, which just adds a comment before a hash. Here's a sample:
#md5#xxhsum.exe#[email protected]:28
2382c789ff66f780a54a0bd1025d0adb *xxhsum.exe
#sha1#xxhsum.exe#[email protected]:28
b51b24411585cc4cb2a95f9fa7ce1792224fa0a6 *xxhsum.exe
#xxh3#xxhsum.exe#[email protected]:28
bbca567006781b14 *xxhsum.exe
#xxh128#xxhsum.exe#[email protected]:28
917defb595e103a9bbca567006781b14 *xxhsum.exe
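For illustration, a reader for this layout could look like the sketch below. It assumes only what the sample shows: a comment line of the form `#algo#filename#timestamp` followed by a `hash *filename` line. The function name is hypothetical, and `<timestamp>` in the usage note is a placeholder.

```python
def parse_corz_hash(lines):
    """Return a list of (algorithm, digest, file name) tuples."""
    entries = []
    algo = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # '#md5#xxhsum.exe#...' splits into ['', 'md5', 'xxhsum.exe', ...]
            fields = line.split("#")
            algo = fields[1] or None
            continue
        digest, _, name = line.partition(" ")
        if name.startswith("*"):  # binary-mode marker before the name
            name = name[1:]
        entries.append((algo, digest, name))
        algo = None  # in this sketch each comment applies to one hash line only
    return entries
```

Feeding it the two lines `#xxh3#xxhsum.exe#<timestamp>` and `bbca567006781b14 *xxhsum.exe` returns one entry tagged `xxh3`, so the algorithm travels with the hash without changing the hash line itself.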
My 2 cents on this, as a mere (but interested) user:
The main issue here is to ensure there is no confusion with XXH64. A format with "just" 16 hexa characters on the left, and name of the right, doesn't "tell" if it's XXH64 or XXH3, they both look the same. This is a recipe for confusion, not just for xxhsum, but also for any downstream program using xxhsum.
This should not be a major concern: the same "confusion" also happens between md5sum and xxh128sum and no one cares, as the user is expected to know which program (and parameters) were used to generate a given data output (using the file extension, for example, as cited by @tansy). Of course, if the ambiguity is easy to avoid then great; otherwise don't bother.
The hash path format is much better and preferable, if only for this:
Maybe more importantly, sum-file with hash first, then file name is easier to parse. What if file name contains (, ) or =? You need specialised parser to deal with this particular format.
It should be the default format, but I realize this might be too late for such API-breaking changes.
I liked the "hash prefix" idea, if that's handled automatically by xxhsum's parser. That's the approach taken by the Argon2 crypto suite, and possibly many others:
- Given the hash 45d7ac72e76f242b20b77b9bf9bf9d5915894e669a24e6c6,
- It gets encoded as $argon2i$v=19$m=65536,t=2,p=4$c29tZXNhbHQ$RdescudvJCsgt3ub+b+dWRWJTmaaJObG
So the encoded form contains the algo, its parameters, salt, and (base64'ed) hash itself in a single string. I personally dislike $ as a separator, but the idea of outputting everything --verify needs to know is great.
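To show what "everything in a single string" means in practice, the sketch below takes the Argon2 example apart again using only the Python standard library. The function name is hypothetical; the only assumptions are the `$`-separated field order visible in the example and that the base64 fields are stored without padding.

```python
import base64


def split_argon2_encoded(encoded):
    """Split an Argon2 encoded string into its self-describing parts."""
    # The leading '$' produces an empty first field.
    _, algo, version, params, salt_b64, hash_b64 = encoded.split("$")

    def unpadded_b64decode(text):
        # Restore the '=' padding that the encoded form omits.
        return base64.b64decode(text + "=" * (-len(text) % 4))

    return {
        "algo": algo,
        "version": int(version.partition("=")[2]),
        "params": {k: int(v) for k, v in (p.split("=") for p in params.split(","))},
        "salt": unpadded_b64decode(salt_b64),
        "hash": unpadded_b64decode(hash_b64),
    }
```

Applied to the encoded string above, the `hash` field decodes back to the hex value 45d7ac72e76f242b20b77b9bf9bf9d5915894e669a24e6c6, so a verifier needs nothing beyond the line itself.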
Suggestion: use : as the separator, and prefix the line with it, so the parser can test for : instead of X (as Argon does, unlike TLSH).
In case of the mentioned corz .hash approach, also consider a single per-file (or per-block) comment instead of per-hash, so the parser will switch to that "mode" until the next marker. Anyway, I'm on the fence about hijacking comments for syntax-relevant stuff.
The main problem is that a single program handles multiple hash algorithms.
Users already "share" hash algorithms in some way.
Maybe what we should provide is a split xxhsum-xyz family with the same API?
Coming back on this topic, if a non-BSD output format for 64-bit XXH3 is still considered a useful feature to offer, I believe that the proposal to add an XXH3_ prefix to the hash value would solve all the problems listed in this thread.
It would also be compatible with other properties requested in this thread, such as a dedicated xxh3sum symlink featuring an exclusive --check mode, and of course there would be no risk of misinterpreting the values as xxh64 ones.
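The symlink idea amounts to picking the algorithm from the name the program was invoked under. A minimal sketch of that dispatch follows; the table, the function name, and the XXH64 fallback are illustrative assumptions, not the actual xxhsum implementation.

```python
import os

# Hypothetical mapping from invoked program name to default algorithm.
ALGO_BY_PROGRAM_NAME = {
    "xxh32sum": "XXH32",
    "xxh64sum": "XXH64",
    "xxh128sum": "XXH128",
    "xxh3sum": "XXH3",
}


def algo_from_program_name(argv0, default="XXH64"):
    """Choose the default algorithm from how the program was called."""
    name = os.path.basename(argv0).lower()
    if name.endswith(".exe"):  # tolerate Windows-style executable names
        name = name[:-4]
    return ALGO_BY_PROGRAM_NAME.get(name, default)
```

A program would call `algo_from_program_name(sys.argv[0])` at startup; invoked through an `xxh3sum` symlink it would both write and check XXH3 sums by default.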
The feature would be relatively straightforward to implement and maintain, and could be ready in time for the v0.8.2 release.
In the interest of ensuring a prompt v0.8.2 release, I'd rather push this item to v0.8.3.
It's not more difficult to do, but there is no urgency, and it'll give time to collect comments, if need be.
I publish files, and based on feedback I believe YAML syntax is missing from the table. It has a clear, extensible structure, uncluttered by JSON's brackets and braces, suitable for both eye scanning and machine parsing. Here is a sketch; let people familiar with the nuances bring it to fruition.
$ type checksums.yaml
---
# Created by <appname> <appversion> on <timestamp>.
files:
- name: 1984-02-23 son.jxl
  size: 1768619
  xxh3: 73261bda0b305cd9
- name: archive\1986-08-17 daughter.jxl
  size: 1515416
  xxh3: 2e59a6ffad11201d
...