coreutils
cksum: failing GNU tests
The latest version of GNU's tests adds tests/misc/cksum, tests/misc/cksum-a, tests/misc/cksum-c, and tests/misc/sm3sum, all of which currently fail. This issue tracks the respective problems with each.
uu_cksum correctly implements cksum 8.30, which only has --help and --version. However, the GNU Coreutils 9.0 docs for cksum list the following options:
- [x] -a/--algorithm
- [ ] --debug
- [x] --untagged
- All options supported by b2sum:
  - [ ] -l/--length
- All options supported by md5sum:
  - ~~-b/--binary~~ (unsupported for cksum, always on)
  - [x] -c/--check
    - [ ] cksum auto-detects the algorithm if given data in the --tag format.
    - [ ] cksum does not support --check with -a sysv, -a bsd, or -a crc.
  - [ ] --ignore-missing
  - [x] --quiet
  - [x] --status
  - [x] --tag
    - [ ] cksum uses this as its default output format (use --untagged for old format).
  - ~~-t/--text~~ (unsupported for cksum, always off)
  - [x] -w/--warn
  - [x] --strict
  - [x] -z/--zero
So uu_cksum is correct for Coreutils 8.30 compatibility, but for Coreutils 9.0 compatibility, we need to flesh it out.
Investigating the codebase a bit further, I see that both md5sum and b2sum are implemented via hashsum. This suggests we should make cksum also be implemented by hashsum. I've checked off above all the options currently implemented by hashsum.
Looks like tests/misc/b2sum is also failing, both because hashsum doesn't support -l/--length, and because cksum doesn't support -a blake2b.
I'm going to try implementing this.
I made a bunch of progress on this and then ran out of time to finish the PRs. Picking this back up now; I'll rebase my current local work and get some initial PRs up.
Ugh, this is annoying: b3sum was added in https://github.com/uutils/coreutils/pull/3108, and its corresponding upstream binary defines -l/--length in bytes, whereas GNU Coreutils b2sum defines -l/--length in bits.
There are three options here:
- Define -l/--length as bytes to match upstream b3sum. Given that compatibility with GNU Coreutils is being prioritised (per this issue), I think that would not be considered acceptable as b2sum would then never pass GNU tests.
- Define -l/--length as bits to match GNU Coreutils. I really dislike this option, for two reasons:
  - --length being bits means it needs to be a multiple of 8, so 7/8 of values that upstream b3sum accepts will be rejected here. That's not so bad, it makes the difference discoverable.
  - In the cases where an upstream length is a multiple of 8, b3sum here would output 1/8 of the length, e.g. instead of a 16-byte hash the user gets a 16-bit hash (2 bytes). This is a TERRIBLE silent security issue.
- Define -l/--length as bits when used with b2sum or cksum, and bytes when used with b3sum. This is awful for UX; in particular, it is undefined how hashsum should treat it. The least-bad option might be to just special-case the byte interpretation to b3sum, and not allow it to be used with -a blake3 on hashsum (otherwise hashsum would almost certainly end up conflicting with cksum if the latter adds BLAKE3 support).
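To make the risk in the second option concrete, here is a minimal Python sketch. The function name `digest_bytes_for` and its `unit` parameter are hypothetical and are not part of hashsum, GNU b2sum, or upstream b3sum; the sketch only models how the same numeric argument maps to an output size under each interpretation.

```python
def digest_bytes_for(length: int, unit: str) -> int:
    """Return the digest size in bytes for a -l/--length value.

    `unit` is a hypothetical switch: "bits" models the GNU b2sum
    interpretation, "bytes" models the upstream b3sum interpretation.
    """
    if length <= 0:
        # Simplification for this sketch: only positive lengths are modelled.
        raise ValueError("length must be positive")
    if unit == "bytes":
        return length
    if unit == "bits":
        if length % 8 != 0:
            # 7 of every 8 values end up here, so the mismatch is discoverable.
            raise ValueError("length in bits must be a multiple of 8")
        return length // 8
    raise ValueError(f"unknown unit: {unit!r}")


# A user who expects upstream b3sum semantics asks for 16 bytes...
upstream = digest_bytes_for(16, "bytes")
# ...but under the bits interpretation the same argument silently shrinks.
gnu_style = digest_bytes_for(16, "bits")
print(upstream, gnu_style)
```

The rejected case (a length that is not a multiple of 8) fails loudly, whereas the accepted case quietly yields a digest one eighth of the size the user asked for, which is the silent failure described above.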
We have a somewhat-similar issue in that hashsum itself defines a separate --bits flag for its SHA3 and SHAKE support, which is effectively duplicative of GNU Coreutils -l/--length. I'm not going to try to solve that here.
cc @oconnor663 who wrote b3sum
While I really want GNU compatibility, we do work around the upstream test suite from time to time. See https://github.com/uutils/coreutils/blob/main/util/build-gnu.sh#L121= and below.
Note that the surprising behaviour you mentioned in option 2 happens not only between b3sum implementations but also between different hashing algorithms. I.e. if I use the external b3sum first and then (GNU) b2sum, the lengths do not match either. Hence, I think we should bring back consistency and build what a hypothetical GNU b3sum would look like, not what another b3sum already does, to ensure compatibility/consistency with GNU and with the other utils in uutils.
I like the first option too, but it would indeed break GNU compatibility, so like Sylvestre mentioned, we'd have to work around the tests. Not to mention that it could be surprising for GNU users (although we'd err on the secure side).
Yeah, the inconsistency blows. Here are some scattered thoughts about why it ended up the way it is:
- Unfortunately, different BLAKE2 interfaces have been inconsistent for a long time. CLIs and written naming conventions tend to use bits, but Python's hashlib.blake2b API takes the digest_size parameter in bytes. Similarly, C APIs usually take a size_t somewhere that counts bytes.
- Measuring BLAKE2 output in bits makes sense from a security perspective. The security properties of BLAKE2b-512 and BLAKE2b-256 mirror those of SHA-512 and SHA-256. However, this relationship relies somewhat on BLAKE2b not allowing output sizes larger than its internal state size. For example, b2sum -l 1024 will just error out. The same doesn't apply to BLAKE3, though, which has a built-in XOF that'll give you as many output bytes as you like. "BLAKE3-512" (if we were to call it that) does not provide more security than the default 256-bit BLAKE3. I find this is a common source of confusion, so I discourage naming BLAKE3 instances in terms of bits, even though it can make sense on the small end.
- Another way BLAKE2 is similar to SHA-2 is that BLAKE2b-512 and BLAKE2b-256 are independent ("domain separated") hash functions. Changing the output length gives you a completely unrecognizable hash. This is a natural design for those functions, since they can mix the output length into state initialization one way or another. But for an XOF like BLAKE3, it's pretty common for the output length to be "who knows, we're gonna use this as a stream cipher", so domain-separating outputs of different lengths isn't always possible. Rather than having two modes for this, as BLAKE2X did, in BLAKE3 shorter outputs are always prefixes of longer ones. This is another reason why I discourage bit-counting names like "BLAKE3-512", to avoid implying that variants with different names are independent of each other.
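The points above can be illustrated with Python's standard hashlib. This is only a sketch: BLAKE3 is not in the standard library, so SHAKE256 is used here as a stand-in XOF to show the prefix behaviour, and the input bytes are arbitrary.

```python
import hashlib

data = b"example input"

# hashlib.blake2b takes digest_size in bytes (1 to 64), not bits.
b2_256 = hashlib.blake2b(data, digest_size=32).hexdigest()
b2_512 = hashlib.blake2b(data, digest_size=64).hexdigest()

# BLAKE2b mixes the output length into its initial state, so the shorter
# digest is not a prefix of the longer one.
print("BLAKE2b prefix:", b2_512.startswith(b2_256))

# BLAKE2b cannot produce more output than its 64-byte state allows.
try:
    hashlib.blake2b(data, digest_size=128)
except ValueError as err:
    print("rejected:", err)

# SHAKE256 stands in for an XOF (assumption: used only because BLAKE3 is
# unavailable in hashlib). Shorter outputs are prefixes of longer ones.
xof_short = hashlib.shake_256(data).hexdigest(16)
xof_long = hashlib.shake_256(data).hexdigest(32)
print("XOF prefix:", xof_long.startswith(xof_short))
```

The first check shows domain separation by output length, the second shows the hard upper bound on BLAKE2b output, and the third shows why a length argument on an XOF behaves like truncation rather than selecting a different hash function.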
Which is all to say, that when there's a choice between counting bits and counting bytes, BLAKE3 pretty strongly prefers to count bytes. It keeps CLIs and library APIs consistent with each other, and it avoids some common points of confusion. But sometimes there isn't a choice, like our situation here, and then we suffer some inconsistency.
Interesting! That does make it clearer as to why b3sum is different. Maybe a solution could be to make a split API. So b2sum (and the other ones GNU supports) would accept a --bits option and -l would work but be deprecated/hidden. b3sum would accept a --bytes option and using the wrong argument would result in an error. However, a downside to this is that we break compat with both GNU and b3sum, although we could still accept the length argument.
If you all decide to do something like --bytes, it might be reasonable to add a similar option upstream on b3sum, so that there's at least something we could tell people to do that'll work (or give a clear error message) everywhere?
Current coreutils b3sum generates hash *filename, but the original implementation by @oconnor663 uses a second space instead of an asterisk and refuses to check hashes if that character is missing, as follows:
$ type B3SUMS
79b4c2c09d6f3ac0beee530cb69262f81ce303144620728e8c98044dd543ceb2 *out1.png
7febfd4fd14642e0f70c50dd096ec684f6d05afce5d606aeb157ec93a96890ed *out2.png
17a7f68466e2b437042a5b2ebad2b3abde27f2b83d0c57358a2cbfbb31cf05ed *out3.png
$ b3sum -c B3SUMS
b3sum: Invalid space
b3sum: WARNING: 3 computed checksums did NOT match
We need to either discuss it or bring this part of coreutils into line.
P.S. I already have a dozen BLAKE3 implementations, mostly downloaded from GitHub, and the output format differs between them. The original one also refuses to hash lists with CRLF line breaks, typical for Windows (such lists are generated, for example, by RapidCRC). Obviously, this UX mess makes it difficult to replace SHA256.
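One way a checker could tolerate both layouts is sketched below in Python. The function `parse_checksum_line` is hypothetical and does not reflect how uutils or upstream b3sum parse their input. It assumes untagged lines made of a hex digest, one space, a mode character (a space or an asterisk), and then the file name, and it strips a trailing CR so that CRLF lists are accepted.

```python
import re

# Hex digest, one space, then a mode character: ' ' or '*', then the name.
_LINE = re.compile(r"^([0-9a-fA-F]+) ([ *])(.+)$")


def parse_checksum_line(line: str):
    """Return (digest, binary_flag, filename), or None for a malformed line."""
    # Accept both LF and CRLF line endings.
    line = line.rstrip("\r\n")
    match = _LINE.match(line)
    if match is None:
        return None
    digest, mode, filename = match.groups()
    return digest.lower(), mode == "*", filename


digest = "79b4c2c09d6f3ac0beee530cb69262f81ce303144620728e8c98044dd543ceb2"
# Asterisk form with a Windows line ending.
print(parse_checksum_line(digest + " *out1.png\r\n"))
# Two-space form with a Unix line ending.
print(parse_checksum_line(digest + "  out1.png\n"))
```

Both calls yield the same digest and file name and differ only in the mode flag, so a checker built this way would accept lists produced by either tool.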
When, among partners, concord there is not.
Successful issues scarce are got
And the result is loss, disaster and repining.
A crayfish, swan and pike combining.
Resolve to draw a cart and freight;
In harness soon, their efforts ne'er abate.
However much they work, the load to stir refuses.
It seems to be perverse with selfwill vast endowed;
The swan makes upward for a cloud.
The crayfish falls behind, the pike the river uses;
To judge of each one's merit lies beyond my will;
I know the cart remains there, still.
Ivan Krylov, “Crayfish, Swan and Pike” (1814), translated from Russian by Charles Fillingham Coxwell.