coreutils cksum: failing GNU tests

The latest version of GNU's tests add tests/misc/cksum, tests/misc/cksum-a, tests/misc/cksum-c, and tests/misc/sm3sum, all of which currently fail. This will track the respective problems from each.

Jan 27 '22 18:01 jtracey

uu_cksum correctly implements cksum 8.30, which only has --help and --version. However, the GNU Coreutils 9.0 docs for cksum list the following options:

[x] -a/--algorithm
[ ] --debug
[x] --untagged
All options supported by b2sum:
- [ ] -l/--length
- All options supported by md5sum:
  - ~~-b/--binary~~ (unsupported for cksum, always on)
  - [x] -c/--check
    - [ ] cksum auto-detects the algorithm if given data in the --tag format.
    - [ ] cksum does not support --check with -a sysv, -a bsd, or -a crc.
  - [ ] --ignore-missing
  - [x] --quiet
  - [x] --status
  - [x] --tag
    - [ ] cksum uses this as its default output format (use --untagged for old format).
  - ~~-t/--text~~ (unsupported for cksum, always off)
  - [x] -w/--warn
  - [x] --strict
  - [x] -z/--zero

So uu_cksum is correct for Coreutils 8.30 compatibility, but for Coreutils 9.0 compatibility, we need to flesh it out.

Jan 29 '22 17:01 str4d

Investigating the codebase a bit further, I see that both md5sum and b2sum are implemented via hashsum. This suggests we should make cksum also be implemented by hashsum. I've checked off above all the options currently implemented by hashsum.

Jan 29 '22 17:01 str4d

Looks like test/misc/b2sum is also failing, both because hashsum doesn't support -l/--length, and cksum doesn't support -a blake2b.

I'm going to try implementing this.

Jan 29 '22 17:01 str4d

I made a bunch of progress on this and then ran out of time to finish the PRs. Picking this back up now; I'll rebase my current local work and get some initial PRs up.

Jun 04 '22 22:06 str4d

Ugh, this is annoying: b3sum was added in https://github.com/uutils/coreutils/pull/3108, and its corresponding upstream binary defines -l/--length in bytes, whereas GNU Coreutils b2sum defines -l/--length in bits.

There are three options here:

Define -l/--length as bytes to match upstream b3sum. Given that compatibility with GNU Coreutils is being prioritised (per this issue), I think that would not be considered acceptable as b2sum would then never pass GNU tests.
Define -l/--length as bits to match GNU Coreutils. I really dislike this option, for two reasons:
- --length being bits means it needs to be a multiple of 8, so 7/8 of values that upstream b3sum accepts will be rejected here. That's not so bad, it makes the difference discoverable.
- In the cases where an upstream length is a multiple of 8, b3sum here would output 1/8 of the length, e.g. instead of a 16-byte hash the user gets a 16-bit hash (2 bytes). This is a TERRIBLE silent security issue.
Define -l/--length as bits when used with b2sum or cksum, and bytes when used with b3sum. This is awful for UX; in particular, it is undefined how hashsum should treat it. The least-bad option might be to just special-case the byte interpretation to b3sum, and not allow it to be used with -a blake3 on hashsum (otherwise hashsum would almost certainly end up conflicting with cksum if the latter adds BLAKE3 support).

We have a somewhat-similar issue in that hashsum itself defines a separate --bits flag for its SHA3 and SHAKE support, which is effectively duplicative of GNU Coreutils -l/--length. I'm not going to try solve that here.

Jun 05 '22 00:06 str4d

cc @oconnor663 who wrote b3sum

Jun 05 '22 00:06 str4d

While I really want GNU's compatibility, we workaround with upstream testsuite from time to time. See https://github.com/uutils/coreutils/blob/main/util/build-gnu.sh#L121= and below

Jun 05 '22 07:06 sylvestre

Note that the suprising behaviour you mentioned in option 2 not only happens between b3sum implementations but also between different hashing algorithms. I.e. if I use the external b3sum first and then (GNU) b2sum, then the length does not match either. Hence, I think we bring back the consistency and should build what a hypothetical GNU b3sum would look like, not what another b3sum already does, to ensure compatibility/consistency with GNU and with the other utils in uutils.

I like the first option too, but it would indeed break GNU compatibility, so like Sylvestre mentioned, we'd have to work around the tests. Not to mention that it could be surprising for GNU users (although we'd err on the secure side).

Jun 05 '22 07:06 tertsdiepraam

Yeah, the inconsistency blows. Here are some scattered thoughts about why it ended up the way it is:

Unfortunately, different BLAKE2 interfaces have been inconsistent for a long time. CLIs and written naming conventions tend to use bits, but Python's hashlib.blake2b API takes the digest_size parameter in bytes. Similarly, C APIs usually take a size_t somewhere that counts bytes.
Measuring BLAKE2 output in bits makes sense from a security perspective. The security properties of BLAKE2b-512 and BLAKE2b-256 mirror those of SHA-512 and SHA-256. However, this relationship relies somewhat on BLAKE2b not allowing output sizes larger than its internal state size. For example, b2sum -l 1024 will just error out. The same doesn't apply to BLAKE3, though, which has a built-in XOF that'll give you as many output bytes as you like. "BLAKE3-512" (if we were to call it that) does not provide more security than the default 256-bit BLAKE3. I find this is a common source of confusion, so I discourage naming BLAKE3 instances in terms of bits, even though it can make sense on the small end.
Another way BLAKE2 is similar to SHA-2 is that BLAKE2b-512 and BLAKE2b-256 are independent ("domain separated") hash functions. Changing the output length gives you a completely unrecognizable hash. This is a natural design for those functions, since they can mix the output length into state initialization one way or another. But for an XOF like BLAKE3, it's pretty common for the output length to be "who knows, we're gonna use this as a stream cipher", so domain-separating outputs of different lengths isn't always possible. Rather than having two modes for this, as BLAKE2X did, in BLAKE3 shorter outputs are always prefixes of longer ones. This is another reason why I discourage bit-counting names like "BLAKE3-512", to avoid implying that variants with different names are independent of each other.

Which is all to say, that when there's a choice between counting bits and counting bytes, BLAKE3 pretty strongly prefers to count bytes. It keeps CLIs and library APIs consistent with each other, and it avoids some common points of confusion. But sometimes there isn't a choice, like our situation here, and then we suffer some inconsistency.

Jun 05 '22 18:06 oconnor663

Interesting! That does make it clearer as to why b3sum is different. Maybe a solution could be to make a split API. So b2sum (and the other ones GNU supports) would accept a --bits option and -l would work but be deprecated/hidden. b3sum would accept a --bytes option and using the wrong argument would result in an error. However, a downside to this is that we break compat with both GNU and b3sum, although we could still accept the length argument.

Jun 05 '22 21:06 tertsdiepraam

If you all decide to do something like --bytes, it might be reasonable to add a similar option upstream on b3sum, so that there's at least something we could tell people to do that'll work (or give a clear error message) everywhere?

Jun 05 '22 21:06 oconnor663

Current coreutils b3sum generates hash *filename, but the original implementation by @oconnor663 uses a second space instead of an asterisk and refuses to check hashes if such character is missing as follows

$ type B3SUMS
79b4c2c09d6f3ac0beee530cb69262f81ce303144620728e8c98044dd543ceb2 *out1.png
7febfd4fd14642e0f70c50dd096ec684f6d05afce5d606aeb157ec93a96890ed *out2.png
17a7f68466e2b437042a5b2ebad2b3abde27f2b83d0c57358a2cbfbb31cf05ed *out3.png

$ b3sum -c B3SUMS
b3sum: Invalid space
b3sum: WARNING: 3 computed checksums did NOT match

We need to either discuss it or bring this part of coreutils into line.

P.S. I already have a dozen BLAKE3 implementations, mostly downloaded from Github, and the output format is different between them. The original one also refuses to hash lists with CRLF line breaks, typical for Windows (such lists are generated, for example, by RapidCRC). Obviously, this UX mess creates a difficulty for replacing SHA256.

Click to broaden your horizons even further (and to see a SFW picture!)…

2024-0105-0309 crayfish swan pike

When, among partners, concord there is not. Successful issues scarce are got And the result is loss, disaster and repining.

A crayfish, swan and pike combining. Resolve to draw a cart and freight; In harness soon, their efforts ne'er abate. However much they work, the load to stir refuses. It seems to be perverse with selfwill vast endowed; The swan makes upward for a cloud. The crayfish falls behind, the pike the river uses; To judge of each one's merit lies beyond my will; I know the cart remains there, still.

— Ivan Krylov, “Crayfish, Swan and Pike” (1814), translated from Russian by Charles Fillingham Coxwell.

Jan 04 '24 23:01 sergeevabc