cksum: Reverse-engineering implicit behavior of text/binary tag/untagged

Open BenWiederhake opened this issue 1 year ago • 1 comments

cksum has some really weird and funky implicit tags going on, see https://github.com/uutils/coreutils/pull/6256

So let's figure out what exactly cksum is doing.

$ ../gnu/src/cksum -a md5 --tag README.md # This is the tagged format:
MD5 (README.md) = add2d697731ef0facc3a56207aa03a9b
$ ../gnu/src/cksum -a md5 README.md # tagged by default:
MD5 (README.md) = add2d697731ef0facc3a56207aa03a9b
$ ../gnu/src/cksum -a md5 --text README.md # tagged+text is a problem:
../gnu/src/cksum: --text mode is only supported with --untagged
Try '../gnu/src/cksum --help' for more information.
[$? = 1]
$ ../gnu/src/cksum -a md5 --text --tag README.md # tagged+text is not a problem?!
MD5 (README.md) = add2d697731ef0facc3a56207aa03a9b

So yes, something funny is going on. Let's just brute-force all 1024 + 256 + 64 + 16 + 4 + 1 = 1365 possible sequences of zero to five arguments (each one of --binary, --text, --tag, --untagged), and visualize the behavior as a graph (figure: general_nondet_graph).

(legend: edges are marked b/t/T/U for binary/text/Tag/Untagged, and vertices are the observed behavior: E/T/A/S for Error/Tagged/UntaggedSpace/UntaggedAsterisk)
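
A rough sketch of how such a brute-force run could look. The function names are made up for this sketch, the cksum path and file name are the ones from the transcript above, and the untagged line format (digest, a space, then a space or an asterisk before the filename) is assumed to be the usual md5sum-style one:

```python
import itertools
import re
import subprocess

# Letters as in the legend: b/t/T/U.
FLAGS = {"b": "--binary", "t": "--text", "T": "--tag", "U": "--untagged"}


def flag_sequences(alphabet="btTU", max_len=5):
    """Yield every sequence of 0..max_len flag letters, shortest first."""
    for n in range(max_len + 1):
        for combo in itertools.product(alphabet, repeat=n):
            yield "".join(combo)


def classify(returncode, stdout):
    """Map one cksum run to E/T/A/S (Error/Tagged/UntaggedAsterisk/UntaggedSpace)."""
    if returncode != 0:
        return "E"
    lines = stdout.splitlines()
    if not lines:
        raise ValueError("no output to classify")
    line = lines[0]
    # Tagged format, e.g. "MD5 (README.md) = add2d6..."
    if re.match(r"^[A-Z0-9-]+ \(.*\) = [0-9a-f]+$", line):
        return "T"
    # Untagged format (assumed): "<digest> <marker><file>", marker is ' ' or '*'
    m = re.match(r"^[0-9a-f]+ ([ *])", line)
    if m is None:
        raise ValueError(f"unrecognized output: {line!r}")
    return "A" if m.group(1) == "*" else "S"


def observe(seq, cksum="../gnu/src/cksum", path="README.md"):
    """Run cksum with the flags named by `seq` and classify the result."""
    args = [cksum, "-a", "md5"] + [FLAGS[c] for c in seq] + [path]
    proc = subprocess.run(args, capture_output=True, text=True)
    return classify(proc.returncode, proc.stdout)
```

Passing a shorter alphabet such as "btT" to flag_sequences gives the restricted searches used further down.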

First, observe that -b/-t seem to be doing precisely what we would hope for: switch between binary and text mode. Good!

Next, observe that --tag/--untagged seem to be the flags that have the weird behavior attached to them. In particular, the T state seems to be more than one actual state, probably differentiated along the "text-binary-axis".

Removing --untagged from the brute-force search reveals that --tag always pulls the state in the binary direction (figure: nountagged_nondet_graph).

Removing --binary from the brute-force search reveals that --untagged always pulls the state away from E (so a binary-ish direction), but A is unreachable ("Asterisk", which indicates a binary file in the untagged format) (figure: nobinary_nondet_graph).
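
For illustration, this is one way the edge set of such a graph could be derived, including the restricted variants. It is a guess at the construction, not taken from the issue: a vertex is an observed behavior, and appending one flag to a sequence gives an edge from the old behavior to the new one. The oracle argument is a hypothetical callable that maps a flag-letter string to E/T/A/S.

```python
import itertools


def behavior_graph(oracle, alphabet="btTU", max_len=5):
    """Return the set of (behavior_before, flag, behavior_after) edges.

    `oracle` maps a flag-letter string such as "TbU" to one behavior letter.
    Only sequences of up to `max_len` flags are ever passed to it.
    """
    edges = set()
    for n in range(max_len):  # prefixes of length 0 .. max_len - 1
        for combo in itertools.product(alphabet, repeat=n):
            prefix = "".join(combo)
            before = oracle(prefix)
            for flag in alphabet:
                edges.add((before, flag, oracle(prefix + flag)))
    return edges


def nondeterministic_pairs(edges):
    """Return the (vertex, flag) pairs that lead to more than one vertex."""
    targets = {}
    for before, flag, after in edges:
        targets.setdefault((before, flag), set()).add(after)
    return {key for key, dsts in targets.items() if len(dsts) > 1}
```

Calling behavior_graph(oracle, alphabet="btT") or behavior_graph(oracle, alphabet="tTU") would correspond to dropping --untagged or --binary from the search. A vertex that shows up in nondeterministic_pairs is hiding more than one internal state, which is the kind of hint used for the T state above.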

Hypothesis: There are three steps along the "text-binary-axis": always-binary, always-text, and binary-ish. For simplicity, let's assume the same thing along the tagged-ness-axis.

By the previous observations, --tag implies either always-binary or binary-ish. (Probably "binary-ish".)

Ending in bU does not determine the result:

  • bU outputs A
  • TbU outputs S
  • UbU outputs A
  • bUbU outputs A
  • TUbU outputs A
  • UTbU outputs S

Therefore, U does not set the binary-ness to a constant; its effect depends on the tagged-ness. Huh?

Assuming that we start with "tagged-ish" and that T/U set "always-tagged"/"always-untagged", this means that U leaves the binary-ness alone in the "tagged-ish" and "always-untagged" states, but resets it to "binary-ish" in the "always-tagged" state. What a surprising decision! (It probably made sense at the time it was written, and is probably also why it is no longer listed in --help.)

… and that finally predicts the correct behavior without any exceptions, hooray!
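
A minimal sketch of a state machine along these lines. The variable names and exact rules are assumptions of this sketch, and it is not necessarily identical to the linked check_model.py. It has only been checked against the cases quoted in this issue, not against the full brute-force run.

```python
def predict(seq):
    """Predict E/T/A/S for a string of flag letters (b, t, T, U)."""
    binary = None  # None = not pinned, True = --binary, False = --text
    tagged = None  # None = default ("tagged-ish"), True = --tag, False = --untagged
    for flag in seq:
        if flag == "b":
            binary = True
        elif flag == "t":
            binary = False
        elif flag == "T":
            tagged = True
            # --tag pulls toward binary; using None here instead of True
            # would give the same predictions in this sketch.
            binary = True
        elif flag == "U":
            if tagged is True:
                # Only an explicit earlier --tag makes --untagged
                # forget the binary-ness.
                binary = None
            tagged = False
        else:
            raise ValueError(f"unknown flag letter: {flag!r}")

    if tagged is None or tagged:
        # Tagged output (the default); explicit text mode is an error here.
        return "E" if binary is False else "T"
    # Untagged output: asterisk only if binary mode is pinned.
    return "A" if binary is True else "S"


# The "ending in bU" cases from the list above:
for seq in ["bU", "TbU", "UbU", "bUbU", "TUbU", "UTbU"]:
    print(seq, predict(seq))
```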

A simple piece of logic, but so much pain.

End result: https://github.com/BenWiederhake/worsethanfailure_cksum/blob/master/check_model.py#L19

BenWiederhake, May 06 '24 00:05

ah fun, exactly what I was working on yesterday. :) you will make life significantly easier

sylvestre, May 06 '24 08:05