cksum: Reverse-engineering implicit behavior of text/binary tag/untagged
cksum has some really weird and funky implicit tags going on, see https://github.com/uutils/coreutils/pull/6256
So let's figure out what exactly cksum is doing.
$ ../gnu/src/cksum -a md5 --tag README.md # This is the tagged format:
MD5 (README.md) = add2d697731ef0facc3a56207aa03a9b
$ ../gnu/src/cksum -a md5 README.md # tagged by default:
MD5 (README.md) = add2d697731ef0facc3a56207aa03a9b
$ ../gnu/src/cksum -a md5 --text README.md # tagged+text is a problem:
../gnu/src/cksum: --text mode is only supported with --untagged
Try '../gnu/src/cksum --help' for more information.
[$? = 1]
$ ../gnu/src/cksum -a md5 --text --tag README.md # tagged+text is not a problem?!
MD5 (README.md) = add2d697731ef0facc3a56207aa03a9b
So yes, something funny is going on. Let's just brute-force all possible 1024 + 256 + 64 + 16 + 4 + 1 combinations of zero to five arguments (--binary, --text, --tag, --untagged), and visualize the behavior as a graph:
(legend: edges are marked b/t/T/U for binary/text/Tag/Untagged, and vertices are the observed behavior: E/T/S/A for Error/Tagged/UntaggedSpace/UntaggedAsterisk)
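For reference, the brute-force enumeration and the E/T/S/A classification can be sketched roughly as follows. This is only an illustration, not the script that produced the graphs: the helper names (`all_sequences`, `to_argv`, `classify_output`) are made up here, the default `cksum` path and `README.md` are placeholders, and the classifier assumes the md5sum-style untagged format (`digest`, a space, then a space or `*`, then the file name).

```python
import itertools

# Edge letters from the legend, mapped to the actual command-line flags.
FLAGS = {"b": "--binary", "t": "--text", "T": "--tag", "U": "--untagged"}


def all_sequences(max_len=5):
    """Yield every flag sequence of length 0..max_len as a string like 'TbU'."""
    for n in range(max_len + 1):
        for combo in itertools.product(FLAGS, repeat=n):
            yield "".join(combo)


def to_argv(seq, cksum="cksum", path="README.md"):
    """Build the argument vector for one sequence (paths are placeholders)."""
    return [cksum, "-a", "md5", *(FLAGS[c] for c in seq), path]


def classify_output(returncode, stdout):
    """Map one cksum run to a vertex letter: E, T, S or A."""
    if returncode != 0:
        return "E"
    line = stdout.splitlines()[0]
    if line.startswith("MD5 ("):
        return "T"
    # Untagged: "<digest> <marker><name>", marker is ' ' (text) or '*' (binary).
    _digest, rest = line.split(" ", 1)
    return "A" if rest.startswith("*") else "S"


sequences = list(all_sequences())
print(len(sequences))  # 1365 = 1024 + 256 + 64 + 16 + 4 + 1
```

Each `to_argv(seq)` could then be handed to `subprocess.run(..., capture_output=True, text=True)` and the result fed into `classify_output` to label the vertex reached by that sequence.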
First, observe that -b/-t seems to be doing precisely what we would hope for: toggle between binary/text mode. Good!
Next, observe that --tag/--untagged seem to be the flags that have the weird behavior attached to them. In particular, the T state seems to be more than one actual state, probably differentiated along the "text-binary-axis".
Removing --untagged from the brute-force search reveals that --tag always pulls the state in the binary direction:
Removing --binary from the brute-force search reveals that --untagged always pulls the state away from E (so in a binary-ish direction), but A is unreachable ("Asterisk", which indicates a binary file in the untagged format):
Hypothesis: There are three steps along the "text-binary-axis": always-binary, always-text, and binary-ish. For simplicity, let's assume the same thing along the tagged-ness-axis.
By the previous observations, --tag implies either always-binary or binary-ish. (Probably "binary-ish".)
Ending in bU does not determine the result:
- bU outputs A
- TbU outputs S
- UbU outputs A
- bUbU outputs A
- TUbU outputs A
- UTbU outputs S
- Therefore, U does not set the binary-ness to a constant, but rather depends on the tagged-ness. Huh?
- Assuming that we start with "tagged-ish" and T/U set "always-tagged/always-untagged", this means that "tagged-ish" and "always-untagged" do not interfere with the binary-ness, but in the "always-tagged" state it sets "binary-ish". What a surprising decision! (It probably made sense at the time it was written, and is probably also why it is no longer listed in --help.)
… and that finally predicts the correct behavior without any exceptions, hooray!
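Written out as code, the rules above amount to a tiny state machine. The following is a sketch reconstructed only from the observations in this issue; it is not copied from the linked check_model.py or from the GNU source, and the function name `predict` and the use of `None` for the "-ish" states are choices made for this illustration.

```python
def predict(flags):
    """Predict the vertex (E/T/S/A) reached by a flag sequence such as 'TbU'."""
    binary = None  # None = "binary-ish", True = always-binary, False = always-text
    tagged = None  # None = "tagged-ish", True = always-tagged, False = always-untagged
    for flag in flags:
        if flag == "b":
            binary = True
        elif flag == "t":
            binary = False
        elif flag == "T":
            # --tag pulls the state back to "binary-ish" (assumed, see above).
            tagged = True
            binary = None
        elif flag == "U":
            # --untagged only resets the binary-ness when coming from always-tagged.
            if tagged is True:
                binary = None
            tagged = False
        else:
            raise ValueError(f"unknown flag letter: {flag!r}")
    if tagged is None or tagged:
        # Tagged output cannot express text mode, hence the error.
        return "E" if binary is False else "T"
    return "A" if binary is True else "S"


for seq in ["bU", "TbU", "UbU", "bUbU", "TUbU", "UTbU"]:
    print(seq, predict(seq))
```

For the six sequences ending in bU this prints A, S, A, A, A, S, matching the list above, and `predict("t")` gives E while `predict("tT")` gives T, matching the two --text invocations at the top.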
A simple piece of logic, but so much pain.
End result: https://github.com/BenWiederhake/worsethanfailure_cksum/blob/master/check_model.py#L19
ah fun, exactly what I was working on yesterday. :) you will make life significantly easier