kdl number suffix type annotations

Fixes: https://github.com/kdl-org/kdl/issues/510

Mar 30 '25 23:03 zkat

@tabatkins honestly the most heartbreaking thing about all this to me is that I can't have 123u32 and such :( Gotta do 123#u32

Mar 30 '25 23:03 zkat

@bgotink does this align with your understanding/experience/what you did in your lib?

Apr 01 '25 20:04 zkat

alright, tests added. This is ready for final review.

Apr 02 '25 21:04 zkat

The following tests that previously failed now run successfully:

test name	test	equivalent document
`bare_ident_numeric_fail.kdl`	`node 0n`	`node (n)0`
`bare_ident_numeric_sign_fail.kdl`	`node +0n`	`node (n)+0`
`illegal_char_in_binary_fail.kdl`	`node 0bx01`	`node (bx01)0`
`multiple_x_in_hex_fail.kdl`	`node 0xx10`	`node (xx10)0`
`no_digits_in_hex_fail.kdl`	`node 0x`	`node (x)0`

Apr 05 '25 19:04 bgotink

Uggghhhhh. That makes sense. Looks like we’re gonna need to be more specific what order things run in. I have an idea for the grammar.

Apr 07 '25 17:04 zkat

@bgotink

The following tests that previously failed now run successfully:

Looking at this...

These should be dropped because they're ok now:

test name test equivalent document bare_ident_numeric_fail.kdl node 0n node (n)0 bare_ident_numeric_sign_fail.kdl node +0n node (n)+0

These tests should stay and continue to fail. Once you've hit a 0b/0o/0x number prefix, you SHOULD only be able to parse their related number formats:

test name test equivalent document illegal_char_in_binary_fail.kdl node 0bx01 node (bx01)0 multiple_x_in_hex_fail.kdl node 0xx10 node (xx10)0 no_digits_in_hex_fail.kdl node 0x node (x)0

Apr 17 '25 18:04 zkat

uggghhhh I understand why those last 3 tests pass now. I want to do something about it, though. I don't like that. I wonder if there's a good way around it.

Apr 17 '25 20:04 zkat

@bgotink @tabatkins I've... kind of changed the rules a bit. They're simpler now. And most importantly: we have 0u64 available now!

I checked #510 and I didn't see any discussion about us tackling these rules, because we were focusing so much on what the rules for the suffix should be that we didn't take a step back and think of these numbers as a whole, and how the number syntax in general could be addressed.

Please lmk what you think. I think with these changes, we'll get the results I expected in https://github.com/kdl-org/kdl/pull/513#issuecomment-2813782924

Apr 17 '25 21:04 zkat

Some key changes:

I reorganized things a bit in general.
The grammar itself now blocks simultaneous suffix and prefix annotations
The underscore _ initial character restriction was removed: a bare suffix CAN'T start with one because the integer will slurp it up first
The complex rules meant to disambiguate from non-decimals were removed. The fact that we only allow bare prefixes on decimals is sufficient to protect us here imo. I think these complex rules formed because things were moving fast and we didn't take a step back and really rethink the implications of only doing decimals. This has been clarified in the spec. With the new rules, there is no dangerous ambiguity, just potential syntax errors on certain zero values, which is notably NOT an unexpected parsing success, which I think is the dangerous bit.
The rules around exponential-likes have been changed a bit to guard against small typos if you miss the digit part (so 1e+ is illegal)

Apr 18 '25 05:04 zkat

I'm also wondering: could we just drop the exponent restriction (but keep the e+ protection, in case someone fails to write that digit and ends up with an unfortunate parse)

Apr 18 '25 06:04 zkat

I don't think the specification prose as currently written does require or even allow a trailing _ to be consumed by the number ("digits ... may be separated by _"), but the authoritative grammar does include arbitrary underscores in place of digits anywhere except the beginning.

I think either the description of the grammatical language needs tightening, or the grammar may now allow some undesirable constructions with underscores and suffixes. Specifically, I'm not sure whether * is expected to commit a parser to whatever it finds the first time, or whether it can backtrack to allow later productions to match. Here, it would be backtracking to an underscore after the input didn't match when the underscore was consumed as part of a number.

Consider 12_3,x. The overall match would fail when (digit | '_')* from the integer production consumes _3; does it then backtrack and try consuming less of the input? If it stops short after "12" and leaves "_3,x", the whole input matches suffixed-decimal successfully with significand accepting "12" and bare-type-suffix accepting "_3,x" — so the result would be equivalent to (_3,x)12, though I think it should be an error. 12_3.4 and 12_3.4.5 differ at the same point, or consider 1_234,567. All of these can be produced from the grammar, at least.

I'm not sure whether I am misreading the description of the grammar language. * consumes "as many instances as possible without failing the match" — is that failing the match of the whole input, or does it just mean as many instances as are present at this point in the input and failure isn't really part of it? The comparison to standard regex semantics and the existence of cut points makes me think it can shorten if needed. If it does commit early to the longest sequence it finds, this issue doesn't come up. Otherwise, for integer to definitively slurp the underscore up and make this an error, I think there would need to be a cut point suffixed-decimal := significand ¶ (bare-type-suffix | (exponent? explicit-type-suffix)).

If there is an issue, either banning _ as the initial character of a suffix again or solidifying the grammatical handling of integers could address it. I am not personally a fan of baking parsing and backtracking rules into a grammar and would probably just block _, but there are reasons to have that kind of grammar too, particularly to constrain where an error is detected.

Something shaped like 1_234,567 seems like the primary case where this is realistic and actually matters, or someone trying to write a list 1_234, 5_678, .... I do think the grammar should rule these out, but if I were implementing a parser directly off this grammar right now, I would end up with this backtracking and unknowingly accepting these cases because nothing commits the parse to the path that produces an error. If I built the parser with a lexer in the front, I think the lexer would probably consume the whole number as expected and then I wouldn't encounter the issue. Clearly one of those is wrong, but it's not good for compatibility if both seem reasonable to make.

If nothing else, test cases including both underscores within the number and invalid suffixes will be good to ensure that the incorrect readings are detected.

I also wonder in a similar way about 0xaz. The prose does rule this out. This time it's the grammar semantics of - in significand-initial I'm not certain about: "any digit except something that matches the literal '0x'" seems like it'd be the same as just digit alone, because "0x" is not a digit and is not matched by digit. The intention here is clear, just the formalism may not match up.

Apr 19 '25 00:04 mwh

Consider 12_3,x [snip discussion about why this successfully matches the current grammar]

Yup, the _ prefix restriction was actually load-bearing unless we introduce more cut-points, which I'd like to avoid except when necessary. (They're easy to forget about and then accidentally have side-effects.) We need to bring it back for the grammar (even if it might not be worth calling out in the prose, tho I actually do think some discussion about prefixes that won't work as expected due to parsing rules is worthwhile in the prose).

Apr 28 '25 19:04 tabatkins

I think these complex rules formed because things were moving fast and we didn't take a step back and really rethink the implications of only doing decimals. This has been clarified in the spec. With the new rules, there is no dangerous ambiguity, just potential syntax errors on certain zero values, which is notably NOT an unexpected parsing success, which I think is the dangerous bit.

This isn't correct, there's still an unexpected parsing success in some cases. See 20x1, for example - totally valid under these rules (as expected), equivalent to (x1)20. But if the value becomes 0, it misparses - 0x1 parses equivalent to 1, not (x1)0. That's the scenario we were afraid of, and the reason those restrictions existed.

If the part following the x doesn't look like hex digits (20xfoo), then sure, when the value is 0 it'll just be a parse failure rather than a misparse. (We were somewhat worried about that as a usability issue, but I agree it can be a reasonable bullet to bite.) But the misparse when the following does look like hex digits is the real problem.

Apr 28 '25 19:04 tabatkins