kdl
kdl copied to clipboard
Remove '#' from legal characters in a bare identifier
Based on discussion here: https://github.com/kdl-org/kdl/discussions/200
So KQL current picks out two more symbols that are currently legal identifiers: ~
and +
See https://github.com/kdl-org/kdl/blob/main/QUERY-SPEC.md#selectors
Should we use those, or reserve something else?
Do you for-see adding general regex support to string matchers? The natural operator seems like ~=
which is perhaps another argument for removing ~
. That said, I think it depends entirely on the formal grammar for KQL. If whitespace is required between identifiers and operators, then I don't think ~
/+
being legal is an issue, since as far as I can tell the token in that position can only be an operator.
EDIT: Haha just kidding, I forgot a b
was a legal (and common) selector. I suppose we do have the option of breaking the css similarity and moving descendent selectors to something like a >> b
.
+
isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.
+
isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.
oh I didn't realize that. That's right!
I suppose we do have the option of breaking the css similarity and moving descendent selectors to something like
a >> b
.
I like this a lot, but thinking about verbosity, I think it's must more straightforward to say a b
than a >> b
, esp with "real" node names? I think the >>
will become cumbersome? thoughts?
I'm also pretty in favor of removing ~
from identifiers and reserving it. If we do that, I can't think of anything else we might want to reserve that's important to.
I don't think >>
is too cumbersome, and as someone who doesn't do much complex css, it better communicates the meaning, imo.
I'm in favor of removing ~
as well.
I'm also kinda wanting to bring back /
?
if we do that, URLs would become legal identifiers in most cases? I'm trying to think of when they wouldn't be, but maybe "URL encodable" might be the target we want fo KDL 2.0?
https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid/1547940#1547940
Or maybe not. That means we would need to include []();='
, among others :(
I feel like /
is even more fraught than any of the others we've discussed. It sucks to make them force-quoted, but take
// "foo"
Is that a commented line or a node named //
?
oh right I forgot about comments haha
Should we make a new branch and make this PR against that, btw? Like a 2.0 tracking branch that has 1.0-breaking changes in it?
boom, done :)
you might need to rebase onto it tho I think I did something bad.
+
isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.
Yes it is, so long as the second character isn't a digit; the grammar just prevents confusion with numbers. It's allowed by itself as a bare-ident.
I don't think >> is too cumbersome, and as someone who doesn't do much complex css, it better communicates the meaning, imo.
Fwiw, the CSSWG (aka, me, in this case) tried to add >>
as an alternate way to spell the descendant combinator. It failed due to lack of impl interest, but I would have liked it (and ++
as the way to spell ~
).
So how about making the final list of removed characters just #
and +
?
I'm fine with that. (We could keep +
as an ident char but just remove it from initial ident chars if that would still be useful, so KQL could still tell apart +
from an ident. I don't have a strong opinion either way.)
ugh but + is so useful, esp as a first char. I wonder if there's something else we could use for "sibling"/"next"?
Also, should we pre-emptively ban |
? I feel like it smells like a bracket identifier to me. And it means we could use it for OR syntaxes?
after thinking about it, I don't know if we should remove any extra characters beyond #
. We're removing #
because it potentially really complicates some kinds of implementations, but I don't think we need to reserve anything else for the sake of the query language as long as we make one change: queries must be <matcher> <operator> [<matcher> <operator>]
. That is, we no longer do descendants with spaces, and use >>
for "descendants", as @tabatkins described.
I think that'll take care of it, and we can just merge this branch. What do y'all think?
Yeah, that sounds really good. It leaves KQL's evolution much more open without tying it so intimately to KDL's precise syntax model, and avoids restricting KDL itself unnecessarily.
(Recall that CSS's implicit descendant combinator comes from the very first version, when there weren't any combinators at all. Child/next-sibling/general-sibling are all later inventions layered on top. We don't need to repeat that.)
Re: #
in particular, a final possibility I don't think has been brought up is just disallowing an ident from starting with r#
specifically. This appears to be kdl4j's current behavior, per the conversation in #200.
It's certainly slightly clumsier than a simple blanket prohibition, but it also means we maintain maximum ident syntax without requiring excessive lookahead.
I'm wavering. It's definitely clearer and probably easier to implement with parser generators to simply ban #
, but I can absolutely see people wanting to use hashtag-like identifiers and it feels worthwhile to make that clean.
Yeah, so just outlawing the numberish or rawstringish starts are easy for hand-rolled parsers, but for the grammar itself, let's see...
// current
bare-identifier := ((identifier-char - digit - sign) identifier-char* | sign ((identifier-char - digit) identifier-char*)?) - keyword
// proposed new grammar
bare-identifier := (bare-ident-start identifier-char*) - keyword
bare-ident-start := ((identifier-char - digit) identifier-char?) - (sign digit) - "r#"
// or possibly
bare-identifier := (unambiguous-ident | numberish-ident | stringish-ident) - keyword
unambiguous-ident := (identifier-char - digit - sign - "r") identifier-char*
numberish-ident := sign ((identifier-char - digit) identifier-char*)?
stringish-ident := "r" ((identifier-char - "#") identifier-char*)?
I suspect the latter is the most compatible with parser gens, since the subtractions are single-character set subtractions, rather than sequence subtractions like the - keyword
.
Note: With https://github.com/kdl-org/kdl/pull/354, and potentially with whatever happens with #350, I think we can be much less aggressive here and just ban #
as the first character in identifiers, like we do with signs and digits already. I'll keep this open for now, but I'm thinking of just resolving things that way.
This is done in kdl-v2 now, so I'm closing this one. #
is outright illegal now.