kdl Remove '#' from legal characters in a bare identifier

Based on discussion here: https://github.com/kdl-org/kdl/discussions/200

Sep 26 '21 23:09 hkolbeck

So KQL currently picks out two more symbols that are currently legal identifiers: ~ and +

See https://github.com/kdl-org/kdl/blob/main/QUERY-SPEC.md#selectors

Should we use those, or reserve something else?

Sep 26 '21 23:09 zkat

Do you foresee adding general regex support to string matchers? The natural operator seems like ~=, which is perhaps another argument for removing ~. That said, I think it depends entirely on the formal grammar for KQL. If whitespace is required between identifiers and operators, then I don't think ~/+ being legal is an issue, since as far as I can tell the token in that position can only be an operator.

EDIT: Haha just kidding, I forgot a b was a legal (and common) selector. I suppose we do have the option of breaking the css similarity and moving descendent selectors to something like a >> b.

Sep 27 '21 00:09 hkolbeck

+ isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.

Sep 27 '21 00:09 hkolbeck

+ isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.

oh I didn't realize that. That's right!

I suppose we do have the option of breaking the css similarity and moving descendent selectors to something like a >> b.

I like this a lot, but thinking about verbosity, I think it's much more straightforward to say a b than a >> b, esp with "real" node names? I think the >> will become cumbersome? thoughts?

I'm also pretty in favor of removing ~ from identifiers and reserving it. If we do that, I can't think of anything else we might want to reserve that's important too.

Sep 27 '21 01:09 zkat

I don't think >> is too cumbersome, and as someone who doesn't do much complex css, it better communicates the meaning, imo.

I'm in favor of removing ~ as well.

Sep 27 '21 02:09 hkolbeck

I'm also kinda wanting to bring back /?

Sep 28 '21 00:09 zkat

if we do that, URLs would become legal identifiers in most cases? I'm trying to think of when they wouldn't be, but maybe "URL encodable" might be the target we want for KDL 2.0?

Sep 28 '21 00:09 zkat

https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid/1547940#1547940

Or maybe not. That means we would need to include []();=', among others :(

Sep 28 '21 00:09 zkat

I feel like / is even more fraught than any of the others we've discussed. It sucks to make them force-quoted, but take

// "foo"

Is that a commented line or a node named //?

Sep 28 '21 00:09 hkolbeck

oh right I forgot about comments haha

Sep 28 '21 00:09 zkat

Should we make a new branch and make this PR against that, btw? Like a 2.0 tracking branch that has 1.0-breaking changes in it?

Sep 28 '21 00:09 zkat

boom, done :)

you might need to rebase onto it tho I think I did something bad.

Sep 28 '21 00:09 zkat

+ isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.

Yes it is, so long as the second character isn't a digit; the grammar just prevents confusion with numbers. It's allowed by itself as a bare-ident.

Oct 05 '21 16:10 tabatkins

I don't think >> is too cumbersome, and as someone who doesn't do much complex css, it better communicates the meaning, imo.

Fwiw, the CSSWG (aka, me, in this case) tried to add >> as an alternate way to spell the descendant combinator. It failed due to lack of impl interest, but I would have liked it (and ++ as the way to spell ~).

Oct 05 '21 16:10 tabatkins

So how about making the final list of removed characters just # and +?

Oct 05 '21 17:10 zkat

I'm fine with that. (We could keep + as an ident char but just remove it from initial ident chars if that would still be useful, so KQL could still tell apart + from an ident. I don't have a strong opinion either way.)

Oct 05 '21 18:10 tabatkins

ugh but + is so useful, esp as a first char. I wonder if there's something else we could use for "sibling"/"next"?

Also, should we pre-emptively ban |? I feel like it smells like a bracket identifier to me. And it means we could use it for OR syntaxes?

Oct 05 '21 18:10 zkat

after thinking about it, I don't know if we should remove any extra characters beyond #. We're removing # because it potentially really complicates some kinds of implementations, but I don't think we need to reserve anything else for the sake of the query language as long as we make one change: queries must be <matcher> <operator> [<matcher> <operator>]. That is, we no longer do descendants with spaces, and use >> for "descendants", as @tabatkins described.

I think that'll take care of it, and we can just merge this branch. What do y'all think?
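To make the proposed query shape concrete, here is a rough Python sketch of a splitter that expects whitespace-separated tokens alternating between matchers and operators. The operator set, the function name `parse_query`, and the reading that a query starts and ends with a matcher are all assumptions for illustration, not something the query spec is confirmed to say.

```python
# Hypothetical operator set: ">>" is the descendant operator proposed in this
# thread; ">", "+" and "~" are assumed here, borrowed from the CSS combinators.
OPERATORS = {">>", ">", "+", "~"}


def parse_query(query):
    """Split a query into alternating (kind, token) pairs.

    Assumes tokens are separated by whitespace and that every
    odd-positioned token must be an operator.
    """
    tokens = query.split()
    if len(tokens) % 2 == 0:
        # An even count means two matchers are adjacent (or the query is
        # empty), which this sketch rejects: no implicit descendant by space.
        raise ValueError("expected matchers separated by operators")
    parsed = []
    for position, token in enumerate(tokens):
        if position % 2 == 0:
            # Matcher position: any token is taken as a matcher, even "+" or "~".
            parsed.append(("matcher", token))
        elif token in OPERATORS:
            parsed.append(("operator", token))
        else:
            raise ValueError(f"expected an operator, got {token!r}")
    return parsed
```

Because position alone decides whether a token is an operator, a node literally named + or ~ can still appear as a matcher, while a b is rejected rather than read as a descendant query.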

Oct 06 '21 03:10 zkat

Yeah, that sounds really good. It leaves KQL's evolution much more open without tying it so intimately to KDL's precise syntax model, and avoids restricting KDL itself unnecessarily.

(Recall that CSS's implicit descendant combinator comes from the very first version, when there weren't any combinators at all. Child/next-sibling/general-sibling are all later inventions layered on top. We don't need to repeat that.)

Oct 06 '21 05:10 tabatkins

Re: # in particular, a final possibility I don't think has been brought up is just disallowing an ident from starting with r# specifically. This appears to be kdl4j's current behavior, per the conversation in #200.

It's certainly slightly clumsier than a simple blanket prohibition, but it also means we maintain maximum ident syntax without requiring excessive lookahead.

Oct 06 '21 06:10 tabatkins

I'm wavering. It's definitely clearer and probably easier to implement with parser generators to simply ban #, but I can absolutely see people wanting to use hashtag-like identifiers and it feels worthwhile to make that clean.

Oct 06 '21 20:10 hkolbeck

Yeah, so just outlawing the numberish or rawstringish starts is easy for hand-rolled parsers, but for the grammar itself, let's see...

// current
bare-identifier := ((identifier-char - digit - sign) identifier-char* | sign ((identifier-char - digit) identifier-char*)?) - keyword

// proposed new grammar
bare-identifier := (bare-ident-start identifier-char*) - keyword
bare-ident-start := ((identifier-char - digit) identifier-char?) - (sign digit) - "r#"

// or possibly
bare-identifier := (unambiguous-ident | numberish-ident | stringish-ident) - keyword
unambiguous-ident := (identifier-char - digit - sign - "r") identifier-char*
numberish-ident := sign ((identifier-char - digit) identifier-char*)?
stringish-ident := "r" ((identifier-char - "#") identifier-char*)?

I suspect the latter is the most compatible with parser gens, since the subtractions are single-character set subtractions, rather than sequence subtractions like the - keyword.
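For anyone who wants to poke at the last variant, here is a minimal hand-rolled Python sketch of it, one function per production. The character classes are simplified assumptions: the excluded punctuation set and the keyword list (true, false, null) are stand-ins rather than quotes from the spec, and `str.isspace` only approximates the spec's whitespace and newline rules.

```python
# Assumptions for illustration only: simplified stand-ins for the spec's
# identifier-char exclusions and keyword list.
NON_IDENT_CHARS = set('\\/(){}<>;[]=,"')
KEYWORDS = {"true", "false", "null"}
DIGITS = set("0123456789")
SIGNS = set("+-")


def is_identifier_char(c: str) -> bool:
    return not c.isspace() and c not in NON_IDENT_CHARS


def all_identifier_chars(s: str) -> bool:
    return all(is_identifier_char(c) for c in s)


def is_unambiguous_ident(s: str) -> bool:
    # (identifier-char - digit - sign - "r") identifier-char*
    return (
        s != ""
        and all_identifier_chars(s)
        and s[0] not in DIGITS
        and s[0] not in SIGNS
        and s[0] != "r"
    )


def is_numberish_ident(s: str) -> bool:
    # sign ((identifier-char - digit) identifier-char*)?
    if s == "" or s[0] not in SIGNS:
        return False
    rest = s[1:]
    return rest == "" or (rest[0] not in DIGITS and all_identifier_chars(rest))


def is_stringish_ident(s: str) -> bool:
    # "r" ((identifier-char - "#") identifier-char*)?
    if s == "" or s[0] != "r":
        return False
    rest = s[1:]
    return rest == "" or (rest[0] != "#" and all_identifier_chars(rest))


def is_bare_identifier(s: str) -> bool:
    # (unambiguous-ident | numberish-ident | stringish-ident) - keyword
    return (
        is_unambiguous_ident(s) or is_numberish_ident(s) or is_stringish_ident(s)
    ) and s not in KEYWORDS
```

Under these assumptions, hashtag-like names such as #tag and a#b are accepted, a lone + or r is accepted, and only the number-like start (+1) and the raw-string-like start (r#raw) are rejected. Each check looks at no more than the first two characters before falling through to the plain identifier-char test.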

Oct 06 '21 22:10 tabatkins

Note: With https://github.com/kdl-org/kdl/pull/354, and potentially with whatever happens with #350, I think we can be much less aggressive here and just ban # as the first character in identifiers, like we do with signs and digits already. I'll keep this open for now, but I'm thinking of just resolving things that way.

Dec 11 '23 16:12 zkat

This is done in kdl-v2 now, so I'm closing this one. # is outright illegal now.

Dec 13 '23 07:12 zkat
