
v2.0 additional restricted literal characters

Open tabatkins opened this issue 3 years ago • 3 comments

Currently, idents disallow a few characters from being expressed literally, requiring they be escaped if authors want to include them:

  • codepoints < 0x20 (control characters)
  • codepoints > 0x10FFFF (invalid codepoints)
  • some ASCII characters reserved for syntax reasons

I think there are a few more we can reasonably restrict to make KDL documents more readable/understandable:

Removing 0x7F just seems like fixing an omission; it's easy to forget that the ASCII control characters aren't contiguous.

Removing the direction-control characters helps keep KDL source readable; the direction override characters in particular are somewhat fraught to show up in plain-text documents, as they can corrupt the display of following text in the wrong direction (as demonstrated in the recent somewhat-hyperbolic complaints about them showing up in Rust and other source languages as a possible review-attack). If these characters are desired for use in text values, such as strings, they can still be escaped; their literal usage in what is otherwise an ASCII-based language is virtually always either accidental or malicious, since they're intended for text formatting and have no semantic meaning.

The BOM is allowed at the start of a KDL document

(A previous issue suggested restricting the surrogate-pair characters as well (0xD800-DFFF); these are already restricted implicitly by the requirement that KDL documents be encoded in UTF-8, where such codepoints can't be validly encoded. As such I'm continuing to omit them from these suggestions.)
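To illustrate the point about surrogates, here is a minimal Python sketch showing that a strict UTF-8 codec rejects codepoints in the 0xD800-DFFF range in both directions. The helper names `encodes_as_utf8` and `decodes_as_utf8` are made up for this example and are not part of KDL or any KDL library.

```python
# Hypothetical helpers for illustration only.
# Lone surrogates (U+D800..U+DFFF) have no valid UTF-8 encoding,
# so Python's strict UTF-8 codec refuses to encode or decode them.

def encodes_as_utf8(cp: int) -> bool:
    """Return True if the codepoint can be encoded as strict UTF-8."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes are valid strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(encodes_as_utf8(0xD7FF))             # last codepoint before the surrogate range
print(encodes_as_utf8(0xD800))             # first surrogate
print(decodes_as_utf8(b"\xed\xa0\x80"))    # bytes a naive encoder would emit for U+D800
```

Under these assumptions, a parser that only accepts valid UTF-8 input never sees a surrogate codepoint, so no separate rule is needed for them.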

While there are still a number of "invisible" characters in Unicode that could potentially be confusing or accidental, they also have semantic uses, so I don't currently recommend restricting them.
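As a rough sketch of what the proposed ident check could look like, the Python below flags control characters, 0x7F, direction-control characters, and a BOM that is not at the start of the document. The exact set of direction-control codepoints is an assumption on my part (the Unicode bidi marks, embeddings/overrides, and isolates); the names `BIDI_CONTROLS`, `is_restricted_literal`, and `first_restricted` are hypothetical, and the ASCII characters reserved for syntax are deliberately left out.

```python
# Hypothetical sketch, not taken from the KDL spec or any KDL library.
# Assumed direction-control set: Unicode bidi marks, embeddings/overrides, isolates.
BIDI_CONTROLS = (
    set(range(0x200E, 0x2010))    # LRM, RLM
    | set(range(0x202A, 0x202F))  # LRE, RLE, PDF, LRO, RLO
    | set(range(0x2066, 0x206A))  # LRI, RLI, FSI, PDI
)
BOM = 0xFEFF


def is_restricted_literal(cp: int, at_document_start: bool = False) -> bool:
    """Return True if the codepoint may not appear literally in an ident."""
    if cp < 0x20 or cp == 0x7F:   # ASCII control characters, including DEL
        return True
    if cp in BIDI_CONTROLS:
        return True
    if cp == BOM:                 # only tolerated as the very first codepoint
        return not at_document_start
    return False


def first_restricted(ident: str):
    """Return (index, 'U+XXXX') of the first restricted codepoint, or None."""
    for index, ch in enumerate(ident):
        if is_restricted_literal(ord(ch)):
            return index, f"U+{ord(ch):04X}"
    return None


print(first_restricted("node"))            # None
print(first_restricted("no\u202ede"))      # (2, 'U+202E')
```

Other "invisible" characters such as U+200B (zero width space) pass this check, which matches the recommendation not to restrict them.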

tabatkins, Nov 04 '21 23:11

I think you should not remove direction characters as they make it impossible to literally encode bidirectional strings, which is important for internationalization. It is true that BIDI control characters can create review‐attacks, but KDL is not a programming language and the probability of someone needing to encode lengthy strings (which may include bidirectional text) is pretty high. Linters and formatters can be used by individual projects to detect and warn about the use of these characters if needed; there is no reason to forbid them at a language level.

I would suggest disallowing exactly the same characters as RestrictedChar in XML 1.1, plus U+0000, U+FFFE, and U+FFFF (which are not allowed to be escaped in XML either).

marrus-sh, Sep 05 '22 05:09

> I think you should not remove direction characters as they make it impossible to literally encode bidirectional strings, which is important for internationalization.

To be clear, the proposal is only to remove them from identifiers like node. There's nothing stopping someone from using them (in either literal or escaped form) in a quoted string.

Lucretiel, Sep 07 '22 19:09

Well, my post was unclear; I talked about the ident restrictions at first, but then later mentioned being able to include them in strings via escapes.

But yeah, I think just talking about idents is fine. (Notably, you can't escape anything in raw strings, which would be somewhat limiting.)

tabatkins, Sep 07 '22 20:09

These changes have been merged into the kdl-v2 branch

zkat, Dec 13 '23 05:12