kdl Merge KDL v2

Here it is! The long-awaited KDL v2, which is where we go ahead and make a handful of technically-breaking changes to address some corner cases we've run into over the past year while KDL has been getting implemented in a bunch of languages by various people.

I'd love to get feedback on what we have slated, and whether there's anything else we should definitely include when this goes out.

Aug 28 '22 20:08 zkat

/cc @CAD97

Aug 28 '22 20:08 zkat

I have a slight preference for #241 over #204 personally, though only slight.

Aug 28 '22 20:08 CAD97

I have a preference for #204, because the primary use case I can see for # in bare identifiers is hashtag-like which would be illegal under either, and it seems better to go with the simpler rule.

That preference is not terribly strong, though.

Edit: I misread, I'm fine with either

Aug 29 '22 15:08 hkolbeck

> the primary use case I can see for # in bare identifiers is hashtag-like

To clarify, #241 allows #ident as a bare ident, and both will of course still allow "r#ident" as a quoted ident.

Argument for allowing: transliterating CSS selectors, for e.g. CSS-in-KDL. Argument against allowing: using the syntax in KQL as a selector like CSS.

Aug 29 '22 19:08 CAD97

> Argument for allowing: transliterating CSS selectors, for e.g. CSS-in-KDL. Argument against allowing: using the syntax in KQL as a selector like CSS.

#foo in CSS is special-casing the id attribute. KQL doesn't have an equivalent to HTML's id, and using #foo syntax in KQL to mean something else might be confusing given its meaning in CSS, so I don't find the argument against compelling.

My inclination is to prefer #241 as well, as I think being able to write hashtags is neat. It also allows for doing things like writing Nix flake references as bare words, e.g. nixpkgs#hello.

Aug 31 '22 06:08 lilyball

Can we squeeze https://github.com/kdl-org/kdl/issues/213 into this? The specific proposal is the addition of escaped whitespace in string literals– that \, followed by literal (non-escaped) whitespace, should consume and discard all that whitespace. This is a slight simplification of the Rust rule, which specifically requires that \ be followed by \n.

Aug 31 '22 17:08 Lucretiel

I'm also a fan of #213, though it seems like there's some ambiguity in the discussion. Namely, does

- "x\
    y\
    z"

Translate to "xyz", "x y z", or "x\ny\nz"?

Aug 31 '22 17:08 hkolbeck

> Can we squeeze #213 into this? The specific proposal is the addition of escaped whitespace in string literals– that \, followed by literal (non-escaped) whitespace, should consume and discard all that whitespace. This is a slight simplification of the Rust rule, which specifically requires that \ be followed by \n.

@Lucretiel do you have time to put together a PR with this grammar+prose change? I'm game.

Aug 31 '22 17:08 zkat

> > Can we squeeze #213 into this? The specific proposal is the addition of escaped whitespace in string literals– that \, followed by literal (non-escaped) whitespace, should consume and discard all that whitespace. This is a slight simplification of the Rust rule, which specifically requires that \ be followed by \n.

> @Lucretiel do you have time to put together a PR with this grammar+prose change? I'm game.

Yes, tonight I can put that together :) should it be in the form of an amendment to SPEC.md?

Aug 31 '22 20:08 Lucretiel

> I'm also a fan of #213, though it seems like there's some ambiguity in the discussion. Namely, does
> - "x\
>     y\
>     z"
> Translate to "xyz", "x y z", or "x\ny\nz"?

I agree there's some ambiguity in the original. That example would translate to "xyz", because all literal whitespace after the \ is consumed and discarded. If you want to retain whitespace, it should either come before the \ or itself be escaped. I think my comment (https://github.com/kdl-org/kdl/issues/213#issuecomment-929869117) succinctly describes this.
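
To make the rule above concrete, here is a minimal sketch of how an unescaping pass might treat a backslash followed by literal whitespace. It is an illustration only, not text from the spec or from any KDL implementation: the function name `unescape` is hypothetical, only a handful of escapes are handled, and Rust's `char::is_whitespace` is assumed to be an acceptable stand-in for whatever KDL ends up defining as whitespace.

```rust
/// Hypothetical helper: resolves escapes in the body of a string literal
/// (the text between the quotes). Only a few escapes are handled here.
fn unescape(body: &str) -> String {
    let mut out = String::new();
    let mut chars = body.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('\\') => out.push('\\'),
            Some('"') => out.push('"'),
            // Proposed rule: a backslash followed by literal whitespace
            // discards that whitespace and any whitespace that follows it.
            // Assumption: Rust's notion of whitespace is close enough.
            Some(w) if w.is_whitespace() => {
                while chars.peek().map_or(false, |n| n.is_whitespace()) {
                    chars.next();
                }
            }
            // Unknown escapes and a trailing backslash are kept verbatim
            // in this sketch; a real parser would likely reject them.
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            None => out.push('\\'),
        }
    }
    out
}
```

Under this sketch, whitespace placed before the backslash survives and whitespace after it does not, which matches the "put it before the \ or escape it" guidance.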

Aug 31 '22 20:08 Lucretiel

> > > Can we squeeze #213 into this? The specific proposal is the addition of escaped whitespace in string literals– that \, followed by literal (non-escaped) whitespace, should consume and discard all that whitespace. This is a slight simplification of the Rust rule, which specifically requires that \ be followed by \n.

> > @Lucretiel do you have time to put together a PR with this grammar+prose change? I'm game.

> Yes, tonight I can put that together :) should it be in the form of an amendment to SPEC.md?

yep!

> I agree there's some ambiguity in the original. That example would translate to "xyz", because all literal whitespace after the \ is consumed and discarded. If you want to retain whitespace, it should either come before the \ or itself be escaped. I think my comment (https://github.com/kdl-org/kdl/issues/213#issuecomment-929869117) succinctly describes this.

Is this what Rust does? I would've expected that to at least preserve the first newline. Then again, this is consistent with KDL's existing escline rule where \<newline> is the same as <non-newline whitespace>

Aug 31 '22 21:08 zkat

> Is this what Rust does?

[playground]

[src/main.rs:2] dbg!("\
    here\
    is\
    an\
    example\
    ") = "hereisanexample"

Aug 31 '22 21:08 CAD97

It's worth noting that bash behaves similarly as far as just dropping the newline, though it doesn't consume space afterward:

❯ echo foo\
… ❯ bar\
… ❯ baz
foobarbaz

With that I think xyz is the right output, and am +1 on including it in v2

Edit: Scratch that, I'm a space cadet:

❯ echo foo\
      bar\
      baz
foo bar baz

I'm more prone to emulating bash over rust, but I'm curious how others feel

Aug 31 '22 22:08 hkolbeck

Bash's behavior is concerned with syntactic whitespace (i.e., allowing commands to spread over multiple lines with line continuations). It doesn't meaningfully behave in terms of consuming or not consuming specific whitespace so much as it extends a line to the next line while retaining the separation of tokens for a command. In your echo example, all that's happened is that foo, bar, and baz have correctly been passed as different arguments to echo; it's no different than:

> echo foo             bar \
   baz
foo bar baz

Kaydle has basically the same behavior with its own line continuation syntax, where you can use a \ to continue a single node into the next line. All these nodes are the same:

node 1 2 3
node 1    2   3
node 1\
  2\
  3

#213 is instead concerned with treatment of escaped whitespace in strings, where I think the plain consumption of unescaped whitespace makes the most sense.

> Is this what Rust does? I would've expected that to at least preserve the first newline. Then again, this is consistent with KDL's existing escline rule where \<newline> is the same as <non-newline whitespace>

Rust does just consume all whitespace, regardless of type. The canonical way to add newlines to a whitespace-escaped string is to escape them:

assert_eq!(
    "line 1\n\
    line 2\n\
    line 3\n",

"line 1
line 2
line 3
"
);

Though more commonly I use it to stretch out long sentences with simple spaces:

assert_eq!(
    "This is a sentence with a \
    lot of words in it.",
    "This is a sentence with a lot of words in it."
);

Aug 31 '22 23:08 Lucretiel

That makes sense, and the distinction is certainly important. Thanks for the complete writeup.

Sep 01 '22 00:09 hkolbeck

Adding escaped whitespace note to the changelog: https://github.com/kdl-org/kdl/pull/291

Sep 01 '22 15:09 hkolbeck

Nudging the thread because I've added https://github.com/kdl-org/kdl/issues/250 to the bucket of things we should probably discuss for 2.0

Sep 05 '22 06:09 zkat

Is https://github.com/kdl-org/kdl/discussions/177 worth including in discussions here?

Sep 19 '22 01:09 hkolbeck

yeah, probably. Although I'm inclined towards having foo and ("")foo be distinct values. I'm kinda iffy on the special case here.

Sep 19 '22 03:09 zkat

I'm very split. On the one hand it is a special case and I'm very averse to those, but as @Patitotective noted, keeping them distinct would force implementations to distinguish between the two, which would lead to a more complex API in some languages (JS is top of mind). In addition, I just don't see a use case for blank type annotations given that impls are free to define their own.

I'm not really trying to go either direction here, just laying out thoughts.

Sep 19 '22 04:09 hkolbeck

Why would it be hard in JS? Can't JS just use null versus ""?

I haven't really been following this and I have no experience with type annotations in KDL but my initial reaction here is that specifying "" is potentially semantically distinct from not specifying a type annotation.

I'm curious what languages are actually expected to have a problem here? Languages without proper optional support tend to have some concept of null. Even in Go I'd expect that you could use a pointer-to-string in order to have nil.

That said, I've been using languages with proper optional support for long enough that I'm not sure how much of an ergonomic problem it would be to require folks to handle null type annotations in languages like Go or JS.

Sep 19 '22 04:09 lilyball

JS gives you null, but empty strings are falsey. It's for sure a small thing, it just means instead of writing:

if (val.annotation) {
   ...
}

You have to write:

if (val.annotation !== null) {
    ...
}

I would count that as a more complex API.

Sep 19 '22 15:09 hkolbeck

For most statically typed languages (like Nim) you have to use Option types which distinguish between no value and empty value. This would make APIs more complex and seems pointless to me, ("")node should be the same as node.

Sep 19 '22 15:09 Patitotective

I'm pro making them distinct values. Packages that work with the CST to provide an API for modifying KDL text while retaining comments and formatting would be more complex and likely inconsistent without the distinction.

If an empty and an absent annotation are considered the same, these packages would need to track that. Even if they then map empty and absent annotations onto the same public value, it would lead to confusing behaviour in the locations of the different CST nodes:

node /* comment */ ()null
//                |         < end of the leading whitespace for the `null` value
//                   |      < start of the `null` value
//                 __       < missing locations

Making the empty annotation () part of the leading whitespace would be wrong cf. the language specification.

Imo the best solution for these packages is to expose the difference between an empty and an absent annotation: consistent CST node locations and no () as part of whitespace.

Sep 19 '22 17:09 bgotink

One minor point on @bgotink's example, which I think is a good one: ()val does not agree with the spec by my reading; it has to be ("")val.

Sep 19 '22 19:09 hkolbeck

I agree with @larsgw here in that if this is considered a problem to solve, the better solution would be to forbid zero-length identifiers rather than do some magic to make ("") equivalent to no type annotation. Given that type annotations are not given any meaning[^1] by the specification, it's fine imho for an implementation to treat a present-but-zero-length type annotation equivalently to no type annotation.

[^1]: > KDL does not specify any restrictions on what implementations might do with these annotations. They are free to ignore them, or use them to make decisions about how to interpret a value.

So my vote here is no change.

Sep 19 '22 20:09 CAD97

I agree on considering this a problem and forbidding zero-length identifiers. This change would convert valid tests like blank_node_type.kdl, blank_arg_type.kdl, blank_prop_type.kdl, empty_quoted_prop_key.kdl and empty_quoted_node_id.kdl to invalid, and it would need to be specified in the spec. If this is okay, I'll create a PR.

Sep 19 '22 21:09 Patitotective

By the way, is there a reason why some tests use empty and others blank?

Sep 19 '22 21:09 Patitotective

I'm pro-distinct too. I think it's fine if a particular consumer of KDL wants them to be identical, or even if a particular implementation of type hints treats them as identical, but I think that it makes sense that a KDL data model treats them as distinct (essentially, as Option<String>). I'd be opposed to requiring implementations to treat them as identical.

One example would be a particular KDL implementation that uses annotations exclusively as strong type hints. I'd want 123 to be dynamically typed, (f64)123 to be a float, and ("")123 to be an error.
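
A rough sketch of that example, assuming a data model where the annotation is carried as an `Option` of a string. The names `Typed` and `interpret` are made up for illustration and do not come from any existing KDL crate; the point is only that `None` (no annotation) and `Some("")` (an explicitly empty one) can take different paths.

```rust
/// Hypothetical result of interpreting a numeric value under
/// annotation-driven typing.
#[derive(Debug, PartialEq)]
enum Typed {
    /// No annotation was written, e.g. `123`.
    Dynamic(f64),
    /// An explicit `(f64)` annotation was written, e.g. `(f64)123`.
    Float(f64),
}

/// `annotation` is `None` when absent and `Some("")` when written as `("")`.
fn interpret(annotation: Option<&str>, raw: f64) -> Result<Typed, String> {
    match annotation {
        None => Ok(Typed::Dynamic(raw)),
        Some("f64") => Ok(Typed::Float(raw)),
        // Anything else, including the empty annotation, is rejected
        // by this particular (hypothetical) consumer.
        Some(other) => Err(format!("unknown type annotation {:?}", other)),
    }
}
```

A consumer that prefers to treat `("")` like a missing annotation could add a `Some("")` arm that returns the same thing as the `None` arm; the distinction in the data model does not prevent that.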

Sep 20 '22 20:09 Lucretiel

I'd be opposed to banning 0-length identifiers; I think that adds more complexity / confusion than it's worth. Currently, an identifier is either "bare identifier" or "quoted string", and I'm not a fan of making it instead a special non-empty subset of "quoted string". I really like how the string is the "escape hatch" into unusual identifiers and don't really see the value in constraining it (especially since a vast majority of languages don't have an ergonomic way to express "string that's definitely not zero-length").

Sep 20 '22 20:09 Lucretiel