Embedded Language Indicators for raw string literals

[x] Proposed
[ ] Prototype: Not Started
[ ] Implementation: Not Started
[ ] Specification: Not Started

Summary

When we were designing raw string literals, we intentionally left the door open for putting a language indicator at the end of the opening """ for the multi-line form. This proposal adds the support to do that.

Motivation

In the BCL, we added StringSyntaxAttribute for applying to parameters, which allows parameters to indicate the strings passed to them contain some form of embedded language, which is then used for syntax highlighting. However, this only works for strings passed directly to the parameter. For strings first stored in a variable, the only solution is a // lang = x comment. This means that, if the IDE wants to extract a multi-line raw string literal, it cannot neatly preserve the highlighting that was used. This syntax form is intended to help bridge that gap.
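To illustrate the gap described above, here is a minimal sketch, assuming .NET 7 or later and C# 11. The Matches helper and the regex pattern are hypothetical and exist only for this example; the lang=regex spelling of the comment is one form the IDE is assumed to recognize.

```csharp
using System.Diagnostics.CodeAnalysis;
using System.Text.RegularExpressions;

static class Demo
{
    // Hypothetical helper: the attribute tells tooling that `pattern` holds a regex.
    static bool Matches([StringSyntax(StringSyntaxAttribute.Regex)] string pattern, string input)
        => Regex.IsMatch(input, pattern);

    static void Main()
    {
        // Passed directly to the annotated parameter: tooling can highlight it as a regex.
        Matches("""^\d{3}-\d{4}$""", "555-1234");

        // Stored in a variable first: the attribute no longer applies,
        // so a comment is the only way to mark the embedded language.
        // lang=regex
        var pattern = """^\d{3}-\d{4}$""";
        Matches(pattern, "555-1234");
    }
}
```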

Detailed design

The existing raw string literal proposal has the following multi-line grammar:

multi_line_raw_string_literal
    : raw_string_literal_delimiter whitespace* new_line (raw_content | new_line)* new_line whitespace* raw_string_literal_delimiter
    ;

This is updated to the following:

multi_line_raw_string_literal
    : raw_string_literal_delimiter identifier? whitespace* new_line (raw_content | new_line)* new_line whitespace* raw_string_literal_delimiter
    ;

The only change is the optional identifier, which is added immediately after the opening delimiter.
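As a rough sketch of what the updated grammar would permit: the snippet below uses the proposed syntax, so it does not compile with current C# compilers. The sql indicator is only an assumed example; this proposal does not define a set of recognized identifiers.

```csharp
// Proposed syntax only: an identifier directly after the opening delimiter.
var query = """sql
    SELECT * FROM table
    """;

// The identifier is optional in the grammar, so the existing form stays valid.
var plain = """
    SELECT * FROM table
    """;
```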

Drawbacks

This form is not equally applicable to all string types, so it would only apply to multi-line raw string literals. Ideas on other forms that could be more broadly applied would be useful: maybe putting the identifier after the closing quote could work?

Alternatives

Unresolved questions

Design meetings

https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-09-21.md#embedded-language-indicators-for-raw-string-literals
https://github.com/dotnet/csharplang/blob/main/meetings/2023/LDM-2023-10-09.md#embedded-language-indicators-for-raw-string-literals

Jun 27 '22 22:06 333fred

I would not limit this to identifier, as that would not allow things like C# or even things like CS-Statement etc. It feels like it would be something akin to raw_string_literal_delimiter not-whitespace-not-new-line+ whitespace* new_line

Jun 27 '22 22:06 CyrusNajmabadi

I would not limit this to identifier, as that would not allow things like C# or even things like CS-Statement etc.

Not to argue either way, but that doesn't seem to limit markdown.

Jun 27 '22 23:06 HaloFour

you can use ```c# in markdown. For example:

class Foo { }

Jun 27 '22 23:06 CyrusNajmabadi

Markdown also accepts file extensions, which I find easier to type. But either way, backtick is not going to be (re)considered for string literals, right?

Jun 28 '22 11:06 alrz

@alrz I would only support backticks if we were actually adding support for markdown (something I do want).

Jun 28 '22 15:06 CyrusNajmabadi

Why limit it to only language indicators? It could be anything, e.g. locale or color or any other editor hint.

Jun 28 '22 23:06 vladd

Sure, we won't be stopping whatever you want to put there. However, my intention with this proposal is that editors will use it to drive interior language highlighting.

Jun 28 '22 23:06 333fred

It's not limited to only language indicators. It's just that that's a primary consumption case.

e.g. locale or color or any other editor hint.

These are also 'language indicators' :)

Jun 28 '22 23:06 CyrusNajmabadi

Why not this?

return """ // lang = cs
class Foo { }
""";

A single-line comment would fit after """, after all. Having a Markdown-style identifier there is neat, but I would be confused if I saw it somewhere, thinking that it would somehow affect the type of the string. A comment conveys the meaning well, and it keeps the existing format. Unless you actually want to be able to programmatically extract the language information...

Jul 07 '22 20:07 IS4Code

Why not this?

Primarily verbosity. It seems especially excessive given how markdown is commonly used to write ```c#.

A comment conveys the meaning well

If you prefer that, that's already supported. You can do both:

// lang=c#
return """
class Foo { }
""";

Or

return /* lang=c# */ """
class Foo { }
""";

Given that, we don't need an interior-form of this comment. But having a simple interior form that is much less verbose than the comment form would be nice.

Jul 07 '22 20:07 CyrusNajmabadi

maybe putting the identifier after the closing quote could work

Something like this?

var example = """
    SELECT * FROM table
    """sql;

var example = """SELECT * FROM table"""sql;

I prefer that TBH. It's low-importance metadata and is more "out of the way" when it's appended. It's also "outside of the string" this way, like a tag.

Is the above technically possible? Is it possible to add whitespace before the "tag"?

var example = """
    SELECT * FROM table
    """ sql;

var example = """SELECT * FROM table""" sql;

It's cleaner/less "squashed" that way.

Jul 08 '22 18:07 glen-84

@glen-84 yes, those are potential alternatives we can consider.

However, it is unlikely, as "text on the outside" already has meaning today and actually affects the semantics of the string, e.g. """X"""u8 means "this is a UTF-8 string". The point of "text on the inside" is that it has no meaning to the language. It's effectively trivia used for other tools to decide what to do.

Jul 08 '22 18:07 CyrusNajmabadi

I see. In a way that could also be seen as a "tag" or "metadata", so it could make sense to extend that in a more general sense, in a way that clearly indicates its user-defined nature.

// Would this be confused with a preprocessor directive?
var example = """SELECT * FROM table"""#sql;

var example = """SELECT * FROM table"""u8#sql;

This could apply equally to regular strings.

"Tagged string literals"

Jul 08 '22 19:07 glen-84

@glen-84 We'll keep those alternatives in mind when designing this. Thanks!

Jul 08 '22 20:07 CyrusNajmabadi

Not to bikeshed too much, but I'm a bit torn. I do like the markdown-like approach of having the tag on the opening line of the raw literal. I think it's easier to see what the dialect is without having to find the end of the literal, plus it's familiar. But that poses a problem with single-line raw literals. Having the tag at the end for a single-line literal looks nicer, but collides with the decision to use a suffix to denote UTF-8 literals. The "tag" approach does solve that, but IMO isn't very attractive. Maybe a prefix?

var example1 = sql"""SELECT * FROM table;""";
var example2 = sql"""
    SELECT *
    FROM table;
""";

And while I'm partial to the feature it does feel a little weird that the syntax would only exist to facilitate tooling. Almost feels like something better served through source attributes.

Jul 08 '22 20:07 HaloFour

Outside of the string has a big problem in terms of detection and allowable syntax. Within the string, we can literally allow anything, e.g. c# (where # could easily be a problem outside of the string).

Jul 08 '22 22:07 CyrusNajmabadi

I'm not sure whether something like the following would work but maybe quote the hint like this:

var example1 = """sql"SELECT * FROM table;"""";
var example2 = """sql"
    SELECT *
    FROM table;
"""";

If not then maybe quoting it like this:

var example1 = """"sql"SELECT * FROM table;""";
var example2 = """"sql"
    SELECT *
    FROM table;
""";

Jul 11 '22 07:07 iam3yal

@eyalalonn That's hard to parse visually at least. The entire meaning changes if you add one more " at the end of the single-line form, which may not be visible without horizontal scrolling as you're reading the code.

For the multiline form, I prefer not having those extra quotes.

Jul 11 '22 14:07 jnm2

@jnm2 I agree so maybe like you said we can just do without the quoting when multiline is used and have the quoting when it's needed like this:

var example1 = """sql"SELECT * FROM table;"""";
var example2 = """sql
    SELECT *
    FROM table;
""";

Jul 11 '22 15:07 iam3yal

6 quotation marks per string is already more than enough. 😅

Jul 11 '22 15:07 glen-84

@glen-84 The more quotes you add the more power you have. 😄

Jul 11 '22 16:07 iam3yal

@eyalalonn This can't work. Example:

var s = """Hello"@eyalalonn""""; // your proposal will just write: @eyalalonn (but with a language named Hello 😄 )

Jul 11 '22 17:07 FaustVX

@FaustVX Single-line raw string literals can't be used if the first or last character in the string is a double quote.

Jul 11 '22 18:07 jnm2

@jnm2 Ok, I was just using @eyalalonn's example.

Jul 11 '22 18:07 FaustVX

@FaustVX Given Eyal's suggested rules, we would expect it to write @eyalalonn with a language named Hello. Can you explain more on what you mean by "this can't work"?

Jul 11 '22 19:07 jnm2

I didn't know that a double quote at the end of the string doesn't compile. So I thought his proposal would compile but produce something different than the raw string literal proposal. That's what I wanted to say by "This can't work".

Jul 11 '22 19:07 FaustVX

Whatever the format, I think the indicator should precede the content. In a case like reading from a stream, the indicator would allow redirecting the read process right away. Otherwise you would need to buffer the content until you found out what its context/language is.

Aug 10 '22 22:08 Randy-Buchholz

I'm curious about the benefits of this. In languages like JavaScript you can use this backtick syntax which allows you to have that string passed into a function that can perform some kind of operation on the data.

For example:

html`<p>I am some HTML</p>`

Is it being proposed that C# will allow this kind of scenario, or is the string purely for decoration purposes (and can't be interrogated at runtime)? Note I'm not necessarily advocating for this, as I find that you end up with dubious benefits over actually just passing the string into a method ;) e.g. htmlFunc("<p>I am some HTML</p>")

I guess I'm just wondering what problem is being solved?

edit: I guess the IDE could interpret them and provide an improved experience.

Nov 13 '22 02:11 mitchdenny

@mitchdenny the op lists the motivations. :-)

Nov 13 '22 03:11 CyrusNajmabadi

Is the grammar for multi_line_raw_string_literal defined in some ANTLR file? Or is the grammar represented using other tooling?

Dec 22 '22 20:12 Korporal
