jakt icon indicating copy to clipboard operation
jakt copied to clipboard

String Interpolation

Open dangrie158 opened this issue 1 year ago • 19 comments

Hello friends,

I had some time this Sunday and decided to hack a bit on the self hosted jacket compiler to add a feature I was always thinking would really contribute to readability when watching videos of JT of Andreas: String Interpolation!

The current syntax uses curly braces without any prefix so you can write

let interpolation = "there"
println("Hello {interpolation}")

The part between the braces is parsed as an expression so you can also do stuff like:

let part_of_the_answer = 13
println("The answer is {29 + part_of_the_answer}")

The expressions are also type checked. For the code generation I currently check if there are any expressions in the string that need to be interpolated and only then generate a call to String::formatted, otherwise the basic String() constructor is used. I'm sure there are many edge cases I have forgotten but I thought before I put much more time into this I ask the community for comments. I'm sure there are lots of improvements since this is my first time contributing to the project and also hacking on a compiler, so be gentle :^)

Some stuff I'm unsure about:

  • How to handle old style format string when mixed with interpolated expressions? I think it would be best to check that only one style of formatting is used
  • Currently during lexing a list of subtokens is parsed and attached to the Token. Would it be nicer to keep the "Token stream" flat and not embed the information? What would be the best factoring here?

Have a nice Sunday everyone and let me know what you think?

dangrie158 avatar Sep 04 '22 17:09 dangrie158

Hello!

One or more of the commit messages in this PR do not match the Jakt code submission policy, please check the lint_commits CI job for more details on which commits were flagged and why. Please do not close this PR and open another, instead modify your commit message(s) with git commit --amend and force push those changes to update this PR.

BuggieBot avatar Sep 04 '22 17:09 BuggieBot

Currently these are indistinguishable from normal strings, so cosegen emitting braces or json generation become needlessly difficult. And I remember discussions on Stringinterpolations running against a wall on discord, maybe discuss this there again

Second

  • Write the commit message subject > line in the imperative mood ("Foo: Change the way dates work", not "Foo: Changed the way dates work").
  • Write your commit messages in proper English, with care and punctuation.

You are missing a verb

Hendiadyoin1 avatar Sep 04 '22 17:09 Hendiadyoin1

This looks pretty interesting. I'm wondering if it makes putting literal { in strings impossible now? there would have to be a way to escape the curly brace. There's some finer syntax things to think about, e.g. should the string be prefixed with something? (e.g. python f-strings f"foo {some_var}" or perhaps "foo <some fancy character>{some_var}".

It's probably worth hopping onto discord to discuss in the #jakt-design channel to see if folks are ok with moving forward with this approach to iterate on, if there's fundamental problems, if there's minor changes requested, etc.

ADKaster avatar Sep 04 '22 20:09 ADKaster

Thanks for your comments,

Currently these are indistinguishable from normal strings, so cosegen emitting braces or json generation become needlessly difficult.

Good point. I did not think about this. This should be fixable in many different ways (see below)

And I remember discussions on Stringinterpolations running against a wall on discord, maybe discuss this there again

That's exactly what I thought although I'm wasn't very active on Discord so I didn't see this discussion. But I thought instead of bike shedding in the Idea stage I first come up with a possible solution as a basis for further discussion

You are missing a verb

Thanks, I hope I fixed this now. Let me know if this is still wrong, as I said this is my first contribution to the project :)

I'm wondering if it makes putting literal { in strings impossible now?

Well, due to a bug it actually was, but that is fixed now, see the example. Its still a bit weird because of different escaping characters in the FormatParser and Jakt lexer but as I wrote in the commit message, this will probably fix itself when we switch to a more "elaborate" format like ${ something }.

As for how to fix your objections, here is an (incomplete) list of possible solutions with (incomplete) advantages and disadvantages I see.

Current implementation "This will be {interpolated}"

Examples in the wild: PHP (kind of?, I admit I'm going by wikipedia here) Advantages:

  • clean
  • easy
  • possible (as proven by this PR)

Disadvantages:

  • Makes it harder to generate text with lots of curlies (json/code)
  • maybe some scenario where it would be ambiguous?

Special interpolation prefix "This value is ${interpolated}"

Examples in the wild: Swift, Kotlin Advantages:

  • Easy to lex/parse
  • unambiguous
  • fixes the concern of generation code/json mostly (wll, except you want to generate interpolated strings :) )

Disadvanatges:

  • Very slight chance of backward compatibility issues, but easy to grep for

Special string prefix: f"This is an {interpolated} String"

Examples in the wild: Pythons f-String or Nims fmt Strings Advantages:

  • No backward compatibility issues
  • Easy to lex/parse
  • unambiguous

Disadvanatges:

  • Special syntax
  • (I guess) many languages that chose this way did so because they started without string interpolation and needed to be fully backward compatible
  • Probable source of bugs (forgetting the special identifier at the start of the string)

Special string quoting: This is an {interpolated} String

Examples in the wild: JavaScript / TypeScript Advantages:

  • No backward compatibility issues
  • relatively easy to lex / parse
  • unambiguous

Disadvanatges:

  • Yet another quoting style
  • In the case of backticks like JS/TS: Backticks are in some locales hard to type since they are used as a combinatory key by default
  • (I guess) many languages that chose this way did so because they started without string interpolation and needed to be fully backward compatible

No String Interpolation "This makes me {sad}"

Advantages:

  • Less code to manage
  • no possible source of bugs in the compiler Disadvantages:
  • Splitting the format string from the expressions to fill the placeholders is harder to read, especially if many placeholders are filled
  • Jakt is a modern language that should have modern features :)

I hop on discord now and try to start a discussion there.

dangrie158 avatar Sep 05 '22 06:09 dangrie158

I just pushed a new commit that implements the proposed "This is ${interpolated}" syntax that got the most upvotes on discord.

I'm still happy to get any suggestions on the code itself.

All tests pass for me with the current implementation

dangrie158 avatar Sep 05 '22 09:09 dangrie158

Just implemented support in the interpreter:

> let foo = 42
> "The answer is ${foo:o} in octal"
= "The answer is 52 in octal"

dangrie158 avatar Sep 05 '22 12:09 dangrie158

Can you squash your commits?

Hendiadyoin1 avatar Sep 05 '22 17:09 Hendiadyoin1

Can it handle "${ "${ 0 }" }" (nested interpolated strings)?

fmease avatar Sep 05 '22 18:09 fmease

Can you squash your commits?

Sure, done :)

Can it handle "${ "${ 0 }" }" (nested interpolated strings)?

No, but I'm also honestly not sure how you could parse this in any reasonable way. Currently in this case there would be an error that the opening curly is unmatched after the second quote.

> "${ "${ 0 }" }"
Error: unmatched { in interpolated string

I you really want to do something like this for a good reason I think it would be way more readable to embed a format() call into the expression like "${ format(fmt_str, 0) }", however you still can't nest the quotes.

> let fmt_str = "{}"
> "${ format(fmt_str, 0) }"
= "0"

dangrie158 avatar Sep 05 '22 18:09 dangrie158

No, but I'm also honestly not sure how you could parse this in any reasonable way.

You could just look for the matching close curly (simple paren counting), and then parse whatever is in between as an expression. This would handle the case properly

Not that the example in question is necessarily useful - but to me this seems like a more robust way of parsing anyway.

It's always bothered me in other languages where the language should be smart enough to see that the quotes are inside the format-expression.

mustafaquraish avatar Sep 05 '22 22:09 mustafaquraish

but to me this seems like a more robust way of parsing anyway.

Well, I need to refactor the lexing & parsing anyway, so I will give it a go. Guess you are right although it probably makes some things harder (e.g. syntax highlighting). Anyway, I'll give it a go.

BTW. is there a way to mark this PR as a draft?

dangrie158 avatar Sep 06 '22 05:09 dangrie158

I just did a refactor of the lexing & parsing steps. Notable changes:

  • Parsing now supports nested interpolated strings ("${ "${ 0 }" }") which isn't something you should necessarily do but as Mustafa said that's a more robust way of parsing
  • Escaping of an interpolated expression now works using double curlies instead of a backslash. This is how the StringFormatter class does it and makes things much easier to implement since \$ is not a valid escape sequence in C++ and thus needs to be specially treated during lexing and code generation
  • QuotedString and InterpolatedString are now distinct types so its easier to special case type checking and code generation for any of the two types
  • Lexing yields a 1-D stream of tokens again instead of embedding an array of Tokens into a QuotedString

I think this is now ready for a first code review :)

dangrie158 avatar Sep 07 '22 05:09 dangrie158

You have a typo in your commit message: "forat"

networkException avatar Sep 07 '22 06:09 networkException

You have a typo in your commit message: "forat"

Thanks, fixed

dangrie158 avatar Sep 07 '22 06:09 dangrie158

I added syntax highlighting support in the vscode extension just now. Other editors are not yet supported and just render interpolated blocks as part of the string but that's at least a start.

I have this addition in a separate commit for now since I think it makes sense to have those 2 parts separated but let me know if you disagree. From my testing the code works pretty solid now, any comments appreciated.

dangrie158 avatar Sep 08 '22 05:09 dangrie158

I'd be more comfortable with this if the syntax looked a little different than strings. "${foo} bar" should always just be a string so you know what it is when you're looking at it without having to read the whole thing.

Requiring a $ in front of the string to interpolate gives that immediate indication that it's an interpolated string.

let x = 100
let y = 200
let result = $"my result is {x + y}"

sophiajt avatar Sep 09 '22 17:09 sophiajt

Hi jt, I just started a 2 week vacation so I won’t be able to work on this immediately.

"{foo} bar" should always just be a string

well it is, the syntax that was most liked from the alternatives above is "${foo} bar" and wouldn’t even that be a string after interpolation? :)

I‘m not quite sure I understand your concerns. From a developer perspective isn‘t
"${x + y} is the result " nicer to read than any of format("{} is the result ", x + y) or result_str + " is the result "? The latter is a nice example where you actually have to read the whole line to know that the result of the expression is a string. Also the syntax highlighting I implemented for the vscode extension also helps when reading to parse the interpolation part

Anyway, I have absolutely no problem changing the syntax, just want to understand the concern better to find a good solution.

CxByte also had the suggestion of supporting tagged strings someday so I would propose fmt"${x + y} is the result ", then if tagged strings land someday fmt can simply be a runtime comptime function, although that would be possible with the current syntax as well if you just use fmt as the default (Like JS kindof does). In this case $ would be a wierd identifier :^)

dangrie158 avatar Sep 09 '22 19:09 dangrie158

@dangrie158 - my main concern is readability.

As one example, think of a long string

"this is a very very long string and this string may have places that get converted to something but you don't know ${hi}"

In your version, you don't know that's a string interpolation until you read the entire string from start to finish. You need to know this bit of info during reviews and updating existing code because there are cases, namely the kernel, where you must know if you're going to allocate. Having to read the whole string to figure out if its an interpolation slows you down and makes the ergonomics worse if you think of how the language is used for different purposes.

$"this is a very very long string and this string may have places that get converted to something but you don't know {hi}"

Having an indicator on the front of the string tells you immediately if it's an allocating interpolation or a string literal.

sophiajt avatar Sep 11 '22 02:09 sophiajt

Your right, fair point.

I thought about the tagged strings again, if we would allow any comptime function it would probably make sense to also support any return value, then using double quotes makes no sense. If you are interested I could try to implement tagged templates closely following the JS implementation. This would mean changing the quotes to backticks (``) which should also adress your concerns? It would also make a clear separation from string both in the compiler and for the reader. (Plus support for macros which could be nice for some usecases :) )

dangrie158 avatar Sep 11 '22 09:09 dangrie158

hmm as someone who got badly bitten by forgetting to add the f prefix to what I wanted to be a format string in python, would it be ok if we forbid interpolations in non-format strings? I.e. parse them in normal strings also and then throw if the string is not marked as a format string?

maddanio avatar Sep 19 '22 15:09 maddanio

hmm as someone who got badly bitten by forgetting to add the f prefix to what I wanted to be a format string in python, would it be ok if we forbid interpolations in non-format strings? I.e. parse them in normal strings also and then throw if the string is not marked as a format string?

Feels weird to throw a warning / error by default based on the content of a regular string? I don't know if this is a sane default

mustafaquraish avatar Sep 19 '22 16:09 mustafaquraish

I'm closing this in favour of #1195

dangrie158 avatar Oct 10 '22 07:10 dangrie158