Clarify newline tokenization

Open alandonovan opened this issue 4 years ago • 5 comments

An escaped newline in a string literal (i.e. a backslash at end of line) is ignored by the scanner. An escaped carriage return should be treated the same way, and specified as such. See https://github.com/google/starlark-go/issues/144 and https://github.com/google/starlark-rust/issues/280.

alandonovan avatar Jan 25 '21 17:01 alandonovan

To be clear, it seems like "treated the same way" means that the character put into the string literal is LF, regardless of whether the NEWLINE token was generated by a physical LF, CR, or CRLF. That's the Python behavior:

>>> ord(r'''
... ''')
10

where the second line is produced either by hitting Enter or Ctrl+M. That's the behavior we're talking about codifying here?

brandjon avatar Mar 12 '25 20:03 brandjon

This was about an escaped newline within a string literal. There is no NEWLINE token, only a string.

Unescaped:

xtools$ python3 -c $'print(ord("""\n"""))'
10
xtools$ python3 -c $'print(ord("""\r"""))'
10
xtools$ starlark -c $'print(ord("""\r"""))'
10
xtools$ starlark -c $'print(ord("""\n"""))'
10

Escaped:

xtools$ python3 -c $'print("""\\\n""")'

xtools$ python3 -c $'print("""\\\r""")'

xtools$ starlark -c $'print("""\\\n""")'

xtools$ starlark -c $'print("""\\\r""")'

adonovan avatar Mar 12 '25 20:03 adonovan

I think the formatting swallowed your console output in the escaped case.

But my understanding is that the correct behavior is for CR to be treated the same as LF (or CRLF), whether or not it's escaped, whether or not it's in a raw string literal, and whether or not it's in a triple-quoted string. This follows because CR and CRLF are normalized to LF before string literals are processed (I'm not sure whether that happens before tokenization or as part of it).

brandjon avatar Mar 14 '25 16:03 brandjon

The Escaped examples were intended to demonstrate that the actual call was print(""), since print adds a newline. Perhaps a clearer demonstration would be:

python3 -c $'print(len("""\\\n"""))'
0

In other words, an escaped newline or carriage return in a multiline string literal denotes the empty string.
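The same check can be run for all three physical line endings at once. The following is a minimal sketch, assuming CPython 3, where `eval` accepts CR and CRLF line endings in source passed as a string; the names `line_endings` and `lengths` are illustrative only.

```python
# Build the same one-expression program with each physical line ending.
# The source text is: len("""\<EOL>"""), i.e. a backslash directly before
# the line ending inside a triple-quoted string.
line_endings = {"LF": "\n", "CR": "\r", "CRLF": "\r\n"}

lengths = {}
for name, eol in line_endings.items():
    source = 'len("""\\' + eol + '""")'
    # Assumption: CPython's string-based tokenizer translates CR and CRLF
    # to LF before scanning, so each variant is an escaped newline.
    lengths[name] = eval(source)

print(lengths)
```

If the three line endings are treated alike, as the console runs above suggest, every entry in `lengths` is 0.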

adonovan avatar Mar 14 '25 17:03 adonovan

Ok, I get it now -- third time's the charm.

I think we can generalize this issue to expanding the TODO in the spec for defining the tokenization of newlines. If we just say that all CR and CRLF sequences are normalized to LF, that should also imply the correct behavior for triple-quoted string literals. Of course, we can call out this special case too for clarity.
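As a rough illustration of that normalization rule, here is a minimal sketch in Python. It is not taken from any Starlark implementation; `normalize_newlines` is a hypothetical helper that models only the "CR and CRLF become LF" step, applied to source text before any string literal is scanned.

```python
def normalize_newlines(source: str) -> str:
    """Hypothetical pre-scan step: map every CRLF and lone CR to LF."""
    # Replace CRLF first so that it becomes one LF rather than two.
    return source.replace("\r\n", "\n").replace("\r", "\n")


# Three spellings of the same statement, differing only in the physical
# line ending that follows the backslash inside the string literal.
variants = [
    'x = """\\\n"""',    # backslash + LF
    'x = """\\\r"""',    # backslash + CR
    'x = """\\\r\n"""',  # backslash + CRLF
]

normalized = {normalize_newlines(v) for v in variants}
print(len(normalized))  # 1: all three collapse to the backslash + LF form
```

Because the three variants are identical after this step, whatever rule the scanner applies to an escaped LF applies to an escaped CR or CRLF as well, with no separate case needed.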

brandjon avatar Mar 16 '25 16:03 brandjon