zig-spec icon indicating copy to clipboard operation
zig-spec copied to clipboard

grammar clarifications regarding tabs and carriage returns

Open andrewrk opened this issue 3 years ago • 18 comments

NL = New Line (0x0a) CR = Carriage Return (0x0d) TAB = Tab (0x09)

Inside Line Comments and Documentation Comments

  • Any TAB is rejected by the grammar since it is ambiguous how it should be rendered.
  • CR directly preceding NL is unambiguously part of the newline sequence. It is accepted by the grammar and removed by zig fmt, leaving only NL.
  • CR anywhere else is rejected by the grammar.

Inside Multi-Line String Literals

  • zig fmt may not mangle multi line string literals, and therefore the control character TAB are rejected by the grammar inside multi-line string literals.
  • CR inside the multiline string literal is also rejected for the same reason
  • However CR directly before NL is interpreted as only a newline and not part of the multiline string. zig fmt will delete the CR.

For string literals that want to include CR, TAB, or any other control sequences, they will need to use regular string literals and the ++ operator, or @embedFile.

Whitespace

  • TAB used as whitespace is unambiguous. It is accepted by the grammar and replaced by the canonical whitespace by zig fmt.
  • CR used as whitespace, whether directly preceding NL or stray, is still unambiguously whitespace. It is accepted by the grammar and replaced by the canonical whitespace by zig fmt.

astronaute-meme

andrewrk avatar Jul 08 '21 01:07 andrewrk

This

zig fmt may not mangle multi line string literals, and therefore the control characters TAB and CR are rejected by the grammar inside multi-line string literals.

causes a problem when using git on windows, unless I'm missing something.

If you clone this gist (which contains no CR) on Windows with the default git configuration, it produces this:

PS C:\Users\vincent\dev\tmp\zig-invalid-bytes2> xxd .\main.zig
00000000: 666e 2063 7265 6174 6554 6573 7454 6162  fn createTestTab
00000010: 6c65 7328 2920 2176 6f69 6420 7b0d 0a20  les() !void {..
00000020: 2020 2063 6f6e 7374 2062 6172 203d 2026     const bar = &
00000030: 5b5f 5d5b 5d63 6f6e 7374 2075 387b 0d0a  [_][]const u8{..
00000040: 2020 2020 2020 2020 5c5c 666f 6f62 6172          \\foobar
00000050: 0d0a 2020 2020 7d3b 0d0a 2020 2020 5f20  ..    };..    _
00000060: 3d20 6261 723b 0d0a 7d                   = bar;..}

which does not compile:

PS C:\Users\vincent\dev\tmp\zig-invalid-bytes2> zig build-exe .\main.zig
.\main.zig:3:17: error: expected ',', found invalid bytes
        \\foobar
                ^
.\main.zig:3:18: note: invalid byte: '\n'
        \\foobar
                 ^

it is the intended behaviour but IMO it's surprising. Currently as far as I can see you need to either configure git with core.autocrlf=false or use a gitattributes files.

vrischmann avatar Jul 15 '21 10:07 vrischmann

It still surprises me that Git's default configuration is to give you the wrong bytes on windows. Can you clarify which client you're using? is it provided by GitHub? by git-scm.com? by a cygwin package manager (probably not this one)?

As much as I'd like to say that Git is wrong (which i kinda did in the last paragraph), the conflict is arising from the concept of a "text file", which many people think is an array of lines rather than a single byte array. Sometimes discussions of character encoding get tangled up in the concept of a "text file" too.

But Zig is not interested in all the complexity of "text files", and specifies always UTF-8, always spaces over tabs, always LF line endings. There's some room for accepting and canonicalizing (in zig fmt) unambiguous alternatives for everyone's convenience, but CRLF line endings are officially the wrong way to end lines in Zig (along with NEL, LS, and PS). While it bothers me emotionally that some Git cient on Windows is wasting our time here talking about CRLF line endings, it seems important to get this right to avoid even more wasted time for everyone in the future.

With that preamble ramble out of the way: i think we should definitely canonicalize CRLF line endings in multiline string literals to LF line endings.

thejoshwolfe avatar Jul 15 '21 13:07 thejoshwolfe

I'm using the installer from git-scm.

vrischmann avatar Jul 15 '21 14:07 vrischmann

Couldn't we just add an error like file uses tabs/carriage returns; run 'zig fmt {path-to-file}' to convert it to the canonical form. It would allow us to completely ban tabs/CRLF while also providing a pretty decent user experience.

If mangling the files is an issue, the feature can be put behind a --convert-whitespace flag.

Vexu avatar Jul 15 '21 14:07 Vexu

could even make zig build call zig fmt on the whole directory before running, or otherwise tell them to run it on the whole folder since its likely not the one file thats mangled

nektro avatar Apr 28 '22 23:04 nektro

Couldn't we just add an error like file uses tabs/carriage returns; run 'zig fmt {path-to-file}' to convert it to the canonical form. It would allow us to completely ban tabs/CRLF while also providing a pretty decent user experience.

Yes, I'm thinking along the same lines. Make it a hard error, just like unused locals, and then have --autofix (as well as zig fmt) make the compilation "just work" while modifying the source files to comply.

andrewrk avatar Oct 31 '22 00:10 andrewrk