zig
zig copied to clipboard
Parser crash when invalid bytes are inside a comment (depends on comment placement)
Zig Version
0.11.0-dev.363+870872dd6
Steps to Reproduce and Observed Behavior
Save the following in a Windows-1252 encoded file:
// ¶
test {}
(the test {} is not important, it can be anything as long as it's a non-comment AFAICT)
Run e.g. zig fmt on it:
thread 33228 panic: integer overflow
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:3755:37: 0x7ff7c825983c in tokensOnSameLine (zig.exe.obj)
return std.mem.indexOfScalar(u8, p.source[p.token_starts[token1]..p.token_starts[token2]], '\n') == null;
^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:209:58: 0x7ff7c82599c1 in warnMsg (zig.exe.obj)
=> if (msg.token != 0 and !p.tokensOnSameLine(msg.token - 1, msg.token)) {
^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:236:22: 0x7ff7c843840a in failMsg (zig.exe.obj)
try p.warnMsg(msg);
^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:222:25: 0x7ff7c8436d71 in fail (zig.exe.obj)
return p.failMsg(.{ .tag = tag, .token = p.tok_i });
^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:604:43: 0x7ff7c8436b7d in expectTestDecl (zig.exe.obj)
if (block_node == 0) return p.fail(.expected_block);
^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:616:32: 0x7ff7c825ef28 in expectTestDeclRecoverable (zig.exe.obj)
return p.expectTestDecl() catch |err| switch (err) {
^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:277:75: 0x7ff7c825c25b in parseContainerMembers (zig.exe.obj)
const test_decl_node = try p.expectTestDeclRecoverable();
^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:60:58: 0x7ff7c811b2c4 in parse (zig.exe.obj)
const root_members = try parser.parseContainerMembers();
^
C:\Users\Ryan\Programming\Zig\zig\src\main.zig:4361:33: 0x7ff7c8272372 in fmtPathFile (zig.exe.obj)
var tree = try std.zig.parse(fmt.gpa, source_code);
^
C:\Users\Ryan\Programming\Zig\zig\src\main.zig:4286:16: 0x7ff7c8121d9d in fmtPath (zig.exe.obj)
fmtPathFile(fmt, file_path, check_mode, dir, sub_path) catch |err| switch (err) {
^
C:\Users\Ryan\Programming\Zig\zig\src\main.zig:4251:56: 0x7ff7c80fe6c4 in cmdFmt (zig.exe.obj)
try fmtPath(&fmt, file_path, check_flag, fs.cwd(), file_path);
^
C:\Users\Ryan\Programming\Zig\zig\src\main.zig:263:22: 0x7ff7c80d251a in mainArgs (zig.exe.obj)
return cmdFmt(gpa, arena, cmd_args);
^
C:\Users\Ryan\Programming\Zig\zig\src\stage1.zig:56:24: 0x7ff7c824cbd7 in main (zig.exe.obj)
stage2.mainArgs(gpa, arena, args) catch unreachable;
^
This is caused by p.token_starts[token1] being larger than p.token_starts[token2]. Specifically, it's attempting:
p.source[5..3]
Note that if the comment is moved after the test block like this:
// either of the below placements avoid the panic
test {} // ¶
// ¶
then we get a nice error message as expected:
invalid.zig:2:12: error: expected type expression, found 'invalid bytes'
test {} // �
^
invalid.zig:2:12: note: invalid byte: '\xb6'
test {} // �
^
Expected Behavior
invalid.zig:1:4: error: expected type expression, found 'invalid bytes'
// �
^
invalid.zig:1:4: note: invalid byte: '\xb6'
// �
^
I investigated this a bit. Seems like the problem is that it finds the invalid bytes token after the test token. If the comment contains valid bytes, it doesn't produce a separate token at all, so i assume it's because comments are considered part of the same token as the thing they're on.
I noticed that the tokenizer seems designed intentionally to do this; it has a pending_invalid_token field that it stores the invalid char in, but then goes on until it finds a valid token, returns that first, then next time it returns the pending invalid token.
There must a reason it's designed this way, but I can't imagine what it is. I would like to get whoever wrote this to weigh in. I was able to fix this issue with a patch that makes it so, after finding the valid token, it first returns the invalid one, then stores the valid token in pending_invalid_token (so it'll be returned next time). Obviously not a good solution because it abuses the field name.
My change also causes this test to fail:
test "invalid literal/comment characters" {
try testTokenize("\"\x00\"", &.{
.string_literal,
.invalid,
});
(But works if I switch the order of expected tokens)
According to git blame, the pending_invalid_token design was implemented in 2017 by Josh Wolfe; at that time it was for catching ASCII control codes. I don't know his github username, though.
Probably related, zig version 0.12.0-dev.2763+7204eccf5 (for me on Windows 11) seems to have an off-by-one error when outputting the invalid byte for me.
On a 4-byte file with only //xL , where x is an unprintable ASCII character (tried with 0x00, 0x01, 0x15, 0x16), zig fmt and zig test both give the following output:
main.zig:1:3: error: expected type expression, found 'invalid bytes'
//L
^
main.zig:1:4: note: invalid byte: 'L'
//L
^
obviously the invalid byte isn't 'L' but the byte before it - the second output is off-by-one.
Also note that the invalid byte isn't displayed at all - both CMD and Windows Terminal simply skip it (while f.e. git bash does display a character there) - should this maybe be accounted for?
Cannot reproduce on master or 0.13.0 Just to be sure, here is the hexdump of the file I tested
00000000 2f 2f 20 b6 0a 74 65 73 74 20 7b 7d 0a |// ..test {}.|
Additionally the off by one error was fixed by #20559