zig fmt: Canonicalize symbol names

Fixes #166

Overview

This adds new behavior to zig fmt which normalizes (renders a canonical form of) quoted identifiers like @"hi_mom" to some extent. This can make codebases more consistent and searchable.

To prevent making the details of Unicode and UTF-8 dependencies of the Zig language, only bytes in the ASCII range are interpreted and normalized. Besides avoiding complexity, this means invalid UTF-8 strings cannot break zig fmt.

Both the tokenizer and the new formatting logic may overlook certain errors in quoted identifiers, such as nonsense escape sequences like \m. For now, those are ignored and we defer to existing later analysis to catch.

This change is not expected to break any existing code.

Behavior

Below, "verbatim" means "as it was in the original source"; in other words, not altered.

If an identifier is bare (not quoted) in the original source, no change is made. Everything below only applies to quoted identifiers.
Quoted identifiers are processed byte-wise, not interpreted as a UTF-8 sequence.
\x and \u escapes are processed.
- If the escape sequence itself is invalid, the sequence is rendered verbatim.
- If the un-escaped codepoint is not ASCII, the sequence is rendered verbatim.
- Otherwise, the character is rendered with formatEscapes rules: either literally or \x-escaped as needed.
If the resulting identifier still contains escapes, it it remains quoted.
If the resulting identifier is not a valid bare symbol ([A-Za-z_][A-Za-z0-9_]*), it remains quoted.
If the resulting identifier is a keyword, it remains quoted.
If the resulting identifier is a primitive value/type (i33, void, type, null, _, etc.), it is rendered unquoted if the syntactical context allows it (field access, container member), otherwise it remains quoted.
Otherwise, it is unquoted. Celebrate in the manner of your choosing.

Oct 29 '22 07:10 hryx

The allocator is used to parse symbols escaped with @"", but there is probably a more efficient approach. I was thinking of adding something like this to std/zig/string_literal.zig:

Hmm, yeah there should be no need to allocate here. inlining the logic of std.zig.isValidId() into the parsing would avoid the need to do any extra in-memory copy in the case where nothing needs to be changed. The function signature you suggest that takes a writer would solve the other case when the output of zig fmt should differ from the input as an identifiers needs to be canonicalized.

Oct 29 '22 10:10 ifreund

Thanks @ifreund, great suggestions. Agreed about guarding against future churn. My original thinking was to avoid duplicating the code from isValidId(), but now I think it makes more sense to combine the parsing and "validation" logic here.

Oct 29 '22 19:10 hryx

The suggestion worked like a charm. Now it covers everything, doesn't allocate, and is much cleaner.

Last concern: I duplicated most of the code from parseAppend into parseWrite. It's not so nice, but the tiny differences didn't make for an obvious (to me) abstraction. I can try my hand at improving code reuse if you'd like, thought I'd push what I have in case you already have a pattern in mind that would fit nicely here.

Once we're good I can squish up these commits. I'll mark it ready now anyway since it's functioning.

Oct 31 '22 01:10 hryx

CI caught a missed case: identifiers that are escaped keywords. Added a stub fix that currently fails but I can come back to look closer shortly. (Apologies for the noise)

Oct 31 '22 03:10 hryx

As mentioned by @InKryption in a comment:

fields with the name type can be written without @"" syntax

This reminds me of some cases I hadn't considered:

const Pumpkin = struct {
    type: Squash, // define field name
    const Squash = struct {};
};

fn f() void {
    var p: Pumpkin = .{
        .type = .{}, // initializer
    };

    p.type = .{}; // field access
}

Identifiers spelled like keywords without @"" escaping are legal in those cases. Should the above cases (maybe others I'm forgetting) be un-escaped?

Oct 31 '22 08:10 hryx

Identifiers spelled like keywords without @"" escaping are legal in those cases. Should the above cases (maybe others I'm forgetting) be un-escaped?

type isn't actually a keyword but rather an identifier in the global scope. Your code should actually already handle that example properly as std.zig.isValidId() returns true for type.

Oct 31 '22 10:10 ifreund

type isn't actually a keyword but rather an identifier in the global scope. Your code should actually already handle that example properly as std.zig.isValidId() returns true for type.

You are so right, it's not a keyword, I had a late-night brainfart. But there is still a problem as revealed by CI (which I'd hastily misinterpreted):

/usr/home/build/zig/zig-cache/o/e7f95f958298c1559827c567f0fba9c1/cimport.zig:669:24: error: name shadows primitive 'type'
pub inline fn CLITERAL(type: anytype) @TypeOf(type) {
                       ^~~~
/usr/home/build/zig/zig-cache/o/e7f95f958298c1559827c567f0fba9c1/cimport.zig:669:24: note: consider using @"type" to disambiguate

The note tells me that you can use @"" to escape primitive shadowing, is that correct? Whereas local/global shadowing is never allowed.

Additionally, my previous comment had an incomplete example. Keeping "@type" quoted is necessary for vars and args (because of shadowing a primitive), while they can safely be un-escaped for struct fields.

The last complication here is:

fn f(@"type": bool) void {
    // These refer to different things:
    _ = @"type";
    _ = type;
}

Oct 31 '22 20:10 hryx

(Pardon my edits if you are reading this after an email alert)

Because of the cases I previously overlooked, I thought I'd write out my assumptions below to make sure I'm targeting the right behavior.

The \u -> \x thing hasn't been mentioned anywhere as far as I know, so let me know if you agree with the proposal. I could see a case for doing it or not doing it, no strong opinion, but it's easy to do.

Behavior of canonicalization

Identifiers not quoted with @"" syntax are always written raw, and none of the below applies to them.

A quoted identifier must remain quoted if any of the following apply to it:

It is empty.
It contains characters that (once unescaped) do not match [A-Za-z0-9_].
It is equal to a keyword.
It would shadow a primitive type or value (e.g., bool, u64, type, _, true) in the current syntactical construct. This applies to variable decls/references, function names, and parameters, but not to struct field definition/access or enum literals.

Otherwise, it will be rendered unquoted.

In either case, when rendering, all escaped characters are unescaped if they would be legal to appear in a Zig string literal. (This is a superset of the alphanumeric pattern above, so we'll never incorrectly render e.g. a ? while unquoted.) Any \u escape that fits into one byte is normalized to a \x escape.

This is opening up a bit is because it introduces normalization even when the thing remains quoted. I believe that fits the spirit of the original issue, but let me know if I'm reading too deeply into it!

Nov 02 '22 07:11 hryx

The note tells me that you can use @"" to escape primitive shadowing, is that correct? Whereas local/global shadowing is never allowed.

Ah yes, this is correct and makes this canonicalization a good bit more complicated unfortunately since it must be aware of where in the AST the identifier appears.

You're probably wondering how best to identify primitive types. The current answer to this is the isPrimitive() function in src/AstGen.zig in the self hosted compiler. I see that translate-c already imports this function from AstGen for it's own purposes. It probably makes sense to move it to std.zig since we need it here as well.

Those assumptions look great, thanks for taking the time to write them out explicitly :)

This is opening up a bit is because it introduces normalization even when the thing remains quoted. I believe that fits the spirit of the original issue, but let me know if I'm reading too deeply into it!

This canonicalization from @"\x65\x72\x72\x6F\x72" to @"error" isn't something I had personally considered but it certainly fits the spirit of the issue and I think it would definitely be worth implementing.

This issue is turning out to be even more of a rabbit hole than I initially thought :D

Nov 02 '22 11:11 ifreund

Snap, somehow my eyes glazed over this comment: https://github.com/ziglang/zig/blob/b40fc70188fd097e9e05b5c18b635b67b718380e/lib/std/zig/render.zig#L187-L188

To keep things simple, I'd be ok worrying about that till after this is PR is done since renderSpace() has no significant impact here. (On the other hand, the only thing I love more than writing code is deleting code, so...)

Nov 03 '22 05:11 hryx

The current answer to this is the isPrimitive() function in src/AstGen.zig in the self hosted compiler. I see that translate-c already imports this function from AstGen for it's own purposes. It probably makes sense to move it to std.zig since we need it here as well.

It looks like the heart of isPrimitive() is the enum defined in src/Zir.zig, and I'm guessing it's pretty far-fetched to move ZIR into stdlib at this point. Maybe the enum itself can live in std, I just didn't see an obvious existing home for it.

What I'll do for tonight is copypaste the enum so as to get this change up to spec and get CI passing, then worry about organization tomorrow -- there's still cleanup to do anyway.

As a second matter, the secret final boss of rendering quoted identifiers is non-ASCII codepoints. I hadn't considered whether we should normalize e.g. @"type\u{1f335}" to @"type🌵", or even vice versa.

I think I have a solve for now, but my dream solution is adopting the %q format specifier from Go. It normalizes strings more or less the way we want here. (Ignore superficial details like escape syntax)

Example: https://go.dev/play/p/ZZPRr5r9TS9

func main() {
	fmt.Printf("Quoted by %%q specifier:               %q\n", "\u26a1 abc? ' \" 世界 cąbbäge \t 🐢")
	fmt.Printf("Quoted (ASCII-only) by %%+q specifier: %+q\n", "\u26a1 abc? ' \" 世界 cąbbäge \t 🐢")
	fmt.Printf("Quoted by %%#q specifier:              %#q\n", "\u26a1 abc? ' \" 世界 cąbbäge \t 🐢")
}

Quoted by %q specifier:               "⚡ abc? ' \" 世界 cąbbäge \t 🐢"
Quoted (ASCII-only) by %+q specifier: "\u26a1 abc? ' \" \u4e16\u754c c\u0105bb\u00e4ge \t \U0001f422"
Quoted by %#q specifier:              `⚡ abc? ' " 世界 cąbbäge 	 🐢`

It might not be a blocker for this PR though, let's see how the fix goes. I'd be very happy to propose or implement it.

Nov 03 '22 07:11 hryx

CI error after formatting All The Code:

[ 98%] Built target zig1
[ 99%] Building stage2 object /Users/runner/work/1/s/build/zig2.o
./src/Zir.zig:2370:36: error: expected token ')', found '.'
                .ty = Type.initTag(.undefined),
                                   ^

My quick guess is there's a discrepancy in the stage1 and self-hosted tokenizers. I'll browse the C++, hopefully it's narrowly-scoped.

Nov 04 '22 09:11 hryx

My quick guess is there's a discrepancy in the stage1 and self-hosted tokenizers. I'll browse the C++, hopefully it's narrowly-scoped.

To be pedantic, the difference is in the parser not the tokenizer. In the self hosted compiler true, false, null, and undefined are primitives while in the C++ compiler they are still keywords. This was changed for stage2 in 05cf44933d753f7a5a53ab289ea60fd43761de57. I'm not sure what the least painful way to work around that discrepancy for the purposes of this PR would be, perhaps backporting that change to the C++ codebase.

Nov 04 '22 12:11 ifreund

Ok, we had a chat and made some design conclusions and have a clear path to simplify/finish this PR.

Spec changes:

The Zig language should not depend on the Unicode specification. Only ASCII characters will be normalized in identifiers. Tossing Unicode out of the equation means the formatter doesn't have to worry about invalid codepoints or invalid UTF-8 sequences. @"🐢" and @"\u{01f422}" will have to coexist.
We'll still parse \x and \u escapes. If parsing fails, render verbatim. If not ASCII, render verbatim. Otherwise, render with fmtEscapes so the result is a valid string literal.
No \u -> \x simplification beyond the process described above. I was conflating codepoints with bytes, which in UTF-8 are not equal above 7F.

Path to merge:

Back-port the keyword change to C++.
Implement above behavior changes.
Get CI passing.
Refine code style and reorganize stuff, including and stdlib changes.
Clean up commits and merge. :beers:

Other notes:

I uncovered that the tokenizer accepts nonsense escapes like \m, deferring the error to a later parsing stage. To keep the PR scoped, it's fine to keep that behavior in the new render code too.

Nov 05 '22 00:11 hryx

(Edited out irrelevant updates)

The render logic should be up to snuff now, just awaiting CI results.

Previously, translate-c was was creating string/character literal AST nodes with identifier tokens, a conflict which was causing a new render assertion to fail. I could not figure out if/why this was intentional, but changing it got translate-c tests to pass.

Nov 07 '22 10:11 hryx

@ifreund This is now up to spec and ready for review! Here are some notes to keep in mind:

I left the commit history a mess until you are satisfied with the changes themselves, but once you are, I will clean it up to something like this:
1. Backport keyword change to stage1
2. All functional changes
3. Run zig fmt on the whole codebase
I replaced parseAppend with parseWrite in std as discussed, and I like it, although it's not actually necessary for this PR anymore. Happy to remove the change or split it into its own commit if you prefer.
See the later commit messages for a couple of unrelated side effects of the PR, namely https://github.com/ziglang/zig/pull/13343/commits/ead001200d8f5a3448b141de3bd246dc900ea3a2 (bug fix, you already know this one) and https://github.com/ziglang/zig/pull/13343/commits/a9a25aaa29fa13aba3c19105a74b7542b603a3ef.
I'm not sure if I have properly tested the C++ change. I've been relying on CI and a zig-bootstrap'd stage3 (master) to test these updates. Let me know if you have concerns or tips on verifying that portion.

Nov 09 '22 05:11 hryx

Rebased on master and cleaned up. I'll keep this in a perpetual review-ready state from this point.

If you want to refer to any of the earlier commits that are relevant to previous discussion, I pushed the original branch to https://github.com/ziglang/zig/compare/master...hryx:zig:canonical-symbols-scratch

Nov 15 '22 05:11 hryx