tree-sitter-c
Missing printf specifiers in grammar
printf specifiers should be their own grammar type as a child of the string literal grammar.
For example, Vim has a cFormat highlight group that can be changed independently of cString.
I think something like this would work:
string_literal: $ => seq(
  choice('L"', 'u"', 'U"', 'u8"', '"'),
  repeat(choice(
    token.immediate(prec(1, /[^\\"\n]+/)),
    $.escape_sequence,
    $.type_specifier // new
  )),
  '"',
),
// snip
type_specifier: $ => token(prec(1, seq(
  '%',
  choice(
    /[csdioxXfFeEaAgGnp%]/,
    /l[cs]/,
    /(hh?|ll?|j|z|t)[dioxXn]/,
    /(l|L)[fFeEaAgG]/
  )
))),
A table of valid type specifiers can be found here. This is untested beyond checking the regexes in Node, but if the issue goes stale I'll figure it out and open a PR.
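For anyone who wants to repeat that kind of check, here is a minimal sketch in plain Node. It joins the four alternatives from the proposed rule into one anchored regular expression and runs a few sample specifiers through it. The `specifier` name and the sample list are illustrative assumptions, and this only exercises the regexes, not tree-sitter itself.

```javascript
// Alternatives copied from the proposed type_specifier rule.
const alternatives = [
  /[csdioxXfFeEaAgGnp%]/,
  /l[cs]/,
  /(hh?|ll?|j|z|t)[dioxXn]/,
  /(l|L)[fFeEaAgG]/,
];

// Hypothetical helper: one anchored regex that mirrors seq('%', choice(...)).
const specifier = new RegExp(
  '^%(?:' + alternatives.map((re) => re.source).join('|') + ')$'
);

// Sample inputs, chosen for illustration only.
const samples = ['%d', '%lld', '%hhx', '%Lf', '%ls', '%%', '%u', '%5d'];
for (const s of samples) {
  console.log(s, specifier.test(s));
}
```

With these samples the last two, `%u` and `%5d`, are rejected: the character classes have no `u`, and there is no provision for flags, width, or precision.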
I don't think this should be done for every string literal. The specifier format isn't a property of string literals as a whole in C. It's just a convention used by the printf family of functions in libc.
I think it'd be better to do this using language injection, or some mechanism that can layer additional syntax highlighting on top of the C syntax tree.
I'd go further. I think the language shouldn't even try to match escape sequences. The literal parsing should be done elsewhere (or with some future extension to tree-sitter).
I've found that defining string literals with the code below is more robust: tree-sitter is able to recover from incomplete code without highlighting everything as an error.
string_literal: $ => choice(
  token(/"([^"\\\n]|\\.)*"/), // 1
  seq(token(/"([^"\\\n]|\\.)*/), token.immediate('"')), // 2
),
Alternative (2) is never actually matched (because of how precedence works in tree-sitter). But it lets tree-sitter know that a quotation mark (") is expected, so it produces the message MISSING " when the user has typed only an opening ". It also matches the empty string.
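To see what the two regexes accept on their own, here is a rough sketch in plain Node. It anchors each one and classifies a few inputs; `closed`, `open`, and `classify` are names made up for this example. It says nothing about tree-sitter's error recovery, which happens in the parser rather than in the regexes.

```javascript
// Regex from alternative 1, anchored: a complete literal with both quotes.
const closed = /^"([^"\\\n]|\\.)*"$/;
// Regex from alternative 2, anchored: an opening quote with no closing quote yet.
const open = /^"([^"\\\n]|\\.)*$/;

// Hypothetical helper that reports which of the two forms an input matches.
function classify(text) {
  if (closed.test(text)) return 'complete';
  if (open.test(text)) return 'unterminated';
  return 'no match';
}

for (const text of ['""', '"a\\"b"', '"abc', '"abc\ndef"']) {
  console.log(JSON.stringify(text), classify(text));
}
```

An input with a raw newline inside the quotes matches neither form, since both the character class and `.` exclude newlines.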
@lilibyte The actual syntax is more complex than that.
And the standard also defines a format string for strftime(). So it gets messy having a single rule for all conversion specifiers.
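As a rough illustration of the extra pieces involved, a printf conversion specification has the general shape `%[flags][width][.precision][length]conversion`. The sketch below captures that shape with a JavaScript regex using named groups. `CONVERSION` and `findSpecifiers` are hypothetical names, and the pattern is deliberately permissive: it does not check that a length modifier is valid for a given conversion, and it does not cover strftime(), whose format syntax is different.

```javascript
// Sketch of the general shape: %[flags][width][.precision][length]conversion
const CONVERSION = new RegExp(
  '%' +
  '(?<flags>[-+ #0]*)' +
  '(?<width>\\*|\\d+)?' +
  '(?:\\.(?<precision>\\*|\\d*))?' +
  '(?<length>hh|h|ll|l|j|z|t|L)?' +
  '(?<conversion>[diouxXfFeEgGaAcspn%])',
  'g'
);

// Hypothetical helper: list each conversion specification with its offset
// and the named parts that were captured.
function findSpecifiers(format) {
  return Array.from(format.matchAll(CONVERSION), (m) => ({
    text: m[0],
    index: m.index,
    ...m.groups,
  }));
}

console.log(findSpecifiers('%-08.3lf and %zu%%').map((s) => s.text));
```

Because each match carries its offset, a highlighter built this way could mark just the specifier ranges inside a string and leave the rest alone.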
Decided to give it a shot: https://github.com/alemuller/tree-sitter-printf
The nice thing about having a standalone parser is that it precisely highlights the errors in the specifiers.