tree-sitter-c

Missing printf specifiers in grammar

Open lilibyte opened this issue 3 years ago • 3 comments

printf format specifiers should be their own node type in the grammar, as a child of the string literal node.

For example, Vim has a cFormat highlight group that can be changed independently of cString.

lilibyte avatar Aug 25 '22 21:08 lilibyte

I think something like this would work:

string_literal: $ => seq(
  choice('L"', 'u"', 'U"', 'u8"', '"'),
  repeat(choice(
    token.immediate(prec(1, /[^\\"\n]+/)),
    $.escape_sequence,
    $.type_specifier  // new 
  )),
  '"',
),

// snip

type_specifier: $ => token(prec(1, seq(
  '%',
  choice(
    /[csdioxXfFeEaAgGnp%]/,
    /l[cs]/,
    /(hh?|ll?|j|z|t)[dioxXn]/,
    /(l|L)[fFeEaAgG]/
  )
))),

A table of valid type specifiers can be found here. This is untested beyond checking the regexes in Node, but if the issue becomes stale I'll just figure it out and open a PR.
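For anyone who wants to repeat the Node check, here is a minimal sketch. It joins the four alternatives from the rule above into one anchored pattern and runs a few samples through it. The name `proposedSpecifier` and the sample list are made up for this sketch and are not part of the grammar.

```javascript
// The four alternatives from the proposed rule, joined into one anchored
// pattern. `proposedSpecifier` is a hypothetical name used only here.
const proposedSpecifier = new RegExp(
  '^%(?:' +
    [
      '[csdioxXfFeEaAgGnp%]',
      'l[cs]',
      '(?:hh?|ll?|j|z|t)[dioxXn]',
      '(?:l|L)[fFeEaAgG]',
    ].join('|') +
  ')$'
);

// Print each sample next to whether the pattern accepts it.
for (const sample of ['%d', '%lld', '%hhx', '%Lf', '%ls', '%%', '%u', '%5d', '%.2f']) {
  console.log(sample.padEnd(5), proposedSpecifier.test(sample));
}
```

With these patterns `%u`, `%5d` and `%.2f` are rejected: the character classes do not include `u`, and nothing covers a field width or a precision.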

lilibyte avatar Aug 25 '22 22:08 lilibyte

I don't think this should be done for every string literal. The specifier format isn't a property of string literals as a whole in C. It's just a convention used by the printf family of functions in libc.

I think it'd be better to do this using language injection, or some mechanism that can layer additional syntax highlighting on top of the C syntax tree.
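As a rough illustration of the layering idea (not tree-sitter's injection mechanism itself), the sketch below computes specifier spans only for literals passed to a printf-like function and leaves every other string alone. The `PRINTF_LIKE` set, the `specifierSpans` helper and the deliberately crude pattern are all assumptions made for this example.

```javascript
// Hypothetical set of callees whose format argument follows the printf
// convention. A real tool would need a longer, configurable list.
const PRINTF_LIKE = new Set(['printf', 'fprintf', 'sprintf', 'snprintf']);

// Deliberately rough pattern: '%' followed by one letter or another '%'.
const ROUGH_SPECIFIER = /%[a-zA-Z%]/g;

// Returns highlight spans for a string literal, but only when the literal is
// the format argument of a printf-like call. Other literals get no spans.
function specifierSpans(callee, literalText) {
  if (!PRINTF_LIKE.has(callee)) return [];
  return Array.from(literalText.matchAll(ROUGH_SPECIFIER), (m) => ({
    start: m.index,
    end: m.index + m[0].length,
  }));
}

console.log(specifierSpans('printf', '"%d items, %s"'));
console.log(specifierSpans('puts', '"100%d"'));
```

The second call returns an empty list even though the literal contains `%d`, which is the behavior a grammar-level rule inside `string_literal` could not provide.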

maxbrunsfeld avatar Aug 26 '22 01:08 maxbrunsfeld

> I think it'd be better to do this using language injection, or some mechanism that can layer additional syntax highlighting on top of the C syntax tree.

I'd go further: I think the language grammar shouldn't even try to match escape sequences. Literal parsing should be done elsewhere (or with some future extension to tree-sitter).

I've found that defining string literals with the code below is more robust: tree-sitter is able to recover from incomplete code without highlighting everything as an error.

string_literal: $ => choice(
  token(/"([^"\\\n]|\\.)*"/), // 1: complete literal as a single token
  seq(token(/"([^"\\\n]|\\.)*/), token.immediate('"')), // 2: unterminated literal, then the expected closing quote
),

Choice (2) is never actually matched (because of how precedence works in tree-sitter), but it lets tree-sitter know that a quotation mark (") is expected, so it produces the message MISSING " when the user has typed only ". It also matches the empty string.
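The two regexes can be compared outside tree-sitter with a small Node sketch. It only shows what each pattern consumes on complete and incomplete input; it does not reproduce tree-sitter's lexer or its error recovery. The names `complete` and `unterminated` are made up for this example, and both patterns are anchored with `^` so they can be tested standalone.

```javascript
// The two patterns from the rule above, anchored for standalone testing.
const complete = /^"([^"\\\n]|\\.)*"/;     // alternative 1: closing quote included
const unterminated = /^"([^"\\\n]|\\.)*/;  // opening pattern of alternative 2

// Show what each pattern consumes for complete and incomplete literals.
for (const input of ['"abc"', '""', '"a\\"b"', '"abc', '"']) {
  const full = complete.exec(input);
  const open = unterminated.exec(input);
  console.log(
    JSON.stringify(input),
    '| complete:', full && full[0],
    '| unterminated:', open && open[0]
  );
}
```

On a complete literal the first pattern always consumes one character more than the second, which is presumably why the first alternative wins there. On an incomplete literal only the second pattern matches, leaving the closing quote as the one thing still expected.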

@lilibyte The actual syntax is more complex than that.

The standard also defines a format string for strftime(), so it gets messy having a single rule for all conversion specifiers.
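To give an idea of the extra pieces, a printf conversion can carry flags, a field width, a precision and a length modifier before the conversion character. The sketch below is an approximation written for this comment, not the grammar used by any parser: `CONVERSION` and `parseConversions` are hypothetical names, and the pattern does not check which length modifiers are valid for which conversions.

```javascript
// Approximate shape of one printf conversion specification:
// % [flags] [width] [.precision] [length] conversion
const CONVERSION =
  /%(?<flags>[-+ #0]*)(?<width>\d+|\*)?(?:\.(?<precision>\d+|\*)?)?(?<length>hh|h|ll|l|j|z|t|L)?(?<conv>[diouxXfFeEgGaAcspn%])/g;

// Splits a format string into its conversions and their named parts.
// Parts that are absent from a conversion are left undefined.
function parseConversions(format) {
  return Array.from(format.matchAll(CONVERSION), (m) => ({
    text: m[0],
    ...m.groups,
  }));
}

console.log(parseConversions('%-08.3lf|%*d|%zu|%%'));
```

Even this leaves out strftime()-style formats entirely, which use a different set of conversion characters.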


Decided to give it a shot: https://github.com/alemuller/tree-sitter-printf

The nice thing about having a standalone parser is that it precisely highlights the errors in the specifiers.

alemuller avatar Aug 26 '22 03:08 alemuller