syncode icon indicating copy to clipboard operation
syncode copied to clipboard

Invalid json produced with JSON Grammar

Open HJJ256 opened this issue 9 months ago • 11 comments

When testing the JSON grammer with Gemma3-12B-IT, the model produces a json which ends like this "] \xa0}". The entire json structure is

{
    "key": [
        {"inner_key":"value"},
        {"inner_key":"value"}
    ] \xa0}

Using logits processor in "grammar_strict" mode.

HJJ256 avatar Mar 20 '25 21:03 HJJ256

I'll see if I can reproduce this on the JSON eval dataset. Can you also try grammar mask mode?

shubhamugare avatar Mar 20 '25 22:03 shubhamugare

I'll see if I can reproduce this on the JSON eval dataset. Can you also try grammar mask mode?

It worked with "grammar_mask", I thought grammar_strict was supposed to be stricter 🤔 Qwen2.5 was misbehaving with grammar mask by outputting a character {`

HJJ256 avatar Mar 20 '25 22:03 HJJ256

Update: On closer inspection I found that the problem occurs even before the \xa0 symbol as the model output is something like this

{ “key”: [ {“inner_key”:“value”}, {“inner_key”:“value”} ] \xa0}

It is outputting curly quotes instead of ASCII quotes

HJJ256 avatar Mar 20 '25 22:03 HJJ256

remainder : b'"key\xe2\x80\x9d: [ {\xe2\x80\x9ckey\xe2\x80\x9d: ', remainder_state: RemainderState.INCOMPLETE, accept_sequences: {accept_terminals: ['NONEMPTY_STRING'], accept_terminals: ['RBRACE'], accept_terminals: ['WS']}, next_ac_indents: None

Apparently once I get the first "key", it starts outputting the curly quote and the grammar parser is confused because it thinks the non empty string has not ended 😆. This needs to be handled in the json grammar but I am not sure how .lark files handle unicode characters. The unicode characters are \xe2\x80\x9d which is same as \u201D and \xe2\x80\x9c which is same as \u201C

HJJ256 avatar Mar 21 '25 01:03 HJJ256

Thanks for the detail!

According to the official JSON specification (RFC 8259 and ECMA-404), JSON strings must use standard ASCII double quotes (U+0022) only. Unicode curly quotes (U+201C and U+201D) are not valid in JSON so we should disallow those.

The main difference between the two modes is the following. Which one is more useful depends on the task. In general, I observed that grammar_mask works better for JSON.

1. `grammar_mask` (Conservative/Overapproximation): 
       This mode is more permissive and overapproximates the set of acceptable tokens.
       It allows a wider range of tokens that might be syntactically valid given the
       limited lookahead of the parser. This mode preserves more of the LLM's original 
       token distribution while still enforcing basic syntactic correctness.
       
2. `grammar_strict` (Strict/Underapproximation):
       This mode is stricter and underapproximates the set of acceptable tokens.
       It enforces tighter grammatical constraints and may be more invasive in the
       LLM's generation process. It sometimes breaks LLM tokens that would have been
       syntactically correct when considered as a whole, potentially affecting the
       accuracy of generation.

Ideally, the mask store should not allow curly quotes after {“key. Can you also check what tokens are generated after this? Since each token can be a single bit it may have been one to three tokens to generate the curly quote. I want to see exactly what the prefix was when the error was made.

shubhamugare avatar Mar 21 '25 04:03 shubhamugare

I observed the LLM while it was generating the output, it generated all 3 bytes together, ie the entire curly quote. I also checked the tokenizer.json for Gemma3 and it does have the curly quote in its vocab, it does not have the individual bits as tokens (thankfully). We just need to figure out how we can restrict generation from the grammar, one way can be to add the character directly to .lark file in definition of NONEMPTY_STRING.

I also observed the entire generation in case of my prompt. Basically the model generates the rest of the json output and that too correctly, with the only exception being that it uses curly quotes and the non breaking space. The issue lies with the grammar parser thinking that the LLM is still generating the first nonempty_string (ie the string value of the first key).

HJJ256 avatar Mar 21 '25 06:03 HJJ256

Hi @HJJ256,

NONEMPTY_STRING: /\"[^"”“]+\"/ should work as a restriction to disallow using unicode curly quotes in the NONEMPTY_STRING after PR #172.

I'm hesitant about modifying the main syncode JSON grammar to enforce this restriction since it deviates from the JSON specification. For example:

{"hello ”shubham” here": "value"}

This is valid JSON according to the specs, and a grammar with /"[^"""]+"/ would reject it. I'm not sure how prevalent curly quotes are within JSON strings or how frequently models make this type of error. What have you observed regarding this issue, are these mistakes common enough to justify deviating from the standard specification?

shubhamugare avatar Mar 21 '25 18:03 shubhamugare

Hi @HJJ256,

NONEMPTY_STRING: /\"[^"”“]+\"/ should work as a restriction to disallow using unicode curly quotes in the NONEMPTY_STRING after PR #172.

I'm hesitant about modifying the main syncode JSON grammar to enforce this restriction since it deviates from the JSON specification. For example:

{"hello ”shubham” here": "value"}

This is valid JSON according to the specs, and a grammar with /"[^"""]+"/ would reject it. I'm not sure how prevalent curly quotes are within JSON strings or how frequently models make this type of error. What have you observed regarding this issue, are these mistakes common enough to justify deviating from the standard specification?

I am not sure about the commonality either since I have only encountered this case with Gemma3-12B-IT until now. I guess for the time being, we can keep the grammar generic and observe more models, and advise in the documentation, to use a user-defined grammar with these changes if necessary.

HJJ256 avatar Mar 21 '25 19:03 HJJ256

Makes sense. Let's keep it this way for now then.

shubhamugare avatar Mar 21 '25 19:03 shubhamugare

Hi @shubhamugare

When would this change be released?

HJJ256 avatar Mar 24 '25 17:03 HJJ256

Hi @HJJ256,

I can cut a release today.

I was also considering bumping the transformer dependency to the latest version before the next release, but that might take a few days.

shubhamugare avatar Mar 24 '25 18:03 shubhamugare