Lexer outputs confusing remainder when the terminal NUMBER prohibits a trailing dot

Open Yosshi999 opened this issue 8 months ago • 3 comments

version: v0.4.12

According to the JSON specification (RFC 8259), a trailing dot is prohibited in numbers. However, when I define a new JSON grammar like the one below, the lexer behaves suspiciously.

// RFC 8259 without complex STRING definition
?start: value

?value: object
| array
| STRING
| NUMBER
| "true"             -> true
| "false"            -> false
| "null"             -> null

object: "{" [member ("," member)*] "}"
member: STRING ":" value
array : "[" [value ("," value)*] "]"

NUMBER: MINUS? INT FRAC? EXP?
MINUS: "-"
INT: "0" | ("1".."9") DIGIT*
DIGIT: "0".."9"
FRAC: "." DIGIT+
EXP: ("e"|"E") ["+"|"-"] DIGIT+

STRING: /\"[^"]*\"/
WS: /[ \t\f\r\n]/+

%ignore WS

The observed behavior:

>>> grammar_engine._parse_partial_code(0, '{ "cap": 10.0', b'', accepted_generation=True)
(remainder : b'10.0', remainder_state: RemainderState.MAYBE_COMPLETE, accept_sequences: {accept_terminals: ['NUMBER', 'COMMA'], accept_terminals: ['NUMBER', 'WS', 'COMMA'], accept_terminals: ['LBRACE'], accept_terminals: ['WS'], accept_terminals: ['NULL'], accept_terminals: ['STRING'], accept_terminals: ['NUMBER', 'WS', 'RBRACE'], accept_terminals: ['NUMBER', 'RBRACE'], accept_terminals: ['TRUE'], accept_terminals: ['FALSE'], accept_terminals: ['LSQB']}, next_ac_indents: None, False)

# ↑ This looks correct.

>>> grammar_engine._parse_partial_code(0, '{ "cap": 10.', b'', accepted_generation=True)
(remainder : b'.', remainder_state: RemainderState.INCOMPLETE, accept_sequences: {accept_terminals: ['COMMA'], accept_terminals: ['WS'], accept_terminals: ['RBRACE']}, next_ac_indents: None, False)

# ↑ Shouldn't this remainder be b'10.'?

It seems that the lexer gets confused when its state moves along the path accepting (digits) -> live but non-accepting (trailing dot) -> accepting (digits).
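
The suspected state sequence can be reproduced with a small hand-written DFA for NUMBER. This is only an illustrative sketch: the state names and the helpers `step`, `classify`, and `longest_accepted_prefix` are hypothetical and are not taken from SynCode's lexer. It shows that '10.' is a live (extendable) prefix of NUMBER, while a lexer that falls back to the last accepting state would emit '10' and leave '.' behind, which matches the remainder observed above.

```python
DIGITS = "0123456789"
ACCEPTING = {"ZERO", "INT", "FRAC", "EXP_DIGITS"}

def step(state, ch):
    """Next DFA state for NUMBER: MINUS? INT FRAC? EXP?, or None if dead."""
    if state == "START" and ch == "-":
        return "MINUS"
    if state in ("START", "MINUS"):
        if ch == "0":
            return "ZERO"
        return "INT" if ch in "123456789" else None
    if state in ("ZERO", "INT"):
        if state == "INT" and ch in DIGITS:
            return "INT"
        if ch == ".":
            return "DOT"  # live but not accepting: FRAC still needs a digit
        return "EXP" if ch in "eE" else None
    if state in ("DOT", "FRAC"):
        if ch in DIGITS:
            return "FRAC"
        return "EXP" if state == "FRAC" and ch in "eE" else None
    if state == "EXP":
        if ch in "+-":
            return "EXP_SIGN"
        return "EXP_DIGITS" if ch in DIGITS else None
    if state in ("EXP_SIGN", "EXP_DIGITS"):
        return "EXP_DIGITS" if ch in DIGITS else None
    return None

def classify(text):
    """Return 'accepted', 'live' (valid prefix only), or 'dead'."""
    state = "START"
    for ch in text:
        state = step(state, ch)
        if state is None:
            return "dead"
    return "accepted" if state in ACCEPTING else "live"

def longest_accepted_prefix(text):
    """Length of the longest prefix that ends in an accepting state."""
    state, last = "START", 0
    for i, ch in enumerate(text, start=1):
        state = step(state, ch)
        if state is None:
            break
        if state in ACCEPTING:
            last = i
    return last

for s in ("10", "10.", "10.0"):
    cut = longest_accepted_prefix(s)
    print(repr(s), classify(s), "token:", repr(s[:cut]), "left over:", repr(s[cut:]))
```

Under this model, an incremental lexer would need to report the whole live prefix '10.' as the incomplete remainder instead of splitting at the last accepting state.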

Yosshi999 avatar Apr 23 '25 07:04 Yosshi999

Can you try replacing NUMBER with number?

shubhamugare avatar Apr 23 '25 22:04 shubhamugare

replacing NUMBER with number

I see. But that also accepts numbers with whitespace inside, such as "10 .0". I'm considering alternative rules for WS.
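
The difference can be approximated with two regular expressions. This is a rough sketch, not what Lark or SynCode actually builds: `NUMBER_TERMINAL` and `NUMBER_RULE` are hypothetical names, and the second pattern only mimics the effect of `%ignore WS` by allowing whitespace between the terminals MINUS, INT, FRAC, and EXP.

```python
import re

# NUMBER as one terminal: the whole number is a single token,
# so no ignored whitespace can appear inside it.
NUMBER_TERMINAL = re.compile(
    r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?"
)

# number as a parser rule over MINUS, INT, FRAC, EXP: with `%ignore WS`,
# whitespace between those terminals is skipped (approximated here).
WS = r"[ \t\f\r\n]*"
NUMBER_RULE = re.compile(
    r"(?:-" + WS + r")?(?:0|[1-9][0-9]*)"
    r"(?:" + WS + r"\.[0-9]+)?"
    r"(?:" + WS + r"[eE][+-]?[0-9]+)?"
)

for text in ("10.0", "10 .0", "10."):
    print(
        repr(text),
        "terminal:", NUMBER_TERMINAL.fullmatch(text) is not None,
        "rule:", NUMBER_RULE.fullmatch(text) is not None,
    )
```

Both forms reject the trailing dot in '10.', but only the rule form lets '10 .0' through.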

Yosshi999 avatar Apr 24 '25 01:04 Yosshi999

@shubhamugare Thank you for your suggestion. My workaround is below:

// Based on RFC 8259 without complex STRING definition
?start: value
    
_BEGIN_ARRAY:     /[ \t\f\r\n]*\[[ \t\f\r\n]*/
_BEGIN_OBJECT:    /[ \t\f\r\n]*\{[ \t\f\r\n]*/
_END_ARRAY:       /[ \t\f\r\n]*\][ \t\f\r\n]*/
_END_OBJECT:      /[ \t\f\r\n]*\}[ \t\f\r\n]*/
_NAME_SEPARATOR:  /[ \t\f\r\n]*:[ \t\f\r\n]*/
_VALUE_SEPARATOR: /[ \t\f\r\n]*,[ \t\f\r\n]*/
        
?value: object
| array 
| STRING
| number
| "true"             -> true
| "false"            -> false
| "null"             -> null
        
object: _BEGIN_OBJECT [member (_VALUE_SEPARATOR member)*] _END_OBJECT
member: STRING _NAME_SEPARATOR value
array : _BEGIN_ARRAY [value (_VALUE_SEPARATOR value)*] _END_ARRAY
    
number: MINUS? INT FRAC? EXP?
MINUS: "-"
INT: "0" | ("1".."9") DIGIT*
DIGIT: "0".."9"
FRAC: "." DIGIT+
EXP: ("e"|"E") ["+"|"-"] DIGIT+
        
STRING: /\"[^"]*\"/
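
The idea behind the workaround can be checked with a toy longest-match tokenizer over the same terminal patterns. This is a sketch under assumptions: `padded`, `tokenize`, and `lexes` are hypothetical helpers, the single `KEYWORD` entry stands in for the three anonymous keyword terminals, and the matching strategy is a simplification rather than the actual Lark or SynCode lexer.

```python
import re

WS = r"[ \t\f\r\n]*"

def padded(ch):
    """Structural character with optional whitespace on both sides."""
    return WS + re.escape(ch) + WS

TERMINALS = [
    ("_BEGIN_ARRAY", padded("[")),
    ("_BEGIN_OBJECT", padded("{")),
    ("_END_ARRAY", padded("]")),
    ("_END_OBJECT", padded("}")),
    ("_NAME_SEPARATOR", padded(":")),
    ("_VALUE_SEPARATOR", padded(",")),
    ("MINUS", r"-"),
    ("INT", r"0|[1-9][0-9]*"),
    ("FRAC", r"\.[0-9]+"),
    ("EXP", r"[eE][+-]?[0-9]+"),
    ("STRING", r'"[^"]*"'),
    ("KEYWORD", r"true|false|null"),  # simplification of the three keywords
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TERMINALS]

def tokenize(text):
    """Split text into (terminal, lexeme) pairs using longest match."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no terminal matches at offset {pos}: {text[pos:]!r}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

def lexes(text):
    """True if the whole text can be split into terminals."""
    try:
        tokenize(text)
        return True
    except ValueError:
        return False

print(tokenize("[ 1 , 2.5 ]"))
print(lexes("[10 .0]"))
```

Because whitespace is only consumed next to structural characters, the space in '[10 .0]' has no terminal to belong to and lexing fails. In this sketch the same applies to whitespace around a bare top-level value such as ' 10'.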

Yosshi999 avatar Apr 24 '25 07:04 Yosshi999