Lexer outputs a confusing remainder when the terminal NUMBER prohibits a trailing dot
version: v0.4.12
According to the JSON specification (RFC 8259), a trailing dot is prohibited in floating-point numbers. However, when I define a new JSON grammar as shown below, the lexer behaves suspiciously.
// RFC 8259 without complex STRING definition
?start: value
?value: object
      | array
      | STRING
      | NUMBER
      | "true"  -> true
      | "false" -> false
      | "null"  -> null
object: "{" [member ("," member)*] "}"
member: STRING ":" value
array : "[" [value ("," value)*] "]"
NUMBER: MINUS? INT FRAC? EXP?
MINUS: "-"
INT: "0" | ("1".."9") DIGIT*
DIGIT: "0".."9"
FRAC: "." DIGIT+
EXP: ("e"|"E") ["+"|"-"] DIGIT+
STRING: /\"[^"]*\"/
WS: /[ \t\f\r\n]/+
%ignore WS
The observed behavior:
>>> grammar_engine._parse_partial_code(0, '{ "cap": 10.0', b'', accepted_generation=True)
(remainder : b'10.0', remainder_state: RemainderState.MAYBE_COMPLETE, accept_sequences: {accept_terminals: ['NUMBER', 'COMMA'], accept_terminals: ['NUMBER', 'WS', 'COMMA'], accept_terminals: ['LBRACE'], accept_terminals: ['WS'], accept_terminals: ['NULL'], accept_terminals: ['STRING'], accept_terminals: ['NUMBER', 'WS', 'RBRACE'], accept_terminals: ['NUMBER', 'RBRACE'], accept_terminals: ['TRUE'], accept_terminals: ['FALSE'], accept_terminals: ['LSQB']}, next_ac_indents: None, False)
# ↑ This looks correct.
>>> grammar_engine._parse_partial_code(0, '{ "cap": 10.', b'', accepted_generation=True)
(remainder : b'.', remainder_state: RemainderState.INCOMPLETE, accept_sequences: {accept_terminals: ['COMMA'], accept_terminals: ['WS'], accept_terminals: ['RBRACE']}, next_ac_indents: None, False)
# ↑ Shouldn't this remainder be '10.'?
It seems that the lexer gets confused when its state moves from an accepting state (digits) to a live but non-accepting state (trailing dot) and back to an accepting state (digits).
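To make the accepting / live distinction concrete, here is a minimal hand-written DFA for the NUMBER terminal from the grammar above. It is only a sketch for illustration: the state names and the `step` / `classify` helpers are my own and are not SynCode's actual lexer code.

```python
DIGITS = "0123456789"
ACCEPTING = {"zero", "int", "frac", "exp"}  # states where NUMBER is complete


def step(state, ch):
    """Return the next state for NUMBER, or None if there is no transition."""
    if state == "start" and ch == "-":
        return "minus"
    if state in ("start", "minus"):
        if ch == "0":
            return "zero"
        if ch in "123456789":
            return "int"
        return None
    if state in ("zero", "int"):
        if state == "int" and ch in DIGITS:
            return "int"
        if ch == ".":
            return "dot"   # live: FRAC needs at least one digit after "."
        if ch in "eE":
            return "e"
        return None
    if state in ("dot", "frac"):
        if ch in DIGITS:
            return "frac"
        if state == "frac" and ch in "eE":
            return "e"
        return None
    if state == "e" and ch in "+-":
        return "esign"
    if state in ("e", "esign", "exp"):
        return "exp" if ch in DIGITS else None
    return None


def classify(text):
    """Classify text as 'accepted', 'live' (valid prefix only) or 'dead'."""
    state = "start"
    for ch in text:
        state = step(state, ch)
        if state is None:
            return "dead"
    return "accepted" if state in ACCEPTING else "live"


for sample in ["10", "10.", "10.0"]:
    print(repr(sample), classify(sample))
# '10' accepted
# '10.' live
# '10.0' accepted
```

Under this model, '10.' is still a valid prefix of NUMBER, which is why I would expect the whole '10.' to be reported as the remainder rather than just '.'.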
Can you try replacing NUMBER with number?
replacing NUMBER with number
I see. But that allows numbers with whitespace inside, such as "10 .0". I'm considering alternative rules for WS.
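A rough way to see the difference, using plain Python regexes rather than the real parser: as a terminal, NUMBER is one pattern with nothing between its parts, while as a rule, each part is a separate token and ignored whitespace may sit between them. The two regexes below are my own approximation of those two cases, not something generated by Lark or SynCode.

```python
import re

WS = r"[ \t\f\r\n]+"
INT = r"(?:0|[1-9][0-9]*)"
FRAC = r"\.[0-9]+"
EXP = r"[eE][+-]?[0-9]+"

# NUMBER as a single terminal: the parts are directly concatenated.
NUMBER_TERMINAL = re.compile(rf"-?{INT}(?:{FRAC})?(?:{EXP})?")

# number as a parser rule with "%ignore WS": whitespace may be skipped
# between any two of the tokens MINUS, INT, FRAC and EXP.
NUMBER_RULE = re.compile(
    rf"(?:-(?:{WS})?)?{INT}(?:(?:{WS})?{FRAC})?(?:(?:{WS})?{EXP})?"
)

for text in ["10.0", "10 .0"]:
    print(
        repr(text),
        NUMBER_TERMINAL.fullmatch(text) is not None,
        NUMBER_RULE.fullmatch(text) is not None,
    )
# '10.0' True True
# '10 .0' False True
```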
@shubhamugare Thank you for your suggestion. My workaround is below:
// Based on RFC 8259 without complex STRING definition
?start: value
_BEGIN_ARRAY: /[ \t\f\r\n]*\[[ \t\f\r\n]*/
_BEGIN_OBJECT: /[ \t\f\r\n]*\{[ \t\f\r\n]*/
_END_ARRAY: /[ \t\f\r\n]*\][ \t\f\r\n]*/
_END_OBJECT: /[ \t\f\r\n]*\}[ \t\f\r\n]*/
_NAME_SEPARATOR: /[ \t\f\r\n]*:[ \t\f\r\n]*/
_VALUE_SEPARATOR: /[ \t\f\r\n]*,[ \t\f\r\n]*/
?value: object
      | array
      | STRING
      | number
      | "true"  -> true
      | "false" -> false
      | "null"  -> null
object: _BEGIN_OBJECT [member (_VALUE_SEPARATOR member)*] _END_OBJECT
member: STRING _NAME_SEPARATOR value
array : _BEGIN_ARRAY [value (_VALUE_SEPARATOR value)*] _END_ARRAY
number: MINUS? INT FRAC? EXP?
MINUS: "-"
INT: "0" | ("1".."9") DIGIT*
DIGIT: "0".."9"
FRAC: "." DIGIT+
EXP: ("e"|"E") ["+"|"-"] DIGIT+
STRING: /\"[^"]*\"/
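The idea of the workaround is that whitespace is no longer ignored globally. Instead, each structural terminal absorbs the whitespace around it. As a quick sanity check outside the parser, the snippet below copies the _VALUE_SEPARATOR pattern into a Python regex (the variable names are only for this illustration):

```python
import re

WS = r"[ \t\f\r\n]*"
# Same pattern as _VALUE_SEPARATOR in the grammar above.
VALUE_SEPARATOR = re.compile(WS + "," + WS)

# Whitespace around the comma is consumed as part of the separator token.
print(VALUE_SEPARATOR.fullmatch(" ,\n ") is not None)  # True
print(VALUE_SEPARATOR.split("1 , 2,3"))                # ['1', '2', '3']

# Whitespace that is not next to a structural character is not consumed,
# so "10 .0" stays as one piece and is not accepted as a number.
print(VALUE_SEPARATOR.split("10 .0"))                  # ['10 .0']
```

One consequence, as far as I can tell from the grammar as written: since there is no %ignore WS anymore, whitespace around a bare top-level value (for example " 10") is not accepted either.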