antlr4
Numbers and strings cannot be recognized correctly
I have a grammar file defined as follows:
grammar CusJson;
DOT : '.' ;
QUOTA : '"' ;
prog : value (',' value)* EOF;
value : number | string;
number : INTEGER (DOT DIGIT+)? EXP?;
INTEGER : '-'? INT;
fragment INT : '0' | [1-9] DIGIT*;
EXP : [Ee] [+\-]? DIGIT+ ;
DIGIT : [0-9] ;
// When processing strings, I want to recognize ESC sequences separately
// and discard the quotes directly instead of using 'text.substring(1, text.length() - 1)'.
// If 'string' were a lexer rule, it would not meet these needs.
string : QUOTA (ESC | STRING_SEQ)* QUOTA;
ESC : '\\' (["\\/bfnrt] | UNICODE) ;
STRING_SEQ : SAFECODEPOINT+;
fragment UNICODE : 'u' HEX HEX HEX HEX;
fragment HEX : [0-9a-fA-F];
fragment SAFECODEPOINT : ~["\\\u0000-\u001F];
WS : [ \t\n\r]+ -> skip;
Then for the following text:
"1","s d",567,",,haha"
The 1 inside "1" is identified as INTEGER, and 567 is identified as STRING_SEQ.
The lexer has no idea it is inside a string: '1' is lexed as INTEGER (it also matches DIGIT, but INTEGER is declared first), not as a SAFECODEPOINT, hence things start going south very rapidly. You would probably have to add INTEGER, DOT and DIGIT to your string rule. Then your STRING_SEQ wins over ',' because it can consume more characters. The culprit is having string as a parser rule rather than a lexer rule. Look at the examples in the grammars repo.
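To make the token selection concrete, here is a rough Java emulation of the two rules an ANTLR lexer applies: the longest match wins, and on a tie the rule declared first wins. It is only a sketch. The regular expressions are hand-written stand-ins for the lexer rules of CusJson, COMMA is an invented name for the implicit ',' token (assumed to rank ahead of the named rules), and the class and method names are hypothetical. It does not call ANTLR at all.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Emulation only: regexes stand in for the lexer rules of the CusJson grammar.
final class MaximalMunchDemo {
    // Insertion order doubles as rule priority.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        // COMMA is a hypothetical name for the implicit ',' token; it is assumed
        // to rank ahead of the named rules, as literals from parser rules do.
        RULES.put("COMMA", Pattern.compile(","));
        RULES.put("DOT", Pattern.compile("\\."));
        RULES.put("QUOTA", Pattern.compile("\""));
        RULES.put("INTEGER", Pattern.compile("-?(?:0|[1-9][0-9]*)"));
        RULES.put("EXP", Pattern.compile("[Ee][+-]?[0-9]+"));
        RULES.put("DIGIT", Pattern.compile("[0-9]"));
        RULES.put("ESC", Pattern.compile("\\\\(?:[\"\\\\/bfnrt]|u[0-9a-fA-F]{4})"));
        RULES.put("STRING_SEQ", Pattern.compile("[^\"\\\\\\x00-\\x1F]+"));
        RULES.put("WS", Pattern.compile("[ \\t\\n\\r]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!bestRule.equals("WS")) { // WS is skipped, as in the grammar
                tokens.add(bestRule + "(" + input.substring(pos, bestEnd) + ")");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("\"1\",\"s d\",567,\",,haha\"")) {
            System.out.println(token);
        }
    }
}
```

On the sample input this emulation yields INTEGER(1) for the first string body and a single STRING_SEQ(,567,) that swallows both commas around the number, which is the behaviour reported above. The lexer never sees the parser rule `string`, so it cannot know which characters are "inside" the quotes.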
I designed 'string' as a parser rule so that ESC and STRING_SEQ can be handled separately when processed in the visitor. I have not found any other solution.
I would recommend giving up on dealing with those details in the parser, the visitor or the listener. At first glance, your string literal is a double quoted sequence of any char except the double quote itself, which can be escaped. You need a token for that. Dealing with escaped sequences within the string chars seems simple enough to do later.
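As a rough illustration of handling the escapes later, here is a minimal Java sketch. It assumes the grammar exposes the whole literal as one token, for example a hypothetical lexer rule `STRING : '"' (ESC | SAFECODEPOINT)* '"' ;` with ESC turned into a fragment, and that the visitor receives the token text with the quotes still attached. The class and method names are invented for the example.

```java
// Sketch: decode the raw text of a (hypothetical) STRING token in one pass.
final class JsonStringDecoder {
    private JsonStringDecoder() {}

    /** Strips the surrounding quotes and resolves the escapes allowed by ESC. */
    static String decode(String tokenText) {
        int end = tokenText.length() - 1;
        if (end < 1 || tokenText.charAt(0) != '"' || tokenText.charAt(end) != '"') {
            throw new IllegalArgumentException("not a quoted string: " + tokenText);
        }
        StringBuilder out = new StringBuilder(end);
        for (int i = 1; i < end; i++) {
            char c = tokenText.charAt(i);
            if (c != '\\') {          // ordinary character, copy as is
                out.append(c);
                continue;
            }
            if (++i >= end) {
                throw new IllegalArgumentException("dangling backslash");
            }
            char e = tokenText.charAt(i);
            switch (e) {
                case '"': case '\\': case '/': out.append(e); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'n': out.append('\n'); break;
                case 'r': out.append('\r'); break;
                case 't': out.append('\t'); break;
                case 'u':
                    // Four hex digits follow; the lexer rule already checked them.
                    if (i + 4 >= end) {
                        throw new IllegalArgumentException("truncated unicode escape");
                    }
                    out.append((char) Integer.parseInt(tokenText.substring(i + 1, i + 5), 16));
                    i += 4;
                    break;
                default:
                    throw new IllegalArgumentException("unknown escape: " + e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw token text as the lexer would deliver it, escapes still encoded.
        System.out.println(decode("\"s d\\t\\u0041\""));
    }
}
```

In a visitor this would be called on something like the token's text, so the quote stripping and the escape handling live in one small helper instead of being spread over parser rules.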
Thanks for the suggestion, @ericvergnaud. I'm going to look through the grammars repo for inspiration, and if I can't find a good approach there I'll have to fall back to a lexer rule.