antlr4 Numbers and strings cannot be recognized correctly

I have a grammar file defined as follows:

grammar CusJson;

DOT                 : '.' ;
QUOTA               : '"' ;

prog                : value (',' value)* EOF;

value               : number | string;

number              : INTEGER (DOT DIGIT+)? EXP?;

INTEGER             : '-'? INT;
fragment INT        : '0' | [1-9] DIGIT*;
EXP                 : [Ee] [+\-]? DIGIT+ ;
DIGIT               : [0-9] ;

// When processing strings, I want to recognize ESC sequences separately
// and discard the quotes directly instead of using 'text.substring(1, text.length() - 1)'.
// If 'string' were a lexer rule instead of a parser rule, it would not meet these needs.
string              : QUOTA (ESC | STRING_SEQ)* QUOTA;

ESC                 : '\\' (["\\/bfnrt] | UNICODE) ;
STRING_SEQ          : SAFECODEPOINT+;
fragment UNICODE    : 'u' HEX HEX HEX HEX;
fragment HEX        : [0-9a-fA-F];
fragment SAFECODEPOINT      : ~["\\\u0000-\u001F];

WS                  : [ \t\n\r]+ -> skip;

Then for the following text:

"1","s d",567,",，haha"

The 1 inside "1" is recognized as INTEGER, and 567 is recognized as STRING_SEQ.

Dec 21 '23 07:12 GeTOUO

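The '1' is lexed as a number token, not as a STRING_SEQ, hence things start going south very rapidly. You would probably need to add DOT and DIGIT (and the other number tokens) to your string rule. After that, STRING_SEQ wins over ',' because it can consume more characters: the lexer always prefers the longest match, so 567 plus the following comma becomes a single STRING_SEQ. The culprit is having string as a parser rule rather than a lexer rule. Look at the examples in the grammars-v4 repository.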
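A minimal sketch of the lexer-rule approach, modeled on the JSON grammar in the grammars-v4 repository and not tested against the original project. The token names NUMBER and STRING are assumptions; the fragments are taken from the grammar in the question, with ESC and EXP demoted to fragments so they can no longer be emitted as standalone tokens.

```antlr
grammar CusJson;

prog    : value (',' value)* EOF ;
value   : NUMBER | STRING ;

// The whole number is one token, so its digits can never be split
// between INTEGER, DIGIT and STRING_SEQ.
NUMBER  : '-'? INT ('.' [0-9]+)? EXP? ;

// The whole quoted literal is one token, including the quotes.
STRING  : '"' (ESC | SAFECODEPOINT)* '"' ;

fragment INT           : '0' | [1-9] [0-9]* ;
fragment EXP           : [Ee] [+\-]? [0-9]+ ;
fragment ESC           : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE       : 'u' HEX HEX HEX HEX ;
fragment HEX           : [0-9a-fA-F] ;
fragment SAFECODEPOINT : ~["\\\u0000-\u001F] ;

WS      : [ \t\n\r]+ -> skip ;
```

With these rules the sample input should lex as STRING, ',', STRING, ',', NUMBER, ',', STRING, because nothing other than STRING can match at a double quote and nothing other than NUMBER can match at a digit.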

Dec 21 '23 07:12 ericvergnaud

I made 'string' a parser rule so that ESC and STRING_SEQ can be handled separately in the visitor. I have not found any other solution.

Dec 21 '23 10:12 GeTOUO

I would recommend giving up on dealing with those details in the parser, the visitor or the listener. At first glance, your string literal is a double-quoted sequence of any character except the double quote itself, which can be escaped. You need a token for that. Dealing with the escape sequences inside the string seems simple enough to do later.
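As a rough illustration of handling the escapes "later", here is a sketch in Java of a helper that takes the text of a whole quoted token, drops the quotes, and decodes the escapes. The class name JsonStringUnescaper and the method unescape are hypothetical, and the sketch assumes the lexer has already validated the literal (so every backslash is followed by one of the allowed escape characters).

```java
final class JsonStringUnescaper {
    private JsonStringUnescaper() {}

    /**
     * Hypothetical helper: strips the surrounding double quotes from the
     * token text and decodes the escape sequences the lexer accepted.
     * Assumes the text was already validated by the lexer.
     */
    static String unescape(String tokenText) {
        StringBuilder out = new StringBuilder(tokenText.length());
        int end = tokenText.length() - 1;      // index of the closing quote
        for (int i = 1; i < end; i++) {        // start after the opening quote
            char c = tokenText.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            char e = tokenText.charAt(++i);    // character after the backslash
            switch (e) {
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'n': out.append('\n'); break;
                case 'r': out.append('\r'); break;
                case 't': out.append('\t'); break;
                case 'u':
                    // four hex digits follow the 'u'
                    String hex = tokenText.substring(i + 1, i + 5);
                    out.append((char) Integer.parseInt(hex, 16));
                    i += 4;
                    break;
                default:
                    // quote, backslash or slash: keep the character itself
                    out.append(e);
            }
        }
        return out.toString();
    }
}
```

In a visitor this could be called on the token text, for example `JsonStringUnescaper.unescape(ctx.STRING().getText())`, assuming the grammar names the token STRING.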

Dec 21 '23 12:12 ericvergnaud

Thanks for the suggestion, @ericvergnaud. I'm going to look through the grammars repository for inspiration, and if there's no good alternative I'll have to go with a lexer rule.

Dec 22 '23 01:12 GeTOUO