esprima-dotnet

Unexpected token ILLEGAL on regex \.

Open JarLob opened this issue 5 years ago • 13 comments

This code (a regex containing \.) triggers "Unexpected token ILLEGAL": var isHtml = /\.html$/;

JarLob, Sep 27 '18 16:09

Note to my future self: it works on http://esprima.org/demo/parse.html, so the fix should be easy to find by stepping through a debug session and comparing the behavior of esprima and esprima-dotnet.

sebastienros, Sep 27 '18 16:09

Forgot to mention that the error comes from the tokenizer:

 	Esprima.dll!Esprima.ErrorHandler.ThrowError(int index = 427, int line = 15, int column = 18, string message = "Unexpected token ILLEGAL") Line 37	C#	Symbols loaded.
 	Esprima.dll!Esprima.Scanner.ThrowUnexpectedToken(string message = "Unexpected token ILLEGAL") Line 173	C#	Symbols loaded.
 	Esprima.dll!Esprima.Scanner.GetComplexIdentifier() Line 568	C#	Symbols loaded.
 	Esprima.dll!Esprima.Scanner.ScanIdentifier() Line 659	C#	Symbols loaded.
 	Esprima.dll!Esprima.Scanner.Lex() Line 1679	C#	Symbols loaded.
>	Esprima.Sample.dll!Esprima.Sample.Program.Tokenize(Esprima.Scanner scanner = {Esprima.Scanner}) Line 29	C#	Symbols loaded.

JarLob, Sep 27 '18 16:09

Yes, I'm also very interested in fixing lexer errors with regexes. I described similar errors in https://github.com/sebastienros/esprima-dotnet/issues/40#issuecomment-423843683.

KvanTTT, Sep 27 '18 18:09

I can't repro this issue. I added this test and it works fine both on master and dev. Can you provide a better unit test?

        [Fact]
        public void ShouldRegularExpressionGH44()
        {
            var parser = new JavaScriptParser(@"var isHtml = /\.html$/");
            var program = parser.ParseProgram();
        }

sebastienros, Sep 29 '18 17:09

I got it in Esprima.Sample, as you can see from the call stack. This test reproduces it with the Scanner directly:

        [Fact]
        public void ShouldRegularExpressionGH44()
        {
            var scanner = new Scanner(@"var isHtml = /\.html$/");

            var   tokens = new List<Token>();
            Token token;

            do
            {
                scanner.ScanComments();
                token = scanner.Lex();
                tokens.Add(token);
            } while (token.Type != TokenType.EOF);
        }

JarLob, Sep 29 '18 17:09

@sebastienros it fails only with the Scanner, not with the parser.

KvanTTT, Sep 29 '18 17:09

I'm still having this issue with a regex that also contains \. (an escaped dot).

I cloned the dev branch and ran the unit test posted by JarLob above. It still failed with both his regex and mine.

In my case I'm hitting the problem when Jint executes a file containing the problematic regex, using Jint 3.0.0-beta-1715 with Esprima 1.0.1258.

olliejm, Jun 25 '20 11:06

Here is my particular regex:

var urlRegex = /(https?)\:\/\/[A-Za-z0-9\.\-]+(\/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*/gi;

The error occurs on this line: https://github.com/sebastienros/esprima-dotnet/blob/dev/src/Esprima/Scanner.cs#L602 when processing the first \ escape character in the line.

olliejm, Jun 25 '20 13:06

I can't repro these issues if I use the parser directly, or the ScanRegex method. I think the issue is in the Esprima.Sample source that you all seem to be following. The parser does more than just call Lex: it also performs lookaheads that help it scan the regex. Without that context, the scanner tries to scan an identifier instead.

Is your intent to actually iterate over each Token of a script, as the sample is supposed to do?

sebastienros, Jun 27 '20 22:06

We use

var urlRegex = /(https?)\:\/\/[A-Za-z0-9\.\-]+(\/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*/gi;

in a Jint script and that triggers the error in the Esprima dependency.

IntranetFactory, Jun 28 '20 06:06

Tested the above regex with the latest Jint 3 using the REPL and it worked just fine. Maybe someone should post a complete sample along with the library versions used.

lahma, Jun 29 '20 13:06

Here is the simplest failing program: /\./

This is the code from the sample project with that regex. It throws the error in the title.

        public static void Main(string[] args)
        {
            var scanner = new Scanner(@"/\./");
            Tokenize(scanner);
        }

        private static void Tokenize(Scanner scanner)
        {
            var tokens = new List<Token>();
            Token token;

            do
            {
                scanner.ScanComments();
                token = scanner.Lex();
                tokens.Add(token);
            } while (token.Type != TokenType.EOF);

            Console.WriteLine(JsonConvert.SerializeObject(tokens, Formatting.Indented));
        }

To be clear, this script /\./ does not fail in Jint. I have a 500 KB script that includes many regexes like this, so I thought this might be causing my problem. My situation is a bit stranger, because that 500 KB script normally does not fail, but it fails when I create a Release build. Maybe Jint uses the Scanner when some optimizations are enabled?

Edit: never mind. It does not happen in Jint. It only happens in the Scanner.

KurtGokhan, Jan 02 '21 21:01

Tracked this down: the root cause of the issue is the JS syntax itself, more precisely the / and /= tokens, which are ambiguous. When the scanner encounters one of them, it can't tell without context whether it begins a regexp or is a division operator. But to know the context, you need to parse the code... For more details, see the comments on this SO answer (please note that the accepted answer is wrong):

Technically, there are a couple ambiguities that are unavoidable at the lexical level. For example, (a+b)/c vs. if (x) /foo/.exec('bar') (close-paren can precede either). Also, ++ /foo/.abc and a++ / b (plus-plus can precede either). Together with -- these are the only ones I know of.

There's also a problem with }: function f() {}(newline)/1/g versus var x = {}(newline)/1/g, since the latter doesn't enforce semicolon insertion.

To sum it up, you can't (reliably) use the scanner in standalone mode when the code contains regexps. That's kind of sad, but it looks like there's no solution to this problem. What we could maybe do to mitigate it is allow the user to provide some best-effort algorithm, which would enable tokenization in some specific cases instead of failing with an error. What do you think? Would such an addition make sense?
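
To make the idea concrete, here is a rough sketch of what such a best-effort algorithm might look like: it guesses from the previous significant token whether a / starts a regex literal. SlashHeuristic and SlashStartsRegex are hypothetical names and not part of the Esprima API, and no hook for plugging this into the Scanner is assumed to exist. The ambiguous cases quoted above (a closing paren, a closing brace, ++ and --) are simply resolved as division, so inputs like if (x) /foo/.exec('bar') would still be misclassified.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the Esprima API.
public static class SlashHeuristic
{
    // Keywords after which an expression, and therefore a regex literal, can start.
    private static readonly HashSet<string> KeywordsBeforeRegex = new HashSet<string>
    {
        "return", "typeof", "instanceof", "in", "new", "delete",
        "void", "throw", "case", "do", "else"
    };

    // previousToken: source text of the last significant token,
    // or null/empty at the start of the input.
    // Returns true when a following '/' is guessed to start a regex literal.
    public static bool SlashStartsRegex(string previousToken)
    {
        // Nothing precedes the slash, so it cannot be a division.
        if (string.IsNullOrEmpty(previousToken)) return true;

        // Checked before the identifier test, since keywords also end in a letter.
        if (KeywordsBeforeRegex.Contains(previousToken)) return true;

        char last = previousToken[previousToken.Length - 1];

        // Tokens ending like an identifier, number, or string/template literal
        // are treated as operands, so the slash is guessed to be a division.
        if (char.IsLetterOrDigit(last) || last == '_' || last == '$' ||
            last == '"' || last == '\'' || last == '`')
        {
            return false;
        }

        // Genuinely ambiguous tokens: guess division, the more common case.
        if (last == ')' || last == ']' || last == '}') return false;
        if (previousToken == "++" || previousToken == "--") return false;

        // Any other punctuator (=, (, ,, ;, :, ...) is guessed to precede a regex.
        return true;
    }
}
```

With this guess, the / in var isHtml = /\.html$/ follows = and would be scanned as a regex, as would the standalone /\./ at the start of the input, while the / in a / b follows an identifier and would be scanned as a division.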

adams85, Feb 25 '23 13:02