vscode-textmate icon indicating copy to clipboard operation
vscode-textmate copied to clipboard

0-infinite quantifier inside look behind capture group causes tremendous performance loss

Open RedCMD opened this issue 3 years ago • 5 comments

Creating a syntax highlighter with the following code and running it on a single line file filled with thousands of the letter c will cause tremendous lag image

{
	"name": "Lag Syntax",
	"scopeName": "source.redcmd.syntax.lag",
	"patterns": [
		{
			"match": "(?<=ab*)c",
			"name": "strong keyword.control"
		}
	]
}

Replacing the * with a + or changing the regex to (?<=ab+|a)c will remove all lag and allow my machine to easily interpret a million character single line file But then a file filled with bc will lag instead

I cant seem to figure out how the engine matches lookbehinds does it start with the c first then try to match the lookbehind afterwards? or does it try to match the lookbehind first, then the c?

using {0,10000} instead of * reduces lag massively which seems to suggest that the engine is trying to match b an infinite amount (or an internal limit) of times before giving up but for some reason it only does so when its allowed to match 0 times?? maybe its an underflow error?

But then using the regex (?<=a(b+)?)c also causes an enormous amount of lag which I don't get unless.. (b+)? gets optmized to b*

or maybe cause the engine backtracks in the wrong direction and ends up endlessly rechecking b?

Noticed single line files were only getting partially highlighted image Dug around and found out it was caused by this regex: image (?<={\\d*)\\\\{2} is the part that was going over board on every single \\

RedCMD avatar Jan 10 '22 09:01 RedCMD

This might be causing some of the serious performance issues in the C++ syntax as well.

Part of our preprocessing steps in the C++ grammar optimize (b+)? to become b* so that's probably not doing us any favors

jeff-hykin avatar Jan 11 '22 20:01 jeff-hykin

I just looked through all c, cpp, cpp.embed and cuda-cpp grammar files and did not see this specfic issue once.

RedCMD avatar Jan 11 '22 23:01 RedCMD

Oh, well thanks for saving me the trouble @RedCMD. I suppose it's just regular catastrophic backtracking then for C++

jeff-hykin avatar Jan 12 '22 17:01 jeff-hykin

@jeff-hykin This may be causing a lot of the C++ performance issues I notice many "#inline_comment" includes under captures Capturing and applying a pattern causes performance loss: https://github.com/microsoft/vscode-textmate/issues/167

RedCMD avatar Jan 25 '22 06:01 RedCMD

Thank you for reaching out! I think the regex (?<=ab*)c might be running into catastrophic backtracking. Unfortunately, I don't know enough about the inner workings of a regex engine to help deeper than that. We are using oniguruma as the underlying regular expression engine. The author of oniguruma has been incredibly helpful and responsive in the past.

@kkos do you think there are optimization opportunities in oniguruma around such regular expressions?

alexdima avatar Jan 25 '22 19:01 alexdima