tagRegExp hangs with certain URLs: Catastrophic backtracking?

Open Esowteric opened this issue 6 years ago • 4 comments
The page source of the following URL causes a Node.js app to hang when dom-parser matches it against tagRegExp.

DOM source from: https://www.ecosia.org/

I wrote a simple vanilla JavaScript match script and tested the web page source against tagRegExp directly, and that script also hung. Could this be catastrophic backtracking?

tagRegExp: /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:(?:'[\s\S]*?')|(?:"[\s\S]*?")))*\s*\/?>)|([^<]|<(?![a-z\/]))*/gi

Thanks.

Esowteric avatar Feb 05 '19 20:02 Esowteric

This is the script I used:

<script type="text/javascript">
var text = '... html source ...';
var text_esc = text;
text_esc = text_esc.replace(/\</g, "&lt;");
text_esc = text_esc.replace(/\>/g, "&gt;");
var regex = /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:(?:'[\s\S]*?')|(?:"[\s\S]*?")))*\s*\/?>)|([^<]|<(?![a-z\/]))*/gi;
var found = text.match(regex);
var found_len = found.length;

document.write("Text: " + text_esc + "<br /><br />" + "Regex pattern: " + regex + "<br /><br />");

document.write("Matches: " + found_len + "<br /><br />");

for (var i=0;i<found_len;i++)
{
	found[i] = found[i].replace(/\</g, "&lt;");
	found[i] = found[i].replace(/\>/g, "&gt;");

	document.write("[" + i + "]: " + found[i] + "<br /><br />");
}
</script>

Esowteric avatar Feb 05 '19 20:02 Esowteric

The tagRegExp match is the first stage in the process, to pull out all tags from the DOM into an array, before looking for specific tags using getElementsByTagName, getAttribute, etc.
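
To make that first stage concrete, here is a minimal sketch of how a tag/text split with this kind of pattern might look. It uses the revised pattern that appears later in this thread so that it is safe to run; the `tokenize` helper and the sample markup are hypothetical and are not taken from dom-parser itself.

```javascript
// Revised tagRegExp, copied from the fix posted later in this thread.
var tagRegExp = /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:'[^']*'|"[^"]*"))*\s*\/?>)|[^<]*(?:<(?![a-z\/])[^<]*)*/gi;

// Hypothetical helper (not part of dom-parser): split markup into
// tag tokens and text tokens in document order.
function tokenize(html) {
  var matches = html.match(tagRegExp) || [];
  // The text alternative can match the empty string, so drop empty tokens.
  return matches.filter(function (token) {
    return token.length > 0;
  });
}

var tokens = tokenize('<div id="main"><p>1 < 2</p><br/></div>');
// tokens: ['<div id="main">', '<p>', '1 < 2', '</p>', '<br/>', '</div>']
```

Note that the stray `<` in `1 < 2` stays inside the text token, because the negative lookahead only treats `<` as the start of a tag when a letter or `/` follows it.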

Esowteric avatar Feb 05 '19 21:02 Esowteric

Many thanks to Wiktor Stribiżew at Stack Overflow for this solution:

tagRegExp in /lib/Dom.js:

/(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:'[^']*'|"[^"]*"))*\s*\/?>)|[^<]*(?:<(?![a-z\/])[^<]*)*/gi

See: https://stackoverflow.com/questions/54543223/node-js-dom-parser-tagregexp-regex-match-hangs-catastrophic-backtracking
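
For anyone comparing the two patterns, the sketch below isolates the two pieces that changed. This is only my reading of the difference, not a statement of the exact cause of the hang; the linked answer is the authoritative explanation. The variable names and sample strings are made up for illustration, and the old text alternative is written with a non-capturing group here.

```javascript
// Quoted attribute value, old vs. new form, each followed by ">".
var lazyValue = /^"[\s\S]*?">/;   // old: lazy "match anything"
var strictValue = /^"[^"]*">/;    // new: cannot run past a quote
var sample = '"a" b="c">';

// The lazy form may extend across the first closing quote to find a match,
// so the engine has many more candidate splits to try on long inputs.
var lazyMatches = lazyValue.test(sample);     // true
var strictMatches = strictValue.test(sample); // false

// Text between tags, old alternation vs. new unrolled loop.
var oldText = /^(?:[^<]|<(?![a-z\/]))*/i;
var newText = /^[^<]*(?:<(?![a-z\/])[^<]*)*/i;
var textSample = '1 < 2 <b>';

// Both accept the same text; the new form consumes runs of non-"<"
// characters in one step instead of one alternation choice per character.
var oldResult = oldText.exec(textSample)[0]; // '1 < 2 '
var newResult = newText.exec(textSample)[0]; // '1 < 2 '
```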

Esowteric avatar Feb 06 '19 12:02 Esowteric

@Esowteric thx! I will integrate this solution soon.

ershov-konst avatar Apr 15 '19 08:04 ershov-konst