htmlpurifier
fix: catastrophic backtracking in Core.AggressivelyFixLt
When given a large HTML document (over a million characters), the Core.AggressivelyFixLt regex hits catastrophic backtracking and $html comes back as null. TL;DR: HTMLPurifier gives you back a null document.
I tried many times to produce a regex which did not suffer from catastrophic backtracking but I think it ultimately comes back to the argument of why you should not use regex to parse HTML. The only solutions I could come up with were to either:
- Increase pcre.backtrack_limit to a higher value
- Disable Core.AggressivelyFixLt, but that's sub-optimal given the approach seems to work on documents of a reasonable size...
- Handle the null return value from preg_replace_callback and return $html (disable armor logic if a regex error occurs)
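For reference, the third option could look roughly like the sketch below. The helper name replaceOrKeep is hypothetical and not part of HTMLPurifier; it only shows the idea of treating a null result from preg_replace_callback as "leave the input alone".

```php
<?php
// Hypothetical helper (not HTMLPurifier code): run a regex callback and
// fall back to the untouched input if PCRE reports an error.
function replaceOrKeep(string $pattern, callable $callback, string $html): string
{
    $result = preg_replace_callback($pattern, $callback, $html);
    if ($result === null) {
        // preg_replace_callback returns null on failure; preg_last_error()
        // then reports the cause, e.g. PREG_BACKTRACK_LIMIT_ERROR.
        return $html;
    }
    return $result;
}
```

This keeps the document from being lost, but the armoring step is silently skipped on exactly the large inputs that trigger the error, which is why it was not the route taken here.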
The solution in this PR replaces the regex with a small algorithm built only on standard string manipulation functions, so it sidesteps regex backtracking altogether and runs very fast. The algorithm searches for HTML comments and runs a callback on each one.
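A minimal sketch of that kind of comment scanner is below. It is an illustration rather than the code in this PR: the function name replaceComments and the shape of the array passed to the callback (full comment, then inner text, similar to a regex match array) are assumptions.

```php
<?php
// Hypothetical helper: name and signature are illustrative, not the PR's code.
// Finds each <!-- ... --> span with strpos() and replaces it with the
// callback's return value. The callback receives [full comment, inner text].
function replaceComments(string $html, callable $callback): string
{
    $out = '';
    $offset = 0;
    while (($start = strpos($html, '<!--', $offset)) !== false) {
        $end = strpos($html, '-->', $start + 4);
        if ($end === false) {
            break; // unterminated comment: leave the remainder untouched
        }
        // Copy the text before the comment, then the callback's replacement.
        $out .= substr($html, $offset, $start - $offset);
        $inner = substr($html, $start + 4, $end - $start - 4);
        $out .= $callback(['<!--' . $inner . '-->', $inner]);
        $offset = $end + 3; // continue scanning after the closing -->
    }
    // Append whatever follows the last complete comment.
    return $out . substr($html, $offset);
}
```

Each pass moves the offset forward past the comment it just handled, so the scan is linear in the document length and does not depend on pcre.backtrack_limit.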
I've not messed with the signatures of the callbackUndoCommentSubst and callbackArmorCommentEntities functions because they're public and might be used by other libraries.