fix: catastrophic backtracking in Core.AggressivelyFixLt

Open bytestream opened this issue 8 months ago • 0 comments

When given a large HTML document (over a million characters), the Core.AggressivelyFixLt regex backtracks catastrophically, preg_replace_callback returns null, and $html ends up as null. TL;DR: HTMLPurifier gives you back a null document.

I tried many times to produce a regex that did not suffer from catastrophic backtracking, but I think it ultimately comes back to the argument for why you should not use regex to parse HTML. The only regex-based options I could come up with were to:

  • Increase pcre.backtrack_limit to a higher value
  • Disable Core.AggressivelyFixLt but that's sub-optimal given the approach seems to work on documents of a reasonable size...
  • Handle the null return value from preg_replace_callback and return $html (disable armor logic if a regex error occurs)
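For reference, the first and third options above could look roughly like the sketch below. The helper name replaceOrKeep is hypothetical and is not part of HTML Purifier, and the limit value is only illustrative. It relies on the documented behaviour that preg_replace_callback returns null on error and that preg_last_error reports PREG_BACKTRACK_LIMIT_ERROR when the limit is hit.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: run a regex callback
// replacement, and fall back to the untouched input if PCRE reports an error
// instead of propagating null.
function replaceOrKeep(string $pattern, callable $callback, string $html): string
{
    $result = preg_replace_callback($pattern, $callback, $html);
    if ($result === null) {
        // null means the PCRE engine failed; preg_last_error() would return
        // PREG_BACKTRACK_LIMIT_ERROR if pcre.backtrack_limit was exhausted.
        return $html;
    }
    return $result;
}

// Option 1 from the list: raise the limit for the current request only.
// The value here is illustrative, not a recommendation.
ini_set('pcre.backtrack_limit', '5000000');

// Option 3 from the list: the caller always gets a string back.
$clean = replaceOrKeep(
    '/<!--.*?-->/s',
    function (array $m) { return strtoupper($m[0]); },
    'a<!--b-->c'
);
```

Neither option removes the backtracking itself: raising the limit only moves the document size at which the failure appears, and the fallback skips the armor step for exactly the documents that trigger it.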

The solution in this PR uses a small algorithm built only on standard string manipulation functions, so it involves no regex backtracking and runs very quickly. The algorithm searches for HTML comments and allows a callback to be run on each of them.
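To illustrate the general idea (this is a minimal sketch, not the code in this PR), a comment scanner built on strpos and substr might look like the following. The function name mapHtmlComments and the choice to leave unterminated comments untouched are assumptions made for the example.

```php
<?php
// Hypothetical sketch, not the PR's implementation: walk the document with
// strpos() and hand the body of each <!-- ... --> comment to a callback.
function mapHtmlComments(string $html, callable $callback): string
{
    $out = '';
    $offset = 0;
    while (($start = strpos($html, '<!--', $offset)) !== false) {
        $end = strpos($html, '-->', $start + 4);
        if ($end === false) {
            break; // assumption: an unterminated comment is left as-is
        }
        // Copy everything before the comment verbatim.
        $out .= substr($html, $offset, $start - $offset);
        // Pass only the comment body (without delimiters) to the callback.
        $body = substr($html, $start + 4, $end - $start - 4);
        $out .= '<!--' . $callback($body) . '-->';
        $offset = $end + 3;
    }
    // Append whatever follows the last processed comment.
    return $out . substr($html, $offset);
}

// Usage: escape "<" inside comment bodies only.
$armored = mapHtmlComments('a<!-- x < y -->b', function (string $body): string {
    return str_replace('<', '&lt;', $body);
});
// $armored is 'a<!-- x &lt; y -->b'
```

Each strpos call resumes from the previous offset, so the scan is a single forward pass over the input and its cost grows linearly with document size.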

I've not messed with the signatures of the callbackUndoCommentSubst and callbackArmorCommentEntities functions because they're public and might be used by other libraries.

bytestream, Feb 07 '25 13:02