php-html-parser

DOM Cleaner: mb_eregi_replace errors out with retry-limit-in-match

Open half0wl opened this issue 4 years ago • 2 comments

Reproduction:

>>> use PHPHtmlParser\Dom;
>>> $dom = new Dom;
>>> $dom->loadFromUrl("https://casper.com/gifts/?clickid=T02U6OVQYxyLUbdwUx0Mo36dUkB1HNWwiSMnwQ0");

Throws:

PHP Warning:  mb_eregi_replace(): mbregex search failure in php_mbereg_replace_exec(): retry-limit-in-match
over in <stripped>/paquettg/php-html-parser/src/PHPHtmlParser/Dom/Cleaner.php on line 81
PHPHtmlParser\Exceptions\LogicalException with message 'mb_eregi_replace returned false instead of a string.
Error when attempting to remove scripts 2.'

I've tried ini_set("pcre.backtrack_limit", "10000000000") after some Google-fu on the error, but it has no effect.

I can reproduce this on pages with huge <script></script> tags, typically when one of them contains a giant JSON blob.

half0wl, Jul 27 '21 02:07

I have the exact same problem, but with a different URL. As a quick fix I disabled script removal with $dom->setOptions((new Options())->setRemoveScripts(false)); but I would rather have a real fix, especially because there's a warning that keeping script tags could have unforeseen consequences.

Any help on this issue, please, @paquettg?

Deewde, Jan 28 '22 11:01

OK, I've fixed it without disabling script removal by raising the mbstring regex retry limit to 10 million. The comments in the stock php.ini describe the directive:

; This directive specifies maximum retry count for mbstring regular expressions. It is similar
; to the pcre.backtrack_limit for PCRE.
; Default: 1000000
;mbstring.regex_retry_limit=1000000

so I've used

ini_set("mbstring.regex_retry_limit", "10000000");

and everything works fine on this front now.

Deewde, Jan 28 '22 12:01
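
For anyone who would rather not raise mbstring.regex_retry_limit for the whole process, here is a minimal sketch that raises it only around the call that needs it and then restores the previous value. The helper name withMbRegexRetryLimit is hypothetical and not part of php-html-parser. The sample pattern and input are placeholders for illustration, not the library's own regex. It assumes PHP 7.4 or later with the mbstring extension loaded, since the directive does not exist on older versions.

```php
<?php
declare(strict_types=1);

/**
 * Run $callback with mbstring.regex_retry_limit temporarily set to $limit,
 * then restore the previous value. Hypothetical helper, not a library API.
 */
function withMbRegexRetryLimit(int $limit, callable $callback)
{
    // ini_set() returns the old value as a string, or false on failure
    // (for example when the directive is unknown to this PHP build).
    $previous = ini_set('mbstring.regex_retry_limit', (string) $limit);

    try {
        return $callback();
    } finally {
        // Restore only if the setting was actually changed.
        if ($previous !== false) {
            ini_set('mbstring.regex_retry_limit', $previous);
        }
    }
}

// Illustrative usage: the pattern and the input are placeholders.
if (function_exists('mb_eregi_replace')) {
    $html = '<p>hi</p><script>var x = {"a": 1};</script>';
    $clean = withMbRegexRetryLimit(10000000, function () use ($html) {
        return mb_eregi_replace('<script[^>]*>.*?</script>', '', $html);
    });
}
```

The same wrapper could be placed around $dom->loadFromUrl(...), so the higher limit applies only while the page is being cleaned.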