php-html-parser
DOM Cleaner: mb_eregi_replace errors out with retry-limit-in-match
Reproduction:
>>> use PHPHtmlParser\Dom;
>>> $dom = new Dom;
>>> $dom->loadFromUrl("https://casper.com/gifts/?clickid=T02U6OVQYxyLUbdwUx0Mo36dUkB1HNWwiSMnwQ0");
Throws:
PHP Warning: mb_eregi_replace(): mbregex search failure in php_mbereg_replace_exec(): retry-limit-in-match over in <stripped>/paquettg/php-html-parser/src/PHPHtmlParser/Dom/Cleaner.php on line 81
PHPHtmlParser\Exceptions\LogicalException with message 'mb_eregi_replace returned false instead of a string.
Error when attempting to remove scripts 2.'
I've tried ini_set("pcre.backtrack_limit", "10000000000") after some Google-fu on the error, but it has no effect.
I can reproduce this on pages with huge <script></script> tags, typically when one contains a giant JSON blob.
I have the exact same problem, but with a different URL. As a quick fix, I disabled script removal with $dom->setOptions((new Options())->setRemoveScripts(false)); but I would rather have a real fix, especially since there's a warning that keeping script tags could have unforeseen consequences.
Any help on this issue, please, @paquettg?
Ok, I've fixed it without disabling script removal by increasing the mbstring regex retry limit to 10 million. The self-documented php.ini describes the directive:
; This directive specifies maximum retry count for mbstring regular expressions. It is similar
; to the pcre.backtrack_limit for PCRE.
; Default: 1000000
;mbstring.regex_retry_limit=1000000
so I've used
ini_set("mbstring.regex_retry_limit", "10000000");
and everything works fine on this front now.
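For anyone who would rather not raise the limit for the whole process, here is a minimal sketch that bumps mbstring.regex_retry_limit only while one piece of work runs and then puts the old value back. The helper name withRegexRetryLimit is hypothetical (it is not part of php-html-parser), and this assumes a PHP version that has the directive (7.4+, as far as I know) with the mbstring extension loaded.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of php-html-parser: run $task with a
 * temporarily raised mbstring.regex_retry_limit, then restore the old value.
 */
function withRegexRetryLimit(int $limit, callable $task)
{
    // ini_set() returns the previous value as a string, or false on failure
    // (for example when the directive is unknown to this PHP build).
    $previous = ini_set('mbstring.regex_retry_limit', (string) $limit);

    try {
        return $task();
    } finally {
        // Only restore if the earlier ini_set() actually changed something.
        if ($previous !== false) {
            ini_set('mbstring.regex_retry_limit', $previous);
        }
    }
}

// Usage sketch, assuming $dom and $url are set up as in the reproduction above:
// withRegexRetryLimit(10000000, function () use ($dom, $url) {
//     return $dom->loadFromUrl($url);
// });
```

The try/finally means the previous limit comes back even if the parser throws, so the higher limit does not leak into unrelated mb_ereg calls.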