
Parsing document with a lot of HTML tags is slow

Open alecpl opened this issue 5 years ago • 12 comments

I have a script that generates an HTML sample of about 1.5 MB, emulating a real-world example. Then I parse it:

$html = '<HTML><BODY>';
$lines = 20000;
while ($lines--) {
    $html .= '<P DIR=LTR><SPAN LANG="en-gb"><FONT FACE="Consolas">&gt;&gt; </FONT></SPAN></P>';
}

$html5 = new Masterminds\HTML5();
$node  = $html5->loadHTML($html);

and here's the result:

PHP Fatal error:  Maximum execution time of 120 seconds exceeded in vendor/masterminds/html5/src/HTML5/Parser/DOMTreeBuilder.php on line 433
PHP Stack trace:
PHP   1. {main}() test.php:0
PHP   2. Masterminds\HTML5->loadHTML() test.php:23
PHP   3. Masterminds\HTML5->parse() vendor/masterminds/html5/src/HTML5.php:98
PHP   4. Masterminds\HTML5\Parser\Tokenizer->parse() vendor/masterminds/html5/src/HTML5.php:174
PHP   5. Masterminds\HTML5\Parser\Tokenizer->consumeData() vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:89
PHP   6. Masterminds\HTML5\Parser\Tokenizer->tagOpen() vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:132
PHP   7. Masterminds\HTML5\Parser\Tokenizer->tagName() vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:284
PHP   8. Masterminds\HTML5\Parser\DOMTreeBuilder->startTag() vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:388

I tested this with 2.7.0 and some older versions, with no success. A sample half that size does finish, but it takes 27 seconds, so the parse time does not grow linearly with input size.

Cross-ref: https://github.com/roundcube/roundcubemail/issues/7331

alecpl avatar Apr 16 '20 18:04 alecpl

Have you tried to debug it with Blackfire or some other profiler?

goetas avatar Apr 18 '20 04:04 goetas

Not yet, but I can add that the specific content is not that important; the number of tags is. So it looks like this library has a problem with parsing big HTML pages. FYI, DOMDocument parses the sample in less than a second.
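
For anyone who wants to reproduce the comparison, here is a minimal timing sketch. The helper names `buildSample()` and `timeParse()` are made up for illustration; the html5-php branch only runs if the library happens to be autoloadable in your setup, which is an assumption about the environment.

```php
<?php
// Build the same kind of sample as in the report: many small, repeated tags.
function buildSample(int $lines): string
{
    $row = '<P DIR=LTR><SPAN LANG="en-gb"><FONT FACE="Consolas">&gt;&gt; </FONT></SPAN></P>';
    return '<HTML><BODY>' . str_repeat($row, $lines);
}

// Time a single call of $parser on $html and return elapsed seconds.
function timeParse(callable $parser, string $html): float
{
    $start = hrtime(true);          // monotonic clock, nanoseconds
    $parser($html);
    return (hrtime(true) - $start) / 1e9;
}

// Baseline: PHP's built-in DOMDocument HTML parser.
$domParser = function (string $html): DOMDocument {
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
};

$html = buildSample(20000);
printf("sample size: %.2f MB\n", strlen($html) / 1048576);
printf("DOMDocument: %.3f s\n", timeParse($domParser, $html));

// Only runs when the library is autoloadable (e.g. via Composer).
if (class_exists('Masterminds\HTML5')) {
    $html5Parser = fn (string $html) => (new Masterminds\HTML5())->loadHTML($html);
    printf("html5-php:   %.3f s\n", timeParse($html5Parser, $html));
}
```

Changing the argument to `buildSample()` makes it easy to see how each parser scales with the number of tags.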

alecpl avatar Apr 18 '20 06:04 alecpl

I'm not sure how useful this is, but here's an xdebug profile of a smaller sample. Sorry for the Polish labels; forcing English in KCacheGrind didn't work. [image: xdebug profile]

alecpl avatar Apr 18 '20 07:04 alecpl

So it looks like DOMElement::appendChild() is the main bottleneck. Here are some performance stats showing how the number of tags makes a difference (PHP 7.4):

Tags  |  Time
---------------
10k   |   1.3s
20k   |   3.3s
30k   |   7.9s
40k   |  16.4s
50k   |  28.3s
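
A rough way to read these numbers: if the time grows like c * n^k, the exponent k can be estimated from any two rows as log(t2/t1) / log(n2/n1). The sketch below applies that to the table; `growthExponent()` is just an illustrative helper, and with only five data points the result is an estimate, nothing more. Over the whole range the exponent comes out near 2, which points at roughly quadratic behaviour.

```php
<?php
// Timings copied from the table above: tag count => seconds (PHP 7.4).
$timings = [
    10000 => 1.3,
    20000 => 3.3,
    30000 => 7.9,
    40000 => 16.4,
    50000 => 28.3,
];

// If time ~ c * n^k, then k = log(t2 / t1) / log(n2 / n1).
function growthExponent(int $n1, float $t1, int $n2, float $t2): float
{
    return log($t2 / $t1) / log($n2 / $n1);
}

// Exponent between each pair of neighbouring rows.
$counts = array_keys($timings);
for ($i = 1; $i < count($counts); $i++) {
    [$a, $b] = [$counts[$i - 1], $counts[$i]];
    printf(
        "%dk -> %dk: k = %.2f\n",
        intdiv($a, 1000),
        intdiv($b, 1000),
        growthExponent($a, $timings[$a], $b, $timings[$b])
    );
}

// Exponent over the whole measured range.
$first   = $counts[0];
$last    = $counts[count($counts) - 1];
$overall = growthExponent($first, $timings[$first], $last, $timings[$last]);
printf("overall:    k = %.2f\n", $overall);
```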

alecpl avatar Apr 19 '20 07:04 alecpl

Can you try to benchmark appendChild() alone and see if it slows down after a certain number of tags?

goetas avatar Apr 19 '20 07:04 goetas

Nope, it's the other way round (the more tags, the better the time per tag). What's more, the following script is blazingly fast (under 1 second):

$doc = new DOMDocument;
$body = $doc->createElement("body");
$doc->appendChild($body);
$lines = 100000;
while ($lines--) {
    $p = $doc->createElement("p");
    $body->appendChild($p);
    $span = $doc->createElement("span");
    $p->appendChild($span);
    $font = $doc->createElement("font");
    $span->appendChild($font);
}

alecpl avatar Apr 19 '20 07:04 alecpl

image

goetas avatar Apr 19 '20 07:04 goetas

Hmm, weird...

goetas avatar Apr 19 '20 09:04 goetas

~the bottleneck seems to be autoclose()..., by removing that, the script completes in 3s~ NVM

goetas avatar Jun 14 '20 09:06 goetas

This turned out to be a PHP issue that can be worked around by doing:

$html5 = new Masterminds\HTML5([
    'disable_html_ns' => true
]);
$node  = $html5->loadHTML($html);

The perf issue was introduced by https://github.com/php/php-src/blob/35e0a91db717fe441a89ca9554d8843d8ee63112/ext/dom/php_dom.c and https://github.com/php/php-src/commit/84b90f639d09f002ed50c87877b62615e928b88b
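
Since the workaround is the `disable_html_ns` option, one way to check whether a given PHP build is affected might be to compare plain and namespaced element creation directly. This is only a sketch under that assumption: `buildTree()` and the `XHTML_NS` constant are names invented here, and whether the namespaced run is actually slower will depend on the PHP version.

```php
<?php
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Build $lines rows of <p><span><font> under <body> and return elapsed seconds.
// With $namespaced = true, elements are created in the XHTML namespace.
function buildTree(int $lines, bool $namespaced): float
{
    $doc  = new DOMDocument();
    $make = $namespaced
        ? fn (string $name) => $doc->createElementNS(XHTML_NS, $name)
        : fn (string $name) => $doc->createElement($name);

    $start = hrtime(true);          // monotonic clock, nanoseconds
    $body  = $doc->appendChild($make('body'));
    while ($lines--) {
        $p    = $body->appendChild($make('p'));
        $span = $p->appendChild($make('span'));
        $span->appendChild($make('font'));
    }
    return (hrtime(true) - $start) / 1e9;
}

// Doubling the row count shows whether either variant grows faster than linearly.
foreach ([2000, 4000, 8000] as $lines) {
    printf(
        "%5d rows  plain: %.3f s  namespaced: %.3f s\n",
        $lines,
        buildTree($lines, false),
        buildTree($lines, true)
    );
}
```

If the namespaced column grows much faster than the plain one as the row count doubles, the build is probably hitting the same slow path.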

goetas avatar Jun 14 '20 13:06 goetas

Thanks for the workaround. With it, my initial test script takes 8 seconds, which is not that bad. DOMDocument needs 0.3 seconds.

Did you already create a ticket in PHP's bugtracker?

alecpl avatar Jul 26 '20 09:07 alecpl

This was listed by xhprof with PHP 8.3.2-1. Is this still a thing, or should I look elsewhere?

Screenshot 2024-02-23 at 12 29 38

steinmb avatar Feb 23 '24 11:02 steinmb