htmlpurifier
Inconsistent output across PHP versions
I've noticed a strange inconsistency with HTML Purifier. Any ideas what could be causing it?
Script:
<?php
require __DIR__ . '/../vendor/autoload.php';
$html = <<<HTML
<h1>Test</h1>
<h2>Test 2</h2>
<p>This is a paragraph
This is a new line
Another new line</p>
<ul>
<li>bullet</li>
<li>bullet 2</li>
</ul>
<p><img src="imagesrc.png" alt="img" /></p>
<p><a href="https://www.google.com">Hyperlink</a></p>
HTML;
$output = (new HTMLPurifier)->purify($html);
echo md5($output) . PHP_EOL . $output;
Ubuntu 18.04.4 LTS (Dev VM)
PHP 7.4.8 (cli) (built: Jul 13 2020 16:45:47) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
    with Zend OPcache v7.4.8, Copyright (c), by Zend Technologies
<h1>Test</h1>
<h2>Test 2</h2>
<p>This is a paragraph
This is a new line
Another new line</p>
<ul><li>bullet</li>
<li>bullet 2</li>
</ul><p><img src="imagesrc.png" alt="img" /></p>
(md5 3966db7c2db30e0e63f566ac4a01632d)
--
Ubuntu 18.04.4 LTS (Dev VM)
PHP 7.4.10 (cli) (built: Sep 9 2020 06:36:14) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
    with Zend OPcache v7.4.10, Copyright (c), by Zend Technologies
<h1>Test</h1>
<h2>Test 2</h2>
<p>This is a paragraph
This is a new line
Another new line</p>
<ul>
<li>bullet</li>
<li>bullet 2</li>
</ul>
<p><img src="imagesrc.png" alt="img" /></p>
<p><a href="https://www.google.com">Hyperlink</a></p>
(md5 f4b6f3065f0adb5ae6ab3e45f2380586)
--
Ubuntu 18.04.5 LTS (CI Server)
PHP 7.4.10 (cli) (built: Sep 22 2020 10:00:08) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
<h1>Test</h1>
<h2>Test 2</h2>
<p>This is a paragraph
This is a new line
Another new line</p>
<ul><li>bullet</li>
<li>bullet 2</li>
</ul><p><img src="imagesrc.png" alt="img" /></p>
<p><a href="https://www.google.com">Hyperlink</a></p>
(md5 3966db7c2db30e0e63f566ac4a01632d)
Usually it's due to differences in the version of libxml shipped with PHP, which we use to do parsing.
The two differing PHP versions give the same output for php -i | grep 'libxml'. Could they still be different?
PHP 7.4.10 (cli) (built: Sep 9 2020 06:36:14) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
    with Zend OPcache v7.4.10, Copyright (c), by Zend Technologies
libxml Version => 2.9.10
libxml
libxml2 Version => 2.9.10
libxslt compiled against libxml Version => 2.9.4
PHP 7.4.8 (cli) (built: Jul 13 2020 16:45:47) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
    with Zend OPcache v7.4.8, Copyright (c), by Zend Technologies
libxml Version => 2.9.10
libxml
libxml2 Version => 2.9.10
libxslt compiled against libxml Version => 2.9.4
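For a second opinion beyond php -i, the same facts can be read from inside the interpreter that actually runs the script. This is only a sketch: describeXmlEnvironment() is a made-up helper name, and the DOM-extension and library-version fields are extra things worth comparing, not causes anyone in this thread has confirmed.

```php
<?php
/**
 * Collect the runtime facts that are cheap to compare across machines.
 * Hypothetical helper, not part of HTML Purifier.
 */
function describeXmlEnvironment(): array
{
    return [
        'php'          => PHP_VERSION,
        // Version of libxml2 that the libxml extension reports, if loaded.
        'libxml'       => defined('LIBXML_DOTTED_VERSION') ? LIBXML_DOTTED_VERSION : null,
        // Whether the DOM extension is available to this interpreter.
        'dom_loaded'   => extension_loaded('dom'),
        // Installed library version; null if the autoloader cannot find it.
        'htmlpurifier' => class_exists('HTMLPurifier') ? HTMLPurifier::VERSION : null,
    ];
}

foreach (describeXmlEnvironment() as $key => $value) {
    echo str_pad($key, 14) . var_export($value, true) . PHP_EOL;
}
```

Running this right after the require line of the test script on each machine gives a small table to diff. If every row matches, the difference probably lies somewhere other than the libxml version.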
Oh, that is fairly strange. If you want to debug this, try printing the intermediate HTML after each HTML Purifier phase to localize where the difference first shows up.
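A rough sketch of that kind of phase-by-phase trace follows. It assumes the 4.x source layout as I understand it (a lexer, then the four core strategies, then the generator), so check the class names against the installed version. traceStages() is a hypothetical helper, and the input is a reduced version of the test HTML.

```php
<?php
// Sketch only: assumes Composer's autoloader is already loaded, as in the
// script at the top of this thread.

/**
 * Pass $input through each named stage in order and record an md5 of the
 * rendered intermediate value after every stage.
 */
function traceStages($input, array $stages, callable $render): array
{
    $trace = [];
    $value = $input;
    foreach ($stages as $name => $stage) {
        $value = $stage($value);
        $trace[$name] = md5($render($value));
    }
    return $trace;
}

if (class_exists('HTMLPurifier')) {
    $html = "<ul>\n<li>bullet</li>\n<li>bullet 2</li>\n</ul>\n<p>text</p>";

    // Roughly the setup purify() performs before running its pipeline.
    $config    = HTMLPurifier_Config::createDefault();
    $context   = new HTMLPurifier_Context();
    $generator = new HTMLPurifier_Generator($config, $context);
    $context->register('Generator', $generator);
    $context->register('IDAccumulator', HTMLPurifier_IDAccumulator::build($config, $context));

    $lexer  = HTMLPurifier_Lexer::create($config);
    // The stage name includes the lexer class, since that can differ per machine.
    $stages = [
        'lexer (' . get_class($lexer) . ')' =>
            fn($input) => $lexer->tokenizeHTML($input, $config, $context),
    ];
    foreach (['RemoveForeignElements', 'MakeWellFormed', 'FixNesting', 'ValidateAttributes'] as $name) {
        $class    = 'HTMLPurifier_Strategy_' . $name;
        $strategy = new $class();
        $stages[$name] = fn($tokens) => $strategy->execute($tokens, $config, $context);
    }

    // Every stage yields a token list, so render it back to HTML before hashing.
    $render = fn($tokens) => $generator->generateFromTokens($tokens);
    foreach (traceStages($html, $stages, $render) as $name => $hash) {
        printf("%-45s %s\n", $name, $hash);
    }
}
```

Run it on both machines and compare the lists. The first stage whose hash differs is the place to dig, and if the lexer class name itself differs, that narrows things down before any strategy has run.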