Inconsistent output across PHP versions

Open liamkeily opened this issue 4 years ago • 3 comments

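I've noticed a strange inconsistency with HTML Purifier: the same script produces different whitespace in its output (and so a different md5) on different PHP builds. Any ideas what this could be related to?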

Script:

<?php
require __DIR__ . '/../vendor/autoload.php';

$html = <<<HTML
<h1>Test</h1>
<h2>Test 2</h2>
<p>This is a paragraph
This is a new line
Another new line</p>
<ul>
<li>bullet</li>
<li>bullet 2</li>
</ul>
<p><img src="imagesrc.png" alt="img" /></p>
<p><a href="https://www.google.com">Hyperlink</a></p>
HTML;

$output = (new HTMLPurifier)->purify($html);
echo md5($output) . PHP_EOL . $output;

Ubuntu 18.04.4 LTS (Dev VM)
PHP 7.4.8 (cli) (built: Jul 13 2020 16:45:47) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
    with Zend OPcache v7.4.8, Copyright (c), by Zend Technologies

<h1>Test</h1>
<h2>Test 2</h2>
<p>This is a paragraph
This is a new line
Another new line</p>
<ul><li>bullet</li>
<li>bullet 2</li>
</ul><p><img src="imagesrc.png" alt="img" /></p>
<p><a href="https://www.google.com">Hyperlink</a></p>

(md5 3966db7c2db30e0e63f566ac4a01632d)

--

Ubuntu 18.04.4 LTS (Dev VM)
PHP 7.4.10 (cli) (built: Sep 9 2020 06:36:14) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
    with Zend OPcache v7.4.10, Copyright (c), by Zend Technologies

<h1>Test</h1>
<h2>Test 2</h2>
<p>This is a paragraph
This is a new line
Another new line</p>
<ul>
<li>bullet</li>
<li>bullet 2</li>
</ul>
<p><img src="imagesrc.png" alt="img" /></p>
<p><a href="https://www.google.com">Hyperlink</a></p>

(md5 f4b6f3065f0adb5ae6ab3e45f2380586)

--

Ubuntu 18.04.5 LTS (CI Server)
PHP 7.4.10 (cli) (built: Sep 22 2020 10:00:08) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies

<h1>Test</h1>
<h2>Test 2</h2>
<p>This is a paragraph
This is a new line
Another new line</p>
<ul><li>bullet</li>
<li>bullet 2</li>
</ul><p><img src="imagesrc.png" alt="img" /></p>
<p><a href="https://www.google.com">Hyperlink</a></p>

(md5 3966db7c2db30e0e63f566ac4a01632d)

liamkeily, Sep 24 '20 10:09

Usually it's due to differences in the version of libxml shipped with PHP, which we use to do parsing.
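One way to probe the parser in isolation, offered as a minimal sketch rather than as HTML Purifier's own code path: feed the same fragment to PHP's DOM extension (which is backed by libxml) and count the whitespace-only text nodes it keeps. The function name `whitespace_text_nodes` and the `<html><body>` wrapper are assumptions made for illustration. If the count differs between two machines, the parser itself treats inter-element whitespace differently.

```php
<?php
// Hypothetical helper: counts whitespace-only text nodes that libxml's
// HTML parser keeps for a fragment. Requires the dom extension.
function whitespace_text_nodes(string $html): int
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $count = 0;
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $node) {
        // A node that is empty after trimming contains only whitespace.
        if (trim($node->nodeValue) === '') {
            $count++;
        }
    }
    return $count;
}

// Fragment modelled on the issue's input; the result may vary by build.
$fragment = "<p>Another new line</p>\n<ul>\n<li>bullet</li>\n</ul>\n<p>end</p>";
echo PHP_VERSION, ': ', whitespace_text_nodes($fragment), PHP_EOL;
```

Running this on each machine and comparing the printed counts shows whether the difference already exists before HTML Purifier does any filtering.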

ezyang, Sep 24 '20 14:09

The two PHP versions that differ give the same output for php -i | grep 'libxml'. Could the libxml builds still be different?

PHP 7.4.10 (cli) (built: Sep 9 2020 06:36:14) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
    with Zend OPcache v7.4.10, Copyright (c), by Zend Technologies

libxml Version => 2.9.10
libxml
libxml2 Version => 2.9.10
libxslt compiled against libxml Version => 2.9.4

PHP 7.4.8 (cli) (built: Jul 13 2020 16:45:47) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
    with Zend OPcache v7.4.8, Copyright (c), by Zend Technologies

libxml Version => 2.9.10
libxml
libxml2 Version => 2.9.10
libxslt compiled against libxml Version => 2.9.4
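The grep above is case-sensitive, so it may skip phpinfo entries spelled with different capitalisation (some builds report the libxml extension's own version lines under keys such as "libXML"; that spelling is an assumption about these particular builds). A small sketch that filters the module information case-insensitively from within PHP, with a hypothetical function name `libxml_info_lines`:

```php
<?php
// Hypothetical helper: returns every phpinfo() module line that mentions
// libxml, ignoring case. Assumes the CLI SAPI, where phpinfo() prints
// plain text rather than HTML.
function libxml_info_lines(): array
{
    ob_start();
    phpinfo(INFO_MODULES);
    $text = (string) ob_get_clean();

    $lines = preg_split('/\R/', $text);
    if ($lines === false) {
        return [];
    }

    $matches = array_filter($lines, function ($line) {
        return stripos($line, 'libxml') !== false;
    });
    return array_values($matches);
}

foreach (libxml_info_lines() as $line) {
    echo $line, PHP_EOL;
}
```

Comparing this list across the two installations would show any compiled or loaded version line that the original grep left out.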

liamkeily, Sep 24 '20 14:09

Oh, that is fairly strange. If you want to try debugging this, print the intermediate HTML after each HTML Purifier phase and localize where the difference first shows up.
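A coarser variant of that idea, sketched here under stated assumptions: instead of instrumenting the internal phases, swap only the parsing phase through the Core.LexerImpl configuration directive and compare the results. DOMLex parses through PHP's DOM extension, while DirectLex is HTML Purifier's own pure-PHP lexer, so if two machines agree under DirectLex but not under DOMLex, the parsing step is the likely source. The helper names `first_difference` and `purify_with_lexers` are hypothetical, and the guarded section only runs when HTML Purifier's autoloader has already been loaded.

```php
<?php
/**
 * Return the first line at which two outputs differ, or null if identical.
 */
function first_difference(string $a, string $b): ?array
{
    $linesA = explode("\n", $a);
    $linesB = explode("\n", $b);
    $count = max(count($linesA), count($linesB));

    for ($i = 0; $i < $count; $i++) {
        $lineA = $linesA[$i] ?? null; // null marks "no such line"
        $lineB = $linesB[$i] ?? null;
        if ($lineA !== $lineB) {
            return ['line' => $i + 1, 'a' => $lineA, 'b' => $lineB];
        }
    }
    return null;
}

/**
 * Purify $html once per lexer name and return the outputs keyed by lexer.
 */
function purify_with_lexers(string $html, array $lexers): array
{
    $outputs = [];
    foreach ($lexers as $lexer) {
        $config = HTMLPurifier_Config::createDefault();
        $config->set('Cache.DefinitionImpl', null); // skip the on-disk cache
        $config->set('Core.LexerImpl', $lexer);
        $outputs[$lexer] = (new HTMLPurifier($config))->purify($html);
    }
    return $outputs;
}

// Runs only if the library is available (autoloader required beforehand).
if (class_exists('HTMLPurifier')) {
    $html = "<p>Another new line</p>\n<ul>\n<li>bullet</li>\n</ul>\n<p>end</p>";
    $outputs = purify_with_lexers($html, ['DOMLex', 'DirectLex']);
    foreach ($outputs as $lexer => $output) {
        echo $lexer, ' ', md5($output), PHP_EOL;
    }
    var_dump(first_difference($outputs['DOMLex'], $outputs['DirectLex']));
}
```

`first_difference` can also be applied to outputs saved from two different machines, which points at the first line where they diverge instead of only showing that the md5 sums differ.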

ezyang, Sep 24 '20 21:09