WordPress lazy-loading noscript cleaner broken with libxml2 < 2.9.9
With libxml2 2.9.4 (included in Ubuntu 18.04 LTS), Graby’s WordPress lazy-loading noscript cleaner is unable to remove the second image in the noscript text:
<p><img data-lazyloaded="1" src="data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSI2MzkiIGhlaWdodD0iNDA4IiB2aWV3Qm94PSIwIDAgNjM5IDQwOCI+PHJlY3Qgd2lkdGg9IjEwMCUiIGhlaWdodD0iMTAwJSIgZmlsbD0iI2NmZDRkYiIvPjwvc3ZnPg==" class="aligncenter size-full wp-image-32079" data-src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /><noscript><img class="aligncenter size-full wp-image-32079" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /></noscript></p>
is turned into:
<p><img data-lazyloaded="1" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" class="aligncenter size-full wp-image-32079" alt="" width="639" height="408" /></p><noscript>
<p><img class="aligncenter size-full wp-image-32079" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /></p>
It works fine with libxml2 2.9.10 in later versions of Ubuntu; it was likely fixed by https://gitlab.gnome.org/GNOME/libxml2/-/commit/35e83488505d501864826125cfe6a7950d6cba78.
You can reproduce this by running
$ git clone https://github.com/jtojnar/graby-double-images && cd graby-double-images
$ composer install
$ php test.php
on a system with libxml2 older than 2.9.9, or, if you have Nix:
$ nix-shell --run 'composer install && php test.php'
See https://github.com/fossar/selfoss/issues/1230 for more details.
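To quickly tell whether a given PHP install is linked against an affected libxml2, something like the following could be used. This is only a sketch: `isLibxmlLikelyAffected` is a hypothetical helper (not part of Graby), and the 2.9.9 threshold is simply the version named in this report, not a confirmed boundary.

```php
<?php
// Hypothetical helper, not part of Graby.
// Assumption: versions older than 2.9.9 are affected, as described in this report.
function isLibxmlLikelyAffected(string $version = LIBXML_DOTTED_VERSION): bool
{
    // version_compare() compares segments numerically, so 2.9.10 sorts after 2.9.9.
    return version_compare($version, '2.9.9', '<');
}

printf(
    "libxml2 %s: %s\n",
    LIBXML_DOTTED_VERSION, // version of libxml2 that PHP was built against
    isLibxmlLikelyAffected() ? 'likely affected' : 'likely not affected'
);
```

`LIBXML_DOTTED_VERSION` is provided by PHP's libxml extension, so this reports the library PHP actually uses rather than whatever the system package manager lists.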
At this point I see these possible solutions:
- Recommend using html5lib instead of libxml, though I am not sure how performant it is.
- Try to find out if it is possible to make libxml parse the noscript inside p correctly.
- Make the ContentExtractor look for noscript among the parent node’s siblings as well.
- Ask Ubuntu and other distros to backport the patch, since it is trivial.
- Do nothing, ask users to upgrade. But Ubuntu 18.04 is supported at least until April 2023 :crying_cat_face:
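The third option (also checking the parent node's siblings) could look roughly like the sketch below. This is not Graby's actual ContentExtractor code: `findNoscriptFallback` is a hypothetical helper, and I am assuming the misplaced noscript ends up as the next meaningful sibling of the img's parent. The usage part builds the tree with `loadXML` so the result does not depend on how the HTML parser of a particular libxml2 version nests things.

```php
<?php
/**
 * Hypothetical helper (not part of Graby): find the <noscript> fallback
 * belonging to a lazy-loaded <img>, checking the image's own following
 * sibling first and then the following sibling of its parent.
 */
function findNoscriptFallback(DOMElement $img): ?DOMElement
{
    foreach ([$img, $img->parentNode] as $start) {
        if (!$start instanceof DOMElement) {
            continue;
        }
        for ($node = $start->nextSibling; $node !== null; $node = $node->nextSibling) {
            // Skip whitespace-only text nodes between elements.
            if ($node instanceof DOMText && trim((string) $node->nodeValue) === '') {
                continue;
            }
            if ($node instanceof DOMElement && strtolower($node->nodeName) === 'noscript') {
                return $node;
            }
            // The first meaningful sibling is something else: stop at this level.
            break;
        }
    }

    return null;
}

// Usage: the noscript sits after the <p> instead of inside it.
$doc = new DOMDocument();
$doc->loadXML(
    '<div><p><img data-lazyloaded="1" src="a.png"/></p>'
    . '<noscript><img src="a.png"/></noscript></div>'
);
$img = $doc->getElementsByTagName('img')->item(0);
$noscript = findNoscriptFallback($img);
if ($noscript !== null) {
    // Drop the fallback so the image is not duplicated.
    $noscript->parentNode->removeChild($noscript);
}
echo $doc->saveXML($doc->documentElement), "\n";
```

Whether this is enough depends on what the old libxml2 actually produces for the noscript contents, so it would need checking against the reproduction repository above.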
There is also a separate bug in tidy that wraps the img in the noscript in a p, resulting in invalid p > noscript > p nesting, but that does not seem to cause issues, thanks to another libxml2 bug :woman_shrugging:
Apparently, html5lib suffers from this even worse, even with https://github.com/j0k3r/php-readability/pull/60. I thought it might use libxml2 internally but it happens on libxml2 2.9.10 as well:
$graby = new Graby([
    'extractor' => [
        'default_parser' => 'html5lib',
        'allowed_parsers' => ['html5lib'], // Without this it would still use libxml
    ],
], new GuzzleAdapter());