
WordPress lazy-loading noscript cleaner broken with libxml2 < 2.9.9

Open jtojnar opened this issue 5 years ago • 3 comments

With libxml2 2.9.4 (included in Ubuntu 18.04 LTS), Graby’s WordPress lazy-loading noscript cleaner is unable to remove the second image in the noscript text:

<p><img data-lazyloaded="1" src="data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSI2MzkiIGhlaWdodD0iNDA4IiB2aWV3Qm94PSIwIDAgNjM5IDQwOCI+PHJlY3Qgd2lkdGg9IjEwMCUiIGhlaWdodD0iMTAwJSIgZmlsbD0iI2NmZDRkYiIvPjwvc3ZnPg==" class="aligncenter size-full wp-image-32079" data-src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /><noscript><img class="aligncenter size-full wp-image-32079" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /></noscript></p>

is turned into:

<p><img data-lazyloaded="1" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" class="aligncenter size-full wp-image-32079" alt="" width="639" height="408" /></p><noscript>
<p><img class="aligncenter size-full wp-image-32079" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /></p>

It works fine with libxml2 2.9.10 in later versions of Ubuntu; it was likely fixed by https://gitlab.gnome.org/GNOME/libxml2/-/commit/35e83488505d501864826125cfe6a7950d6cba78.

You can reproduce this by running

$ git clone https://github.com/jtojnar/graby-double-images && cd graby-double-images
$ composer install
$ php test.php

on a system with libxml2 older than 2.9.9, or, if you have Nix:

$ nix-shell --run 'composer install && php test.php'

See https://github.com/fossar/selfoss/issues/1230 for more details.
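Since the behaviour depends on the libxml2 version, it can help to check which version PHP reports before trying to reproduce. The following is a minimal sketch, not part of Graby: `isAffectedLibxml` is a hypothetical helper name, and the 2.9.9 threshold is taken from this issue. `LIBXML_DOTTED_VERSION` is the constant exposed by PHP's libxml extension; it may reflect the version PHP was built against rather than the library loaded at runtime, so treat the result as a hint only.

```php
<?php
// Hypothetical helper (not part of Graby): returns true when the given
// libxml2 version string is older than 2.9.9, the threshold named in this issue.
function isAffectedLibxml(string $version): bool
{
    return version_compare($version, '2.9.9', '<');
}

// LIBXML_DOTTED_VERSION is defined by PHP's libxml extension.
printf(
    "libxml2 %s: %s\n",
    LIBXML_DOTTED_VERSION,
    isAffectedLibxml(LIBXML_DOTTED_VERSION) ? 'likely affected' : 'likely not affected'
);
```

`version_compare` compares the segments numerically, so 2.9.10 is correctly treated as newer than 2.9.9.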

jtojnar • Nov 13 '20 20:11

At this point I see these possible solutions:

  • Recommend using html5lib instead of libxml, though I am not sure how performant it is.
  • Try to find out whether libxml can be made to parse the noscript inside p correctly.
  • Make the ContentExtractor also look for the noscript as a sibling of the parent node.
  • Ask Ubuntu and other distros to backport the patch, since it is trivial.
  • Do nothing and ask users to upgrade. But Ubuntu 18.04 is supported at least until April 2023 :crying_cat_face:
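The third option could look roughly like the sketch below. This is not Graby's actual ContentExtractor code: `removeNoscriptFallback` is a hypothetical helper, and the tree shape (the noscript ending up next to the p rather than inside it) is an assumption based on the output shown above. The sample tree is built with `loadXML` so that the demonstration does not depend on how the installed libxml2 parses HTML.

```php
<?php
// Hypothetical helper, not Graby's actual implementation.
// Removes the <noscript> fallback that follows a lazy-loaded <img>, checking
// the img's own next sibling first and then its parent's next sibling.
function removeNoscriptFallback(DOMElement $img): bool
{
    $candidates = [$img->nextSibling];
    if ($img->parentNode !== null) {
        $candidates[] = $img->parentNode->nextSibling;
    }

    foreach ($candidates as $node) {
        // Skip whitespace-only text nodes between elements.
        while ($node instanceof DOMText && trim($node->nodeValue) === '') {
            $node = $node->nextSibling;
        }
        if ($node instanceof DOMElement && strtolower($node->nodeName) === 'noscript') {
            $node->parentNode->removeChild($node);
            return true;
        }
    }

    return false;
}

// Assumed shape of the misparsed tree: <noscript> is a sibling of <p>.
$doc = new DOMDocument();
$doc->loadXML(
    '<div><p><img data-lazyloaded="1" src="real.png"/></p>'
    . '<noscript><p><img src="real.png"/></p></noscript></div>'
);

$img = $doc->getElementsByTagName('img')->item(0);
removeNoscriptFallback($img);

echo $doc->saveXML($doc->documentElement), "\n";
```

Checking the parent's sibling only as a fallback keeps the behaviour unchanged on libxml2 versions that parse the noscript inside the p correctly.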

jtojnar • Nov 13 '20 21:11

There is also a separate bug in tidy that wraps the img inside the noscript in a p, resulting in invalid p > noscript > p nesting. That does not seem to cause issues, though, thanks to another libxml2 bug :woman_shrugging:

jtojnar • Nov 13 '20 21:11

Apparently, html5lib suffers from this even more, even with https://github.com/j0k3r/php-readability/pull/60. I thought it might use libxml2 internally, but it happens on libxml2 2.9.10 as well:

$graby = new Graby([
	'extractor' => [
		'default_parser' => 'html5lib',
		'allowed_parsers' => ['html5lib'], // Without this it would still use libxml
	]
], new GuzzleAdapter());

jtojnar • Nov 16 '20 12:11