selfoss icon indicating copy to clipboard operation
selfoss copied to clipboard

Duplicate images

Open davidoskky opened this issue 3 years ago • 34 comments

I use the fulltext spout and in some feeds every image on the page is shown twice. This feed for reference: https://uxmovement.com/feed/

davidoskky avatar Nov 11 '20 10:11 davidoskky

I have the same and need to look into it.

niol avatar Nov 11 '20 10:11 niol

Hmm, weird. I cannot reproduce this on master with the provided feed and fulltextrss spout. What version of selfoss are you running?

jtojnar avatar Nov 11 '20 14:11 jtojnar

I am sorry, I wasn't on the latest release but on 16f5d68. I upgraded to c87e14a, removed that feed and added it back again to make sure it wasn't cached and the problem persists. Is there any test I might do that might help you determine the problem?

davidoskky avatar Nov 11 '20 16:11 davidoskky

There were only translation and CI updates in the meanwhile so that should not have any effect.

Could you set logger_level=DEBUG in config.ini and possibly change zero to two in https://github.com/fossar/selfoss/blob/c87e14aaba42abe281ca3c859a0006fb292ff22c/src/common.php#L28 and then share the log when updating the single source?

jtojnar avatar Nov 11 '20 17:11 jtojnar

Here it is: https://pastebin.com/7wubSz0V

davidoskky avatar Nov 11 '20 17:11 davidoskky

This is what is on the web page:

<p><img data-lazyloaded=\"1\" src=\"data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSI1NzIiIGhlaWdodD0iNDA4IiB2aWV3Qm94PSIwIDAgNTcyIDQwOCI+PHJlY3Qgd2lkdGg9IjEwMCUiIGhlaWdodD0iMTAwJSIgZmlsbD0iI2NmZDRkYiIvPjwvc3ZnPg==\" class=\"aligncenter size-full wp-image-32075\" data-src=\"https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png\" alt=\"\" width=\"572\" height=\"408\" /><noscript><img class=\"aligncenter size-full wp-image-32075\" src=\"https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png\" alt=\"\" width=\"572\" height=\"408\" /></noscript></p>

This is what I see in the database for the Strong layout hierarchy article:

<p><img src="https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png" alt="" width="572" height="408" /></p>

Looks like for some reason filtering is not triggered for you. Could you please look into the database and see what it gets turned into?

jtojnar avatar Nov 11 '20 17:11 jtojnar

In my case the image is twice in the content field in the database. In the source site, the image is also present twice, but the second time it is wrapped in a noscript tag. This is the same as the Strong layout hierarchy article. The noscript tag is gone in the database probably because of sanitizeContent(). Whitelisting noscript should be the simple fix for this.

niol avatar Nov 11 '20 18:11 niol

I would expect sanitizeContent to remove noscript completely, even with its content like it seems to do for me. It is super weird that there is a difference in behaviour, even though we presumably have the same set of composer libraries. I have PHP 7.4.3 with the following modules:

$ php -m
[PHP Modules]
bcmath
calendar
Core
ctype
curl
date
dom
exif
FFI
fileinfo
filter
ftp
gd
gettext
hash
iconv
intl
json
libxml
mbstring
mysqli
mysqlnd
openssl
pcntl
pcre
PDO
pdo_mysql
pdo_pgsql
pdo_sqlite
pgsql
Phar
posix
readline
Reflection
session
shmop
SimpleXML
soap
sockets
sodium
SPL
sqlite3
standard
sysvmsg
sysvsem
sysvshm
tidy
tokenizer
xml
xmlreader
xmlwriter
xsl
Zend OPcache
zip
zlib

[Zend Modules]
Zend OPcache

jtojnar avatar Nov 11 '20 19:11 jtojnar

In my case that image is twice in the database but there is no noscript tag.

<p><img src="https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png" alt="" width="572" height="408" /></p>
<p><img src="https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png" alt="" width="572" height="408" /></p>

davidoskky avatar Nov 11 '20 19:11 davidoskky

I have php 7.2.24 and I do not have FFI, intl, pdo_pgsql, pgsql and soap modules.

davidoskky avatar Nov 11 '20 19:11 davidoskky

By running the following PHP script in the selfoss directory, I determined that the noscript elements are still present in Graby’s DOM after site config stripping phase but not in Body after cleanupHtml, before cleanupXss.

<?php

require __DIR__ . '/src/common.php';


// print logs just to terminal
$handler = new ErrorLogHandler(ErrorLogHandler::OPERATING_SYSTEM, 'debug');
$handler->setFormatter($formatter);
$log->popHandler();
$log->pushHandler($handler);

$spout = $dice->create(spouts\rss\fulltextrss::class);

$params = [
  'url' => 'https://uxmovement.com/feed/',
];
$spout->load($params);
foreach ($spout as $item) {
  $item->getContent();
  break;
}

jtojnar avatar Nov 12 '20 01:11 jtojnar

I ran the script, but the noscript elements are still present even in the phase Body after cleanupHtml, before cleanupXss. Full log: https://pastebin.com/fmUYkic7

davidoskky avatar Nov 12 '20 09:11 davidoskky

After playing a bit with cleanupHtml method, trying to reproduce it here, it does not seem to come from there.

<?php
require __DIR__ . '/src/common.php';
$doc = new DOMDocument();
$doc->loadHTML("<p><img data-lazyloaded=\"1\" src=\"data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSI1NzIiIGhlaWdodD0iNDA4IiB2aWV3Qm94PSIwIDAgNTcyIDQwOCI+PHJlY3Qgd2lkdGg9IjEwMCUiIGhlaWdodD0iMTAwJSIgZmlsbD0iI2NmZDRkYiIvPjwvc3ZnPg==\" class=\"aligncenter size-full wp-image-32075\" data-src=\"https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png\" alt=\"\" width=\"572\" height=\"408\" /><noscript><img class=\"aligncenter size-full wp-image-32075\" src=\"https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png\" alt=\"\" width=\"572\" height=\"408\" /></noscript></p>");
$contentBlock = $doc->documentElement->getElementsByTagName('p')->item(0);
$readability = $dice->create(Readability\Readability::class);
$readability->clean($contentBlock, 'select');
$contentBlock->normalize();
$html = $contentBlock->ownerDocument->saveXML($contentBlock);
echo $html;

So it must come from doFetchContent.

jtojnar avatar Nov 12 '20 12:11 jtojnar

Right, the bad code is last seen at https://github.com/j0k3r/graby/blob/2.0.0-alpha.0/src/Extractor/ContentExtractor.php#L374 called by https://github.com/j0k3r/graby/blob/2.0.0-alpha.0/src/Graby.php#L325. And in fact 300 lines lower in the process monster method, there is a code that should remove the noscript elements:

https://github.com/j0k3r/graby/blob/2.0.0-alpha.0/src/Extractor/ContentExtractor.php#L618-L630

jtojnar avatar Nov 12 '20 12:11 jtojnar

At this point I would start sprinkling var_dump('something'); calls around the code in vendor/j0k3r/graby/src/Extractor/ContentExtractor.php to narrow down why is not the noscript code run when running the following:

<?php

require __DIR__ . '/src/common.php';

use Graby\Graby;
use Http\Adapter\Guzzle6\Client as GuzzleAdapter;
use Monolog\Handler\ErrorLogHandler;

// print logs just to terminal
$handler = new ErrorLogHandler(ErrorLogHandler::OPERATING_SYSTEM, 'debug');
$handler->setFormatter($formatter);
$log->popHandler();
$log->pushHandler($handler);

$webClient = $dice->create(helpers\WebClient::class);
$url = "https://uxmovement.com/content/strong-layout-hierarchy-reduces-content-overload/";
$graby = new Graby([
    'extractor' => [
        'config_builder' => [
            'site_config' => [\F3::get('ftrss_custom_data_dir')],
        ],
    ],
], new GuzzleAdapter($webClient->getHttpClient()));
// $graby->setLogger($log);
echo $graby->fetchContent($url)['html'];

jtojnar avatar Nov 12 '20 13:11 jtojnar

Reading your log more closely, DOM after site config stripping is still unprocessed but Body after cleanupHtml, before cleanupXss actually turns the lazy-loaded image into a real one, it is just that the image also becomes wrapped in p element for you between those two code locations so the noscript cleaner cannot remove it.

jtojnar avatar Nov 12 '20 13:11 jtojnar

Actually, the wrapped images first appear in Body after Readability.

jtojnar avatar Nov 12 '20 13:11 jtojnar

Unless there is difference in tidy settings, it must be in readability:

--- a/vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
+++ b/vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
@@ -198,7 +198,7 @@
             unset($count);
         }
 
-        $this->logger->debug('HTML after site config strings replacements', ['html' => $html]);
+        $this->logger->debug('HTML after site config strings replacements', ['html' => $html]); // here the images are unwrapped for you still
 
         // load and parse html
         $parser = $this->siteConfig->parser();
@@ -208,13 +208,16 @@
             $parser = $this->config['default_parser'];
         }
 
-        $this->logger->info('Attempting to parse HTML with {parser}', ['parser' => $parser]);
+        $this->logger->info('Attempting to parse HTML with {parser}', ['parser' => $parser]); // here you have same libxml as me
 
         $this->readability = $this->getReadability($html, $url, $parser, $this->siteConfig->tidy() && $smartTidy);
         $tidied = $this->readability->tidied;
+        var_dump(['sc-tidy' => $this->siteConfig->tidy()]); // true for me
+        var_dump(['smarttidy' => $smartTidy]); // true for me
+        var_dump(['tidied' => $tidied]); // false for me
 
         $this->logger->info('Body size after Readability: {length}', ['length' => \strlen($this->readability->dom->saveXML())]);
-        $this->logger->debug('Body after Readability', ['dom_saveXML' => $this->readability->dom->saveXML()]);
+        $this->logger->debug('Body after Readability', ['dom_saveXML' => $this->readability->dom->saveXML()]); // here the images are already wrapped for you!!!
 
         // we use xpath to find elements in the given HTML document
         $this->xpath = new \DOMXPath($this->readability->dom);
@@ -313,6 +316,7 @@
         }
 
         // strip elements (using xpath expressions)
+        var_dump(['strip' => $this->siteConfig->strip]);
         foreach ($this->siteConfig->strip as $pattern) {
             $this->logger->info('Trying {pattern} to strip element', ['pattern' => $pattern]);
             $elems = $this->xpath->query($pattern, $this->readability->dom);
@@ -342,6 +346,7 @@
         }
 
         // strip images (using src attribute values)
+        var_dump(['strip_image_src' => $this->siteConfig->strip_image_src]);
         foreach ($this->siteConfig->strip_image_src as $string) {
             $string = strtr($string, ["'" => '', '"' => '']);
 
@@ -559,7 +564,7 @@
         }
 
         if ($detectBody && $success) {
-            $this->logger->info('Detecting body');
+            $this->logger->info('Detecting body'); // this gets printed in your logs
             $this->body = $this->readability->getContent();
 
             if (1 === $this->body->childNodes->length && XML_ELEMENT_NODE === $this->body->firstChild->nodeType) {
@@ -573,6 +578,7 @@
             }
         }
 
+        // var_dump(['bodySet' => $this->body]);
         if (isset($this->body)) {
             // remove any h1-h6 elements that appear as first thing in the body
             // and which match our title
@@ -603,6 +609,7 @@
             }
 
             // remove image lazy loading
+            var_dump(['lazy_load_attributes' => $this->config['src_lazy_load_attributes']]);
             foreach ($this->body->getElementsByTagName('img') as $e) {
                 $hasAttribute = false;
                 foreach ($this->config['src_lazy_load_attributes'] as $attribute) {
@@ -611,6 +618,7 @@
                     }
                 }
 
+                var_dump(['hasAttribute' => $hasAttribute]);
                 if (false === $hasAttribute) {
                     continue;
                 }
@@ -620,6 +628,7 @@
                 // inside the data-lazy-src attribute. It also places the original image inside a noscript element
                 // next to the amended one.
                 // @see https://plugins.trac.wordpress.org/browser/lazy-load/trunk/lazy-load.php
+                var_dump(['nextSibling' => $e->nextSibling !== null ? $e->nextSibling->nodeName : 'null']);
                 if (null !== $e->nextSibling && 'noscript' === $e->nextSibling->nodeName) {
                     $newElem = $e->ownerDocument->createDocumentFragment();
                     $newElem->appendXML($e->nextSibling->innerHTML);
@@ -660,7 +669,7 @@
             );
         }
 
-        $this->logger->info('Success ? {is_success}', ['is_success' => $this->success]);
+        $this->logger->info('Success ? {is_success}', ['is_success' => $this->success]); // This gets prited as well.
 
         return $this->success;
     }

jtojnar avatar Nov 12 '20 13:11 jtojnar

This code is particularly suspicious:

https://github.com/j0k3r/php-readability/blob/c6425cc28bba80967f13d8a87b97386838500c5a/src/Readability.php#L997-L1011

jtojnar avatar Nov 12 '20 14:11 jtojnar

Here is the output after running that script: https://pastebin.com/zKj9VVbb I do not get anything in the logs though, should I also give you the logs after refreshing the source? In my case tidied is true.

davidoskky avatar Nov 12 '20 17:11 davidoskky

I commented out the $graby->setLogger($log); line in the code above so that the output does not get lost in the noise.

It would be useful to uncomment it and try to enable logging for readability itself:

--- a/vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
+++ b/vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
@@ -1008,7 +1008,7 @@
      */
     private function getReadability($html, $url, $parser, $enableTidy)
     {
-        $readability = new Readability($html, $url, $parser, $enableTidy);
+        $readability = new Readability($html, $url, $parser, $enableTidy, $this->logger);
 
         if (isset($this->config['readability']['pre_filters']) && \is_array($this->config['readability']['pre_filters'])) {
             foreach ($this->config['readability']['pre_filters'] as $filter => $replacer) {
--- a/vendor/j0k3r/php-readability/src/Readability.php
+++ b/vendor/j0k3r/php-readability/src/Readability.php
@@ -177,14 +177,14 @@
      * @param string $parser   Which parser to use for turning raw HTML into a DOMDocument
      * @param bool   $use_tidy Use tidy
      */
-    public function __construct($html, $url = null, $parser = 'libxml', $use_tidy = true)
+    public function __construct($html, $url = null, $parser = 'libxml', $use_tidy = true, $logger = null)
     {
         $this->url = $url;
         $this->html = $html;
         $this->parser = $parser;
         $this->useTidy = $use_tidy && \function_exists('tidy_parse_string');
 
-        $this->logger = new NullLogger();
+        $this->logger = $logger->withName('php-readability') ?: new NullLogger();
         $this->loadHtml();
     }
 

For me there are not really many php-readability logs before selfoss.DEBUG: Body after Readability:

[2020-11-12 19:05:02] php-readability.DEBUG: Parsing URL: https://uxmovement.com/content/strong-layout-hierarchy-reduces-content-overload/  

[2020-11-12 19:05:02] php-readability.DEBUG: Tidying document  

jtojnar avatar Nov 12 '20 19:11 jtojnar

Also turns out the tidied === true is not it either, I installed php-tidy in my work directory and it makes no difference (I had it on the web server before anyway).

jtojnar avatar Nov 12 '20 19:11 jtojnar

Same in my case, not many readability logs before that one: https://pastebin.com/htYqhYhC

davidoskky avatar Nov 12 '20 22:11 davidoskky

Looks like https://uxmovement.com/wp-content/uploads/2020/02/wireframe-tout.png is actually not wrapped in p but https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png is – the latter is not followed by </p> in HTML after site config strings replacements but is followed by it in Body after Readability.

We are running out of options, must be one of these options:

--- a/vendor/j0k3r/php-readability/src/Readability.php
+++ b/vendor/j0k3r/php-readability/src/Readability.php
@@ -1391,6 +1391,7 @@
         mb_regex_encoding('UTF-8');
 
         // HACK: dirty cleanup to replace some stuff; shouldn't use regexps with HTML but well...
+        var_dump(['pre_filters' => $this->pre_filters]);
         if (!$this->flagIsActive(self::FLAG_DISABLE_PREFILTER)) {
             foreach ($this->pre_filters as $search => $replace) {
                 $this->html = preg_replace($search, $replace, $this->html);
@@ -1410,7 +1411,10 @@
         if ($this->useTidy) {
             $this->logger->debug('Tidying document');
 
+            var_dump(['tidy_config' => $this->tidy_config]);
+            var_dump(['pre-tidy' => $this->html]);
             $tidy = tidy_repair_string($this->html, $this->tidy_config, 'UTF8');
+            var_dump(['post-tidy' => $tidy]);
             if (false !== $tidy && $this->html !== $tidy) {
                 $this->tidied = true;
                 $this->html = $tidy;
@@ -1431,6 +1435,7 @@
             $this->dom = new \DOMDocument();
             $this->dom->preserveWhiteSpace = false;
             $this->dom->loadHTML($this->html, LIBXML_NOBLANKS | LIBXML_COMPACT | LIBXML_NOERROR);
+            var_dump(['post-load' => $this->dom->saveHTML()]);
 
             libxml_use_internal_errors(false);
         }

I would bet on the but it is weird that it would behave differently for us. ~~Even though it is consistent for me on Nix with tidy 5.6.0 and Ubuntu.~~ Ubuntu has that same version.

jtojnar avatar Nov 12 '20 23:11 jtojnar

Here is the log https://pastebin.com/EKcxR5sB The problem does in fact appear to happen after tidy. I have installed the package php7.2-tidy from Ubuntu repository.

davidoskky avatar Nov 12 '20 23:11 davidoskky

Looking at your config it is actually only the post-load dump that adds the </p> (look at the first instance of https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png in each text).

In fact diffing our dumps for these three steps show that only loading is different – tidy works the same for us.

jtojnar avatar Nov 12 '20 23:11 jtojnar

Looking at the post-tidy HTML in validator, it seems that your libxml is probably behaving correctly.

<noscript> is supposed to have transparent content model, that is, only elements that would be allowed if the <noscript> element was not there are allowed. Since the <noscript> is inside <p>, it cannot contain <p> so when it occurs in there, the previous </p> is closed (which must be before <noscript> to avoid opening and closing tag crossing).

My parser seems to be unaware that paragraphs cannot contain paragraphs so it parses as if it were well-formed XML document.

Could you try to echo LIBXML_DOTTED_VERSION; in the script? I have 2.9.10.

jtojnar avatar Nov 13 '20 00:11 jtojnar

I get 2.9.4

davidoskky avatar Nov 13 '20 00:11 davidoskky

So we actually have two bugs:

  • Tidy wraps p > noscript > img producing incorrect p > noscript > p > img. We both have this bug.
  • libxml is parsing HTML inconsistently between versions.

jtojnar avatar Nov 13 '20 00:11 jtojnar

Seems to be something specific to our config, I can reproduce the tidy bug with:

<?php
$tidy_config = [
    'tidy-mark' => false,
    'vertical-space' => false,
    'doctype' => 'omit',
    'numeric-entities' => false,
    // 'preserve-entities' => true,
    'break-before-br' => false,
    'clean' => true,
    'output-xhtml' => true,
    'logical-emphasis' => true,
    'show-body-only' => false,
    'new-blocklevel-tags' => 'article aside audio bdi canvas details dialog figcaption figure footer header hgroup main menu menuitem nav section source summary template track video',
    'new-empty-tags' => 'command embed keygen source track wbr',
    'new-inline-tags' => 'audio command datalist embed keygen mark menuitem meter output progress source time video wbr',
    'wrap' => 0,
    'drop-empty-paras' => true,
    'drop-proprietary-attributes' => false,
    'enclose-text' => true,
    'enclose-block-text' => true,
    'merge-divs' => true,
    // 'merge-spans' => true,
    'input-encoding' => '????',
    'output-encoding' => 'utf8',
    'hide-comments' => true,
];
$html = '<p><img src="foo.png" alt="" /><noscript><img src="foo.png" alt="" /></noscript></p>';
var_dump(tidy_repair_string($html, $tidy_config, 'UTF8'));

If I use empty config, it does not add the incorrect paragraph.

Edit: Looks like it is the enclose-block-text.

jtojnar avatar Nov 13 '20 00:11 jtojnar