
Recursive multipage fetch error

Open rookiejet opened this issue 9 months ago • 3 comments

After #199, the recursive multipage option fails for the following recipe:

{
    "name": "arstechnica.com",
    "url": "arstechnica.com",
    "stamp": 1714614362,
    "author": "",
    "match": "arstechnica.com",
    "config": {
        "type": "xpath",
        "xpath": "div[contains(@class, 'article-content')]",
        "multipage": {
            "xpath": "nav[contains(@class, 'page-numbers')]\/span\/a[last()]",
            "append": true,
            "recursive": true
        },
        "modify": [
            {
                "type": "regex",
                "pattern": "\/<li.*? data-src=\"(.*?)\".*?>\\s*<figure.*?>.*?(?:<figcaption.*?<div class=\"caption\">(.*?)<\\\/div>.*?<\\\/figcaption>)?\\s*<\\\/figure>\\s*<\\\/li>\/s",
                "replace": "<figure><img src=\"$1\"\/><figcaption>$2<\/figcaption><\/figure>"
            }
        ],
        "cleanup": [
            "aside",
            "div[contains(@class, 'sidebar')]"
        ]
    }
}

Fetches for multipage articles using that recipe now fail with memory exhaustion:

[00:52:14/84489] guid 2,https://arstechnica.com/?p=2021143 (hash: {"ver":2,"uid":2,"hash":"SHA1:c9da0973a1
d72d22dc9e1d1c8a92524a07d0a847"} compat: SHA1:63f3a3eb0ea7ce00279023a4f8ff530c6350776f)
[00:52:14/84489] orig date: 1714601978 (2024-05-01 22:19:38)
[00:52:14/84489] title All the ways streaming services are aggravating their subscribers this week
[00:52:14/84489] link https://arstechnica.com/?p=2021143
[00:52:14/84489] language en
[00:52:14/84489] author Scharon Harding
[00:52:14/84489] looking for tags...
[00:52:14/84489] tags found: ads, comcast, nbcuniversal, netflix, peacock, sony, streaming, tech, warner bros. discovery
[00:52:14/84489] done collecting data.
[00:52:14/84489] looking for enclosures...
[00:52:14/84489] article hash: 0d2ba6fe8dff3f0d87a7174e8a00398543d691b1 [stored=]
[00:52:14/84489] hash differs, running HOOK_ARTICLE_FILTER handlers...
[00:52:14/84489] [UrlHelper] fetching: https://arstechnica.com/?p=2021143
[00:52:14/84489] [UrlHelper] fetching: https://arstechnica.com/gadgets/2024/05/all-the-ways-streaming-services-are-aggravating-their-subscribers-this-week/2/
[00:52:14/84489] [UrlHelper] fetching: https://arstechnica.com/gadgets/2024/05/all-the-ways-streaming-services-are-aggravating-their-subscribers-this-week/3/
PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 4096 bytes) in /srv/www/plugins.local/feediron/bin/fi_helper.php on line 67

I reverted the change to this line in commit https://github.com/feediron/ttrss_plugin-feediron/pull/199/commits/d0cd6a052cb78a96f40a366273ce5a3a765a6e95 and the above recipe is working again.

Article tested: https://arstechnica.com/gadgets/2024/05/all-the-ways-streaming-services-are-aggravating-their-subscribers-this-week/
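For context, the failure mode can be modeled outside the plugin. The sketch below is a hypothetical Python analogue, not feediron's actual code (names like fetch_pages and get_next are made up): a recursive fetcher that appends every discovered "next page" link to the list it is iterating, without deduplication, never terminates once any page links back to one already in the list. The iteration cap stands in for PHP's 128 MB memory_limit.

```python
def fetch_pages(start_url, get_next, max_iterations=1000):
    """Follow 'next page' links recursively. Without a visited set,
    a link cycle makes the page list grow until 'memory' runs out."""
    pages = [start_url]
    for url in pages:                    # the list grows while we iterate it
        nxt = get_next(url)
        if nxt is not None:
            pages.append(nxt)            # duplicates are re-queued forever
        if len(pages) > max_iterations:  # stand-in for the PHP memory limit
            raise MemoryError("page list grew without bound")
    return pages

# Hypothetical site where page 3's pagination selector resolves back to page 2:
site = {"p1": "p2", "p2": "p3", "p3": "p2"}
try:
    fetch_pages("p1", site.get)
except MemoryError as e:
    print(e)
```

A cycle-free chain terminates normally; only the back-link turns the walk into unbounded growth, which matches the log above ending after page 3.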

rookiejet avatar May 02 '24 02:05 rookiejet

Ooof, I did not catch that. I hope I'll be able to sit down this weekend to look at it. I'm fairly sure it's a matter of removing https://github.com/feediron/ttrss_plugin-feediron/blob/a586a2a96ceb4760d3c2e74d6968129f1b6caca8/init.php#L471 and iterating the maxpages counter.

@monofox any thoughts?

dugite-code avatar May 02 '24 07:05 dugite-code

I can reproduce it. Looking deeper into that part of the links array is fairly interesting; I'll try to understand the algorithm tomorrow evening. In this case the link is not reformatted, since that option is disabled in the configuration. Just for your reference: replacing line https://github.com/feediron/ttrss_plugin-feediron/blob/a586a2a96ceb4760d3c2e74d6968129f1b6caca8/init.php#L457 and reverting to a non-referencing $lnk will solve it. But then the $links array has different values when reformat is executed. That's what I mean by understanding the correct algorithm.
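The by-reference detail above has a compact analogue outside PHP. PHP's foreach normally walks a copy of the array, but iterating by reference (as with the $lnk at init.php#L457) walks the live array, so appending an already-known link extends the traversal. Python list iteration behaves the same way, which gives a minimal model of the permanent loop (this is an illustration, not plugin code; a guard is added so the sketch terminates):

```python
links = ["page-2"]
steps = 0
for lnk in links:            # like PHP's by-reference foreach: walks the live list
    links.append("page-2")   # re-queuing a known link keeps the loop alive
    steps += 1
    if steps >= 10:          # guard; without it this never terminates
        break
print(steps)                 # 10: the loop stopped only because of the guard
```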

monofox avatar May 02 '24 15:05 monofox

@dugite-code: I've made a proposal in #202.

  • Manipulating the link by reference during the foreach led to a permanent loop of the foreach when recursive mode is enabled.
  • Dropping the manipulation of the links array entirely stops the loop, but also breaks recursive mode.
  • So the proposal is to do the reformatting at the beginning, while ensuring that only links which have not been seen yet are re-processed.

Please let me know your opinion on this.

What I noticed while testing the different scenarios is that the counter and the links do not contain the right values while iterating in recursive mode. The counter is only increased on the next call, so if a page contains 4 links, all 4 subsequent links are considered to be page 2. This could be an issue for properly honoring the $maxpages setting. Because the $links array is replaced, then merged and extended, the seenlinks array never contains all previously seen links, as it is replaced with its own entries. But this is not relevant for solving this issue.
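One way around the counter problem described above is to carry a page number alongside each queued link instead of keeping a single shared counter, so every hop increments its own depth and $maxpages can be enforced per link. A hedged sketch with made-up names (fetch_with_depth, get_links), not a proposal for the actual code:

```python
def fetch_with_depth(start, get_links, maxpages=3):
    """Breadth-first link fetch where each queued entry carries its
    own page number, so a depth limit applies per link, not globally."""
    seen = set()
    queue = [(start, 1)]                      # (url, page number)
    fetched = []
    while queue:
        url, depth = queue.pop(0)
        if url in seen or depth > maxpages:
            continue
        seen.add(url)
        fetched.append((url, depth))
        for nxt in get_links(url):
            queue.append((nxt, depth + 1))    # each hop gets its own counter
    return fetched

# A page fanning out to several links; only depths <= maxpages are fetched:
site = {"p1": ["p2a", "p2b"], "p2a": ["p3"], "p2b": ["p3"], "p3": ["p4"], "p4": []}
print(fetch_with_depth("p1", lambda u: site.get(u, []), maxpages=3))
```

With maxpages=3 the sketch fetches p1, both page-2 links, and p3 once, while p4 at depth 4 is skipped.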

monofox avatar May 03 '24 19:05 monofox