
Iterating over multiple runs - no new articles in spite of memoize=False

Open AndyTheFactory opened this issue 2 years ago • 9 comments

Issue by tomthebuzz Wed Aug 8 14:30:59 2018 Originally opened as https://github.com/codelucas/newspaper/issues/605


I am nearly at my wits' end: whenever I manually start a new run over a portfolio of 10 sources and process the articles, I get the correct number of articles. If, however, I execute the same code inside an iterating "while True:" loop with a 30-60 minute wait, deleting the original np.build() variables and passing memoize_articles=False (to both the build() and Article() calls), I only ever get the articles from the initial run, regardless of whether the source has published new articles during the waiting time.

Has anyone run into the same behavior and found a workable solution?
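
For illustration, the setup described above might look roughly like the following (the source list, the wait time, and the variable names are placeholders, not the reporter's actual code):

    import time
    import newspaper

    # placeholder portfolio of news sources
    portfolio = [
        'https://example-news-site-1.com',
        'https://example-news-site-2.com',
    ]

    while True:
        for url in portfolio:
            # memoize_articles=False should disable the on-disk cache, so every
            # pass is expected to see the source's current article list
            source = newspaper.build(url, memoize_articles=False)
            print(url, len(source.articles))
        # wait 30-60 minutes before the next pass
        time.sleep(30 * 60)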

AndyTheFactory avatar Oct 24 '23 12:10 AndyTheFactory

Comment by naivelogic Mon Aug 27 02:14:17 2018


Yes, I am currently having the same problem. The only workable solution I could find was to manually go to the cache location of the feeds (~/newspaper_scraper/feed_category_cache) and remove the files. I have yet to develop a solution that does this from within the Python function. Hope this helps.
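
For reference, a minimal sketch of doing that cleanup from Python rather than by hand (the cache path below is the location mentioned above; adjust it to wherever the cache lives on your system):

    import glob
    import os

    # feed cache location mentioned above; adjust if yours differs
    cache_dir = os.path.expanduser('~/newspaper_scraper/feed_category_cache')

    # delete every cached feed file so the next build() starts fresh
    for path in glob.glob(os.path.join(cache_dir, '*')):
        os.remove(path)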

AndyTheFactory avatar Oct 24 '23 12:10 AndyTheFactory

Comment by codelucas Mon Aug 27 07:07:48 2018


Thanks for filing this @tomthebuzz and also @naivelogic.

If what you're reporting is true, this seems to be a serious bug. I will try to reproduce it, but can you two also share the precise commands you ran to hit this issue, and which OS you are using, so I can verify?

AndyTheFactory avatar Oct 24 '23 12:10 AndyTheFactory

Comment by tomthebuzz Mon Aug 27 08:32:27 2018


Hi Lucas,

Unfortunately it does not reproduce consistently. While iterating in 10-minute intervals, 7-8 out of 10 sources show this behavior and 2-3 work as expected. It has improved somewhat since I started catching download() errors via try/except. I will continue to monitor and report back as soon as I have something more enlightening.
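
A minimal sketch of that kind of error handling, with a placeholder URL list and assuming ArticleException is the exception raised on failed downloads:

    from newspaper import Article, ArticleException

    article_urls = ['https://example.com/some-article']  # placeholder list

    for url in article_urls:
        article = Article(url, memoize_articles=False)
        try:
            article.download()
            article.parse()
        except ArticleException:
            # skip articles that fail to download (404/500, rate limiting, ...)
            continue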

Cheers -Tom


AndyTheFactory avatar Oct 24 '23 12:10 AndyTheFactory

Comment by codelucas Mon Sep 3 06:17:50 2018


The memoization behavior is becoming a real pain point: a lot of users are reporting issues out of confusion with the API, which indicates the API is not ideal.

Since the start, newspaper has handled memoizing content by caching previously scraped articles on disk and not re-scraping them, mostly because a few newspaper.build() calls on the same website will get you rate limited or banned due to the heavy load of requests. Sure, we could let the users/callers do the caching themselves, but the library is well past its design phase and it's too late for a change that big.

I still think memoizing content should be the default, but maybe we can force in logging.info statements whenever memoization happens so it's very clear when articles are or aren't cached.
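
As a sketch of that suggestion (purely illustrative, not the library's actual code), the memoization step could log how many articles it drops on each run:

    import logging

    log = logging.getLogger(__name__)

    def filter_cached(cached_urls, scraped_urls):
        # hypothetical helper: drop URLs already seen on a previous run and say so loudly
        new_urls = [url for url in scraped_urls if url not in cached_urls]
        log.info('memoize: %d of %d scraped articles already cached, %d new',
                 len(scraped_urls) - len(new_urls), len(scraped_urls), len(new_urls))
        return new_urls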

AndyTheFactory avatar Oct 24 '23 12:10 AndyTheFactory

Comment by naivelogic Thu Sep 6 21:32:35 2018


Hey Lucas, pardon my response delay. I do like the memoization functionality because it limits the amount of processing required; I'm glad the feature is there, because otherwise I would have had to create such a function manually. However, the caching seems to be the root of the problem where we aren't able to iterate over a list of URLs.

To remediate this issue, similar to Tom's approach, the fix that has worked sufficiently well for me is as follows:

import os

# cached feed entry created by newspaper for one of the scraped sources
cache_to_remove = '/home/<insert user name>/.newspaper_scraper/feed_category_cache/f3b78688afc588cf439322fd84aca09a805e8a6f'

# remove the cached article list inside the scraper function before re-running
try:
    os.remove(cache_to_remove)
except OSError:
    pass

AndyTheFactory avatar Oct 24 '23 12:10 AndyTheFactory

Comment by codelucas Mon Sep 10 23:08:07 2018


Thanks for your thoughts @naivelogic

In newspaper/utils.py we have a function available for clearing the cache per news source. Check it out and please suggest improvements in this cache cleaning API

https://github.com/codelucas/newspaper/blob/master/newspaper/utils.py#L273-L280
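
For reference, a usage sketch assuming the helper behind that link is clear_memo_cache(source), which removes the on-disk cache entry for a single built source:

    import newspaper
    from newspaper.utils import clear_memo_cache

    source = newspaper.build('https://example-news-site.com', memoize_articles=False)
    # wipe whatever newspaper has cached for this domain before the next run
    clear_memo_cache(source)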

Judging by the reports from you and @tomthebuzz, perhaps there is a bug where, even when memoize_articles is False, things are still getting cached when they shouldn't be.

Alternatively, since none of this is deterministic (the HTML scraping portion can return a 404 or 500 error, or even hit a rate limit if the news site thinks you are scraping too much), we don't know whether the 7-out-of-10 failures are due to a bug in the memoizing behavior or to the remote news site returning different data.

AndyTheFactory avatar Oct 24 '23 12:10 AndyTheFactory

Comment by agnelvishal Tue Nov 20 12:08:47 2018


Since the error is not deterministic, is multi-threading causing this problem?

AndyTheFactory avatar Oct 24 '23 12:10 AndyTheFactory

Comment by ghost Sat Apr 3 21:49:26 2021


When this happened for me it was due to rate limiting and being blocked by the sites.

AndyTheFactory avatar Oct 24 '23 12:10 AndyTheFactory

I may have found an issue in the code related to this. I am running into the same problem where I do not get any articles back from build() after an initial successful run. I believe the memoize_articles method is returning the wrong list of articles:

return list(cur_articles.values())

should instead be

return list(valid_urls.values())

cur_articles is the list of articles that have not yet been added to the cache. I have not tested whether this fixes the issue, but it would explain why the only workaround is to delete the cache. https://github.com/AndyTheFactory/newspaper4k/blob/c5e4170918a6d1e99cb1bab6fd188ee8ed5a2afa/newspaper/utils/__init__.py#L121
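
A simplified paraphrase of the distinction being described, purely illustrative and not the actual newspaper4k code:

    def memoize_sketch(cached, scraped_articles):
        # cached: dict of url -> article kept from previous runs
        # cur_articles: articles from this run whose URLs are not yet cached
        cur_articles = {a.url: a for a in scraped_articles if a.url not in cached}
        # valid_urls: everything now considered known (previous cache plus new finds)
        valid_urls = {**cached, **cur_articles}
        # current behaviour per the comment above: return only the uncached articles;
        # the proposed change is to return list(valid_urls.values()) instead
        return list(cur_articles.values())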

MikePone avatar Jan 29 '25 06:01 MikePone