Iterating over multiple runs - no new articles in spite of memoize=False
Issue by tomthebuzz
Wed Aug 8 14:30:59 2018
Originally opened as https://github.com/codelucas/newspaper/issues/605
I'm nearly at my wits' end: whenever I manually start a new run over a portfolio of 10 sources and process the articles, I get the correct number of articles. However, if I execute the same code inside an iterating "while True:" loop with a 30-60 minute wait, deleting the original np.build() variables and setting memoize_articles=False (in both the build() and Article() calls), I only ever get the articles from the initial run, regardless of whether the source has published new articles during the waiting time.
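For reference, a minimal sketch of the kind of loop described above (the source URLs are placeholders and the actual processing steps are not shown here):

import time
import newspaper

# Placeholder URLs standing in for the 10-source portfolio
sources = ['https://example-news-1.com', 'https://example-news-2.com']

while True:
    for url in sources:
        paper = newspaper.build(url, memoize_articles=False)
        for article in paper.articles:
            try:
                article.download()
                article.parse()
            except Exception:
                continue  # skip articles that fail to download or parse
            # ... process article.title, article.text, etc. ...
        del paper  # drop the build() result before the next iteration
    time.sleep(30 * 60)  # wait 30-60 minutes between runs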
Has anyone had similar experiences and found a workable solution?
Comment by naivelogic
Mon Aug 27 02:14:17 2018
Yes, I am currently having the same problem. The only workable solution I could find was to manually go to the cache location of the feeds, ~/.newspaper_scraper/feed_category_cache, and remove the files. I have yet to develop a solution for doing this in the Python function. Hope this helps.
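A rough sketch of automating that manual cleanup (the cache path is the default location mentioned above and may differ on your system):

import glob
import os

# Default feed cache location; adjust if your install puts it elsewhere
cache_dir = os.path.expanduser('~/.newspaper_scraper/feed_category_cache')

# Delete every cached feed file so the next build() starts fresh
for path in glob.glob(os.path.join(cache_dir, '*')):
    try:
        os.remove(path)
    except OSError:
        pass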
Comment by codelucas
Mon Aug 27 07:07:48 2018
Thanks for filing this @tomthebuzz and also @naivelogic.
If what you two are reporting is true, this seems to be a serious bug. I will try to reproduce it, but can you both also share the precise commands you ran to hit this issue so I can verify? Also, which OS are you using?
Comment by tomthebuzz
Mon Aug 27 08:32:27 2018
Hi Lucas,
Unfortunately it does not reproduce consistently. While iterating in 10-minute intervals, 7-8 out of 10 sources show this behavior and 2-3 work as expected. It has improved somewhat since I started catching download() errors via try/except. I will continue to monitor and report back as soon as I have something more enlightening.
Cheers -Tom
Comment by codelucas
Mon Sep 3 06:17:50 2018
The memoization behavior is becoming very annoying: a lot of users are reporting issues with the API out of confusion, which indicates the API is not ideal.
The way newspaper has handled memoizing content since the start is to cache previously scraped articles on disk and not re-scrape them, mostly because a few newspaper.build() calls on the same website will get you rate limited or banned due to the heavy load of requests. Sure, we could let the users/callers do the caching themselves, but the library is well past its design phase and it's late for a change that big.
I still think memoizing content should be the default, but maybe we can add logging.info statements whenever memoization happens so it's very clear when articles are or aren't cached.
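Something like this minimal sketch could make the behavior visible (the helper and its arguments are illustrative, not the library's actual internals):

import logging

logger = logging.getLogger('newspaper')

def log_memoization(source_domain, num_new, num_cached):
    # Hypothetical helper: report how many articles were skipped because
    # they were already in the on-disk memo cache for this source
    logger.info('%s: %d new articles, %d skipped via memo cache',
                source_domain, num_new, num_cached)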
Comment by naivelogic
Thu Sep 6 21:32:35 2018
Hey Lucas, pardon my delayed response. I do like the memoization functionality because it limits the amount of processing required; I'm glad the feature is there, since otherwise I would have had to create such a function manually. However, the caching seems to be the root of the problem where we aren't able to iterate over a list of URLs.
To remediate this issue, similar to Tom's approach, the fix that has worked sufficiently well for me is as follows:
import os

# Cached feed entry to remove (this path is included in the scraper function)
cache_to_remove = '/home/<insert user name>/.newspaper_scraper/feed_category_cache/f3b78688afc588cf439322fd84aca09a805e8a6f'

# Delete the cached article list so the next build() re-scrapes the source
try:
    os.remove(cache_to_remove)
except OSError:
    pass
Comment by codelucas
Mon Sep 10 23:08:07 2018
Thanks for your thoughts @naivelogic
In newspaper/utils.py there is a function for clearing the cache per news source. Check it out and please suggest improvements to this cache-cleaning API:
https://github.com/codelucas/newspaper/blob/master/newspaper/utils.py#L273-L280
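A rough usage sketch, assuming the helper at those lines is clear_memo_cache and takes the built source object (check the linked code before relying on it):

import newspaper
from newspaper.utils import clear_memo_cache  # assumed name from the linked lines

paper = newspaper.build('https://example-news-1.com', memoize_articles=True)
clear_memo_cache(paper)  # removes the on-disk memo file for this source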
Judging by the reports from you and @tomthebuzz, perhaps there is a bug where, even when memoize_articles is False, things are still being cached when they shouldn't be.
Alternatively, since none of this is deterministic (the HTML scraping can return a 404 or 500, or even hit a rate limit if the news site feels you are scraping too much), we don't know whether the 7-out-of-10 failure rate is due to a bug in the memoizing behavior or to the remote news site returning different data.
Comment by agnelvishal
Tue Nov 20 12:08:47 2018
Since the error is not deterministic, is multi-threading causing this problem?
Comment by ghost
Sat Apr 3 21:49:26 2021
When this happened to me it was due to rate limiting and being blocked by the sites.
I may have found an issue in the code related to this. I am running into the same problem where I do not get articles returned after calling build() after an initial successful run. I believe the memoize_articles method is returning the wrong list of articles.
return list(cur_articles.values())
should instead be
return list(valid_urls.values())
cur_articles is the list of articles that have not yet been added to the cache. I have not tested this to see whether it fixes the issue, but it would explain why the only solution is to delete the cache. https://github.com/AndyTheFactory/newspaper4k/blob/c5e4170918a6d1e99cb1bab6fd188ee8ed5a2afa/newspaper/utils/__init__.py#L121