
After downloading a few hundred articles it mass fails

steeljardas opened this issue 2 years ago · 14 comments

So I am using newspaper3k to mass-download articles while scraping Google. I noticed that after a couple of hours of downloading hundreds of different articles, it continuously gives me an error when calling article.parse() because the article was not downloaded. From that point onwards this happens to every single URL until I wait for a little bit; if I restart the scraping after waiting 5-10 minutes, it works again.

What could be the issue?

steeljardas avatar Dec 30 '21 00:12 steeljardas

Probably Google or an intermediary is temporarily banning the IP.

banagale avatar Dec 30 '21 00:12 banagale

Probably Google or an intermediary is temporarily banning the IP.

Google isn't banning it, because I'm still getting the links from Google; however, newspaper isn't able to download them. Or at least it isn't able to parse them, since that's what's triggering the errors. (In fact, I usually check for H2 tags before parsing and it actually manages to get them, but once I try parsing it triggers errors.)

steeljardas avatar Dec 30 '21 02:12 steeljardas

There could be several problems. Can you share your code?

johnbumgarner avatar Dec 30 '21 03:12 johnbumgarner

There could be several problems. Can you share your code?

here: https://pastebin.com/uAH8Mx2s

It's a bit messy, but essentially it googles the keyword, grabs the 10 links, goes into each of them, and downloads them using newspaper.

It then uses Beautiful Soup to grab the H2s; however, when the issue I mention in the OP happens, it keeps erroring out on article.parse().

steeljardas avatar Dec 30 '21 12:12 steeljardas

So based on your code you are querying Google via search -- https://google.com/search?q={query2}

This methodology will throw errors with both Newspaper3k and BeautifulSoup. I would recommend adding some error handling to your code.

Here is my Stack Overflow answer on error handling with Newspaper3k.

https://stackoverflow.com/questions/69728117/newspaper3k-filter-out-bad-url-while-extracting/69729136#69729136
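
The gist of it is to wrap the download/parse calls and skip URLs that fail instead of letting one bad response kill the whole run. Here is a minimal sketch of that pattern (not the exact code from the answer; the helper name is just illustrative):

from newspaper import Article
from newspaper.article import ArticleException

def fetch_article(url):
    """Download and parse one URL; return None instead of raising."""
    article = Article(url)
    try:
        article.download()
        article.parse()
    except ArticleException as error:
        # Raised as "Article `download()` failed ..." when the fetch was
        # blocked, timed out, or returned something that isn't an article.
        print(f'Skipping {url}: {error}')
        return None
    return article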

Take a look at this for handling soup errors -

https://www.tutorialspoint.com/beautiful_soup/beautiful_soup_trouble_shooting.htm

I would also recommend adding a random sleep between requests:

from time import sleep
from random import randint

# this sleep timer is helping with some timeout issues
# that were happening when querying
sleep(randint(1, 5))

Please let me know if you need any additional support.

P.S. Break your code into functions

johnbumgarner avatar Dec 30 '21 16:12 johnbumgarner

Also take a look at my NewsPaper3k Usage Document.

I will look at adding a search example to my NewsHound project, which should be released in the coming weeks. I'm waiting on @banagale to finish his tests before the code is released 😊

johnbumgarner avatar Dec 30 '21 16:12 johnbumgarner

So based on your code you are querying Google via search -- https://google.com/search?q={query2} […]

Yeah, I recently changed it to handle the errors that way. I've been getting this error often too: [WinError 3] The system cannot find the path specified: 'C:\Users\STEELH~1\AppData\Local\Temp\.newspaper_scraper\article_resources'

This happens after a few hours of non-stop scraping/downloading articles, and every single link gets this error from that point onwards for some reason, until I stop the program and run it again.

steeljardas avatar Dec 30 '21 23:12 steeljardas

This path (C:\Users\STEELH~1\AppData\Local\Temp\.newspaper_scraper\article_resources) is used for storing content and for garbage collection. I'm going to assume that the resource becomes unavailable for some reason.

Have you tried to increase the size of your temp directory?
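
I haven't dug into why that folder disappears, but as a workaround you could recreate it before each batch. A short sketch (the path is taken from your error message; Windows temp cleanup removing it mid-run is only a guess):

import os
import tempfile

# Scratch directory newspaper3k complains about in the WinError 3 message
scraper_dir = os.path.join(tempfile.gettempdir(),
                           '.newspaper_scraper', 'article_resources')

# exist_ok=True makes this safe to call before every download
os.makedirs(scraper_dir, exist_ok=True)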

johnbumgarner avatar Dec 30 '21 23:12 johnbumgarner

This path (C:\Users\STEELH~1\AppData\Local\Temp\.newspaper_scraper\article_resources) is used for storing content and for garbage collection. I'm going to assume that the resource becomes unavailable for some reason.

Have you tried to increase the size of your temp directory?

It shouldn't have a limit aside from the actual SSD capacity (which still has plenty of space left), which is why I'm not sure why it's happening.

steeljardas avatar Dec 30 '21 23:12 steeljardas

Can you post your current code to Pastebin so I can look at it again?

johnbumgarner avatar Dec 30 '21 23:12 johnbumgarner

Can you post your current code to Pastebin so I can look at it again?

It's the same as what I posted above, except for the "except" clause:

try:
    article.parse()
except (newspaper.article.ArticleException, OSError) as e:
    print(e)

Everything else is exactly the same. (Also, you mentioned my scraping Google with the query string, but I'm doing that with requests, not with newspaper; I only do that to grab the website links, and those links are the ones I download with newspaper3k afterwards.)

steeljardas avatar Dec 31 '21 00:12 steeljardas

Your code is very hard to read. I would recommend breaking it into at least 3 functions, which will help both of us troubleshoot it. If you open a question on Stack Overflow, I will help you debug the code further there.
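
Roughly something like this (just a skeleton with made-up names to show the split, not a drop-in replacement for your pastebin code):

import requests
from bs4 import BeautifulSoup
from newspaper import Article
from newspaper.article import ArticleException

def search_google(query, headers):
    """Fetch a Google results page and return the outbound links in it."""
    response = requests.get('https://google.com/search', params={'q': query},
                            headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].startswith('http')]

def download_article(url):
    """Download and parse a single article; return None on failure."""
    article = Article(url)
    try:
        article.download()
        article.parse()
    except (ArticleException, OSError) as error:
        print(f'{url}: {error}')
        return None
    return article

def extract_h2(article):
    """Pull the H2 headings out of the parsed article's HTML."""
    soup = BeautifulSoup(article.html, 'html.parser')
    return [h2.get_text(strip=True) for h2 in soup.find_all('h2')]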

Also, what is your use case for scraping Google for keywords and extracting the content?

johnbumgarner avatar Dec 31 '21 15:12 johnbumgarner

No news from you, @johnbumgarner, concerning the NewsHound project since then! I would be glad to contribute to the code as soon as you release it. Take care, cheers!

tsoukanas avatar Apr 01 '22 14:04 tsoukanas

@tsoukanas the initial BETA release of the project is almost done. I'm currently trying to figure out how to improve the speed of extraction, which seems slow. I'm also the only one writing and testing the code, so it takes time to iron out the bugs.

BTW, I have already written all the documentation for the BETA release. One feature that I won't be adding is any NLP functionality, unless there is a reason to add it later.

johnbumgarner avatar Apr 19 '22 16:04 johnbumgarner