NewsScraper
Refactor and update newsscraper
This PR aims to add data checkpointing and extra error handling, as well as to improve the readability of the code.
- We now catch errors when calling `newspaper.build`.
- We increment a new variable `error_count` when we encounter any errors while downloading/parsing articles, or any `NoneType` publish dates. If `error_count` > 10 we skip to the next article. (Previously we only skipped after encountering 10 or more `NoneType` dates.) See the error-handling sketch after this list.
- We remove the unneeded `count` function parameter.
- We print which news site we are currently scraping out of the total number of sites (e.g. "NEWS SITE 3 OUT OF 99").
- We now save scraped data to JSON after each news site is processed rather than only at the very end, so if the script is interrupted, any data collected so far is preserved (see the checkpointing sketch below).
- We remove the default `limit` parameter in `run` so it doesn't override the user-inputted limit.