Refactor and update newsscraper

Open fawazshah opened this issue 4 years ago • 0 comments

This PR adds data checkpointing and extra error handling, and improves the readability of the code.

We now catch errors when calling newspaper.build
We increment a new variable error_count whenever we hit an error while downloading/parsing an article, or whenever an article has a NoneType publish date. If error_count > 10 we skip to the next article. (Previously we skipped only after encountering 10 or more NoneType dates.)
We remove the unneeded count function parameter
We print the position of the news site currently being scraped out of the total number of sites (e.g. "NEWS SITE 3 OUT OF 99")
We now save scraped data to JSON after each news site is processed rather than at the very end of processing, so if the script gets interrupted, any data collected so far has already been saved
We remove the default value of the limit parameter in run so it doesn't override the user-supplied limit
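
For reviewers, here is a minimal sketch of how the error counting, progress message, and per-site checkpointing described above could fit together. It is not the code in this PR: `scrape_site`, `parse_article`, `MAX_ERRORS`, `out_path`, and the shape of `sites` are hypothetical names chosen for illustration, and the real download/parse step is replaced by a caller-supplied function. The sketch also assumes that exceeding the error threshold means abandoning the current site.

```python
import json
from pathlib import Path

MAX_ERRORS = 10  # threshold from the description above; the name is hypothetical


def scrape_site(articles, parse_article, limit):
    """Collect up to `limit` parsed articles from one site.

    Stops early once more than MAX_ERRORS problems (exceptions or
    missing publish dates) have been counted for this site.
    """
    results = []
    error_count = 0
    for article in articles:
        if len(results) >= limit or error_count > MAX_ERRORS:
            break
        try:
            parsed = parse_article(article)  # stand-in for download + parse
        except Exception:
            error_count += 1
            continue
        if parsed.get("published") is None:
            # A missing publish date counts as an error too.
            error_count += 1
            continue
        results.append(parsed)
    return results


def run(sites, parse_article, limit, out_path):
    """Scrape every site, writing a JSON checkpoint after each one.

    `limit` has no default, so the caller's value is always used.
    """
    data = {}
    total = len(sites)
    for index, (name, articles) in enumerate(sites.items(), start=1):
        print(f"NEWS SITE {index} OUT OF {total}")
        data[name] = scrape_site(articles, parse_article, limit)
        # Checkpoint: rewrite the whole file so an interruption after this
        # point still leaves all completed sites on disk.
        Path(out_path).write_text(json.dumps(data, indent=2))
    return data
```

Rewriting the full file after every site keeps the output valid JSON at each checkpoint, at the cost of repeated writes; for a large number of sites, an append-only format such as JSON Lines would be an alternative.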

Apr 22 '21 15:04 fawazshah