newspaper.article.ArticleException obscures underlying cause of exception
Issue by dviator
Fri Jul 8 01:57:25 2016
Originally opened as https://github.com/codelucas/newspaper/issues/268
Hey, love the library, but I am having a little trouble with the way that newspaper.article.ArticleException works. My use case is to call article.download() and article.parse() directly on a list of URLs that I'm feeding in from another part of my application. The basic issue is that newspaper.article.ArticleException does not differentiate between causes of failure, in my case between network timeouts and malformed pages. Quick shell test case here:
The impact on my application is that I wrap the calls to article.parse() in a retry block so that intermittent network latency can be overcome while my application runs continuously. However, when I run into malformed pages, I'd like the application to notice the incomplete response and skip right over them. Instead it retries, which causes a large and unnecessary performance hit when there are many consecutive malformed pages. I'm sure I could perform some additional checking in the wrapper code, but the cleaner solution seems to be to raise one exception when article.download() fails and a different one when article.parse() fails, or to differentiate the cause of errors some other way. In fact, when I first started using the library, it was a source of confusion that article.download() would not raise an exception when the network was disabled.
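To make the idea concrete, here is a rough sketch of the behaviour I have in mind. It does not use newspaper at all: DownloadError, ParseError and fetch_with_retry are hypothetical names made up for illustration, and the download and parse callables stand in for whatever article.download() and article.parse() would do. The point is only that a retry loop can tell the two failure modes apart once they are separate exception types.

```python
import time


# Hypothetical exception types; newspaper does not define these names.
class DownloadError(Exception):
    """Transient failure (e.g. a network timeout); worth retrying."""


class ParseError(Exception):
    """The page was fetched but is malformed; retrying will not help."""


def fetch_with_retry(url, download, parse, retries=3, delay=0.0):
    """Retry `download` on DownloadError only; let ParseError propagate.

    `download` and `parse` are caller-supplied callables (assumptions
    for this sketch, not part of any library API).
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        try:
            html = download(url)
        except DownloadError as exc:
            # Transient problem: remember it, wait, and try again.
            last_error = exc
            time.sleep(delay)
            continue
        # A ParseError raised here is not caught, so it is never retried.
        return parse(html)
    # Every attempt failed to download; surface the last failure.
    raise last_error
```

With something like this, a run of malformed pages costs one attempt each, while a flaky connection still gets the full number of retries.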
I would be happy to work on contributing a solution if the above seems reasonable to those of you who have been maintaining this excellent code. In any event, I would love to hear your thoughts.
Comment by silviaegt
Fri Jul 13 22:29:49 2018
Did you get an answer on this @maevyn11? In my case it was a "failed with 404 Client Error: Not Found for url" problem.
I tried to avoid this with
try:
except Exception:
    pass
But it didn't work....
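For what it's worth, if the error only reaches you as a message like the one quoted above, one possible workaround is to look at the HTTP status embedded in the text and decide from that whether a retry makes sense. This is just a sketch: is_retryable is a hypothetical helper, and it assumes the message contains the "404 Client Error" / "503 Server Error" wording seen in the quoted text. Anything without a recognisable status is treated as possibly transient, which is also an assumption.

```python
import re

# Matches the status in messages such as
# "... failed with 404 Client Error: Not Found for url ..."
_STATUS_RE = re.compile(r"\b([45]\d{2}) (?:Client|Server) Error\b")


def is_retryable(message):
    """Guess from an error message whether retrying could help.

    Hypothetical helper for illustration; it is not part of newspaper.
    """
    match = _STATUS_RE.search(message)
    if match is None:
        # No HTTP status found: assume a transient network problem.
        return True
    status = int(match.group(1))
    # A 4xx (e.g. 404) will not fix itself; a 5xx might.
    return status >= 500
```

A wrapper could then skip the URL when is_retryable(str(exc)) is false and only retry otherwise.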