newspaper4k
Invalid filtering
Issue by ZeeshanSultan
Mon May 28 13:26:47 2018
Originally opened as https://github.com/codelucas/newspaper/issues/572
https://github.com/codelucas/newspaper/blob/c521057b20bb3d4cd27d8b0ee6efd64d1d3a488f/newspaper/urls.py#L239
The validator first applies blacklist-based filters to reject bad URLs, then whitelist-based filters to accept valid ones, but its default response is False. The default should be True: a URL that reaches that point has already passed every blacklist filter, and the whitelist filters are too narrow to catch everything else because they rely on a very limited set of keywords.
Here's an example site that doesn't get detected: http://jewishnews.net.au
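
To make the behaviour concrete, here is a minimal sketch of the blacklist-then-whitelist pattern described above. It is not the code from `newspaper/urls.py`: the function name `looks_like_article`, the `default` parameter, and both keyword sets are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical keyword sets, for illustration only (not the library's lists).
BAD_SEGMENTS = {"careers", "contact", "about", "login", "subscribe"}
GOOD_SEGMENTS = {"story", "article", "news", "feature"}


def looks_like_article(url: str, default: bool = False) -> bool:
    """Blacklist filters first, then whitelist filters, then a fallback."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False

    segments = [s.lower() for s in parsed.path.split("/") if s]

    # Blacklist: reject as soon as a known-bad path segment appears.
    if any(s in BAD_SEGMENTS for s in segments):
        return False

    # Whitelist: accept as soon as a known-good path segment appears.
    if any(s in GOOD_SEGMENTS for s in segments):
        return True

    # Nothing matched either list. This fallback is what the issue is about.
    return default


print(looks_like_article("http://jewishnews.net.au"))                # default=False
print(looks_like_article("http://jewishnews.net.au", default=True))  # default=True
```

In this sketch, a URL whose path matches neither list is rejected when `default` is False and accepted when it is True, while blacklisted URLs are rejected either way.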