newspaper4k
Iterating over a news source's articles produces duplicates if the subdomain is omitted.
Issue by awiebe
Sun Jun 17 02:00:15 2018
Originally opened as https://github.com/codelucas/newspaper/issues/580
I was testing news sources and found that the article below was emitted twice, even though newspaper should be memoizing articles.
The problem seems to be that memoization compares the raw URL and doesn't account for the second URL missing the www subdomain.
https://www.theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/
Trump’s Right-Hand Troll
['Mckay Coppins']
http://theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/
Trump’s Right-Hand Troll
['Mckay Coppins']
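import newspaper


def dump_article(a):
    """Download and parse one article; return True on success."""
    try:
        a.download()
        a.parse()
        print(a.title)
        print(a.authors)
        # print(a.text)
        return True
    except Exception:
        # Treat any download/parse failure as "skip this article"
        return False


MAX_PULL = 10  # stop after this many successfully parsed articles per source

for source in newspaper.popular_urls():
    print(source)
    pull = 0
    s = newspaper.build(source, lang='en')
    for a in s.articles:
        print(a.url)
        if dump_article(a):
            pull += 1
        if pull >= MAX_PULL:
            break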
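The two URLs in the output above differ only in scheme (https vs. http) and in the leading www. A possible workaround on the caller's side is to deduplicate on a normalized key before downloading. This is a minimal sketch, not newspaper's own memoization logic: `dedup_key` and `unique_by_url` are hypothetical helper names, and treating `www.example.com` and `example.com` as the same site is an assumption that won't hold for every domain.

```python
from urllib.parse import urlsplit


def dedup_key(url):
    """Hypothetical helper: build a comparison key that ignores the
    scheme, a leading 'www.', and a trailing slash on the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: the www and bare hosts serve the same content
    path = parts.path.rstrip("/") or "/"
    return (host, path, parts.query)


def unique_by_url(urls):
    """Yield each URL whose normalized key has not been seen yet."""
    seen = set()
    for url in urls:
        key = dedup_key(url)
        if key not in seen:
            seen.add(key)
            yield url


urls = [
    "https://www.theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/",
    "http://theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/",
]
for url in unique_by_url(urls):
    print(url)
```

In the loop above, the same check could be applied to `a.url` before calling `dump_article(a)`, keeping one `seen` set per source.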
Comment by minhdanh
Sat Mar 6 14:49:49 2021
Having the same problem here in 2021. @awiebe, have you by any chance found a solution?