newspaper4k
Count of href URLs in the news HTML source differs from the count of URLs scraped by Newspaper
Issue by harishaaram
Sat Oct 7 15:03:54 2017
Originally opened as https://github.com/codelucas/newspaper/issues/455
Hi,
I am getting around a thousand articles, while the homepage of the news website shows only 300 (href URL links).
How do I get only the articles (or text, summary) relevant to those links?
Here is my code:
```python
import newspaper

news_content = newspaper.build(
    "https://www.reuters.com/",
    memoize_articles=False,
    language="en",
    fetch_images=False,
    number_threads=1,
)
print(news_content.size())

# backupfile and datasetfile are file objects opened earlier (not shown here).
# enumerate() keeps the counter and the article in step, so the title and URL
# written below always belong to the same article.
for i, article in enumerate(news_content.articles, start=1):  # url links
    try:
        article.download()  # now download and parse each article
        article.parse()
        article.nlp()
        backupfile.write("\n" + "--------------------------------------------------------------" + "\n")
        datasetfile.write("\n" + "----Title -> No. " + str(i) + "\n")
        datasetfile.write(article.title)
        # print(article.title)
        datasetfile.write("\n" + "----URL-> No. " + str(i) + "\n")
        datasetfile.write(article.url)
    except Exception:
        # Skip articles that fail to download or parse.
        pass
```
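To make the question concrete, here is a rough sketch of the filtering I have in mind: collect the href links from the homepage HTML and keep only the scraped URLs that appear among them. It uses only the standard library. The names `HrefCollector`, `normalize`, `homepage_links`, and `filter_urls` are hypothetical helpers, not part of newspaper, and the normalization (dropping the fragment and a trailing slash) is an assumption that may not match how a given site writes its links.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


def normalize(url):
    """Drop the #fragment and any trailing slash (assumed normalization)."""
    return urldefrag(url).url.rstrip("/")


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the homepage URL.
                self.links.add(normalize(urljoin(self.base_url, value)))


def homepage_links(html, base_url):
    """Return the set of absolute, normalized links found in the HTML."""
    collector = HrefCollector(base_url)
    collector.feed(html)
    return collector.links


def filter_urls(candidate_urls, allowed_links):
    """Keep only the candidate URLs that also appear among the homepage links."""
    return [u for u in candidate_urls if normalize(u) in allowed_links]
```

With the code above, `candidate_urls` could be `[a.url for a in news_content.articles]` and `html` the homepage source. That wiring is an assumption on my part, and I don't know whether the library offers a built-in way to do this.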