newspaper4k

Count of href URLs in the news HTML source differs from the number of URLs scraped by Newspaper

Open AndyTheFactory opened this issue 2 years ago • 0 comments

Issue by harishaaram Sat Oct 7 15:03:54 2017 Originally opened as https://github.com/codelucas/newspaper/issues/455


Hi,

I am getting around a thousand articles, while the homepage of the news website shows only 300 links (href URLs).

How do I get only the articles (or their text and summary) relevant to those links?

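Here is my code:

```python
import newspaper

news_content = newspaper.build(
    "https://www.reuters.com/",
    memoize_articles=False,
    language="en",
    fetch_images=False,
    number_threads=1,
)
print(news_content.size())

# backupfile and datasetfile were not defined in the original snippet;
# the file names below are placeholders.
with open("backup.txt", "w", encoding="utf-8") as backupfile, \
        open("dataset.txt", "w", encoding="utf-8") as datasetfile:
    # enumerate() replaces the manual counter, which was never initialised
    # and skipped the first article by incrementing before indexing.
    for i, article in enumerate(news_content.articles, start=1):  # URL links
        try:
            article.download()  # now download and parse each article
            article.parse()
            article.nlp()

            backupfile.write("\n" + "--------------------------------------------------------------" + "\n")

            datasetfile.write("\n" + "----Title -> No. " + str(i) + "\n")
            datasetfile.write(article.title)

            # print(article.title)
            datasetfile.write("\n" + "----URL-> No. " + str(i) + "\n")
            datasetfile.write(article.url)  # write the article URL to the dataset file
        except Exception:
            continue  # skip articles that fail to download or parse
```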
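
One possible way to narrow the results to the homepage links is sketched below. It rests on an assumption rather than a confirmed diagnosis: `newspaper.build` appears to gather article URLs from category pages and feeds as well as from the homepage, which would explain the larger count. The sketch collects every `href` from the homepage HTML with the standard library and keeps only the article URLs that match exactly. `HrefCollector`, `homepage_links` and `keep_homepage_articles` are hypothetical helper names, not part of Newspaper.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class HrefCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the homepage URL.
                absolute = urljoin(self.base_url, value)
                # Drop "#fragment" so the same page is not counted twice.
                self.urls.add(urldefrag(absolute).url)


def homepage_links(html, base_url):
    """Return the set of absolute link targets found in the homepage HTML."""
    collector = HrefCollector(base_url)
    collector.feed(html)
    return collector.urls


def keep_homepage_articles(article_urls, html, base_url):
    """Keep only the article URLs that also appear as homepage links."""
    links = homepage_links(html, base_url)
    return [url for url in article_urls if urldefrag(url).url in links]
```

With the code above, this might be called as `keep_homepage_articles([a.url for a in news_content.articles], news_content.html, "https://www.reuters.com/")`, assuming `news_content.html` holds the downloaded homepage HTML. Matching is exact, so URLs that differ only by a trailing slash or query string would need extra normalisation.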

AndyTheFactory, Oct 24 '23 10:10