
[site] bbc.co.uk

Open kdenaeem opened this issue 11 months ago • 4 comments

I am trying to scrape bbc.co.uk/news/world; I'm hoping to scrape 20–30 of the articles on the front page of this site.

import newspaper

bbc_papers = newspaper.build("https://www.bbc.co.uk/news/world", number_threads=3)

article_urls = [article.url for article in bbc_papers.articles]
print(article_urls[10])

This always raises an IndexError or returns an empty list ([]). I'm guessing this is because the request was blocked. Does anyone know why it won't return article_urls?
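As a quick sanity check before indexing, you can guard against an empty build result so you get a diagnostic instead of an IndexError. A minimal sketch (the helper name is my own, not part of newspaper4k):

```python
def nth_url_or_report(urls, n=10):
    """Return the nth URL if the list is long enough, else a diagnostic string."""
    if len(urls) > n:
        return urls[n]
    return f"only {len(urls)} article URLs found; the request may have been blocked or cached"

# An empty build result now yields a message instead of crashing
print(nth_url_or_report([]))
```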

kdenaeem avatar Feb 05 '25 18:02 kdenaeem

Related to https://github.com/AndyTheFactory/newspaper4k/issues/625

AndyTheFactory avatar Mar 09 '25 20:03 AndyTheFactory

I think my relative-links fix addresses part of this as well, at least the part about finding enough articles.

BRNMan avatar Apr 01 '25 06:04 BRNMan

Also, try something like the code below. Feeds are generally a more reliable source of articles than categories. And if you only want the articles from one category, you can set it the same way the newspaper.build method does with the only_homepage flag.

import newspaper
from newspaper.source import Category

bbc = newspaper.Source('https://www.bbc.co.uk/news/world')
bbc.download()
bbc.parse()
# Treat the world-news page itself as the only category
bbc.categories = [Category(url=bbc.url, html=bbc.html, doc=bbc.doc)]
bbc.set_feeds()
bbc.download_feeds()  # multi-threaded download
bbc.parse_categories()
bbc.generate_articles(limit=40)

# articles = list(bbc.download_articles())
print(bbc.article_urls())
print(bbc.feed_urls())
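If the URL list still mixes section pages in with stories, one option is to post-filter by path. A minimal sketch, assuming BBC story URLs end in a slug with a numeric id (the helper and the regex are my own, not part of newspaper4k):

```python
import re

# Matches e.g. https://www.bbc.co.uk/news/world-europe-12345678
ARTICLE_RE = re.compile(r"^https?://www\.bbc\.(co\.uk|com)/news/[a-z-]+-\d+$")

def filter_bbc_articles(urls):
    """Keep only URLs that look like individual BBC news stories."""
    return [u for u in urls if ARTICLE_RE.match(u)]

urls = [
    "https://www.bbc.co.uk/news/world",                  # section page, dropped
    "https://www.bbc.co.uk/news/world-europe-12345678",  # story, kept
]
print(filter_bbc_articles(urls))
```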

BRNMan avatar Apr 03 '25 15:04 BRNMan

One more thing: try disabling memoization. If you've scraped the site once before, memoization causes previously seen articles to be skipped instead of downloaded again.

import newspaper

config = newspaper.Config()
config.disable_category_cache = True
config.memorize_articles = False

bbc = newspaper.build("https://www.bbc.com/news/world", number_threads=3, config=config)

print(bbc.article_urls())
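To see why a second run can come back empty, here is a toy model of what memorize_articles does; this is illustrative only, not newspaper4k's actual implementation:

```python
def memoized_filter(urls, seen):
    """Return only URLs not seen on previous runs; record everything passed in."""
    fresh = [u for u in urls if u not in seen]
    seen.update(urls)
    return fresh

seen = set()
first = memoized_filter(["u1", "u2"], seen)   # both URLs are new
second = memoized_filter(["u1", "u2"], seen)  # everything filtered out: []
print(first, second)
```

With memorize_articles = False, the equivalent of the second call would return the full list every time.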

BRNMan avatar Apr 03 '25 16:04 BRNMan