
[site] bbc.co.uk

Open kdenaeem opened this issue 11 months ago • 4 comments

I am trying to scrape bbc.co.uk/news/world; I'm hoping to scrape 20–30 of the articles on the front page of this site.

import newspaper

bbc_papers = newspaper.build("https://www.bbc.co.uk/news/world", number_threads=3)

article_urls = [article.url for article in bbc_papers.articles]
print(article_urls[10])

This always raises an IndexError or returns an empty list ([]). I'm guessing this is because the request was blocked. Does anyone know why it won't return article_urls?
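As a quick sanity check before indexing, you can guard against an empty build result so you get a diagnostic instead of an IndexError. A minimal sketch (the helper name is my own, not part of newspaper4k):

```python
def nth_url_or_report(urls, n=10):
    """Return the nth URL if the list is long enough, else a diagnostic string."""
    if len(urls) > n:
        return urls[n]
    return f"only {len(urls)} article URLs found; the request may have been blocked or cached"

# An empty build result now yields a message instead of crashing
print(nth_url_or_report([]))
```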

kdenaeem avatar Feb 05 '25 18:02 kdenaeem

Related to https://github.com/AndyTheFactory/newspaper4k/issues/625

AndyTheFactory avatar Mar 09 '25 20:03 AndyTheFactory

I think my relative-links fix addresses part of this as well, at least the part about finding enough articles.

BRNMan avatar Apr 01 '25 06:04 BRNMan

Also, try something like the code below. Feeds are generally a more reliable source of articles than categories. And if you only want the articles from one category, you can set it the same way the newspaper.build method does with the only_homepage flag.

import newspaper
from newspaper.source import Category

bbc = newspaper.Source('https://www.bbc.co.uk/news/world')
bbc.download()
bbc.parse()
# Treat the world-news page itself as the only category
bbc.categories = [Category(url=bbc.url, html=bbc.html, doc=bbc.doc)]
bbc.set_feeds()
bbc.download_feeds()  # multi-threaded download
bbc.parse_categories()
bbc.generate_articles(limit=40)

# articles = list(bbc.download_articles())
print(bbc.article_urls())
print(bbc.feed_urls())
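If the URL list still mixes section pages in with stories, one option is to post-filter by path. A minimal sketch, assuming BBC story URLs end in a slug with a numeric id (the helper and the regex are my own, not part of newspaper4k):

```python
import re

# Matches e.g. https://www.bbc.co.uk/news/world-europe-12345678
ARTICLE_RE = re.compile(r"^https?://www\.bbc\.(co\.uk|com)/news/[a-z-]+-\d+$")

def filter_bbc_articles(urls):
    """Keep only URLs that look like individual BBC news stories."""
    return [u for u in urls if ARTICLE_RE.match(u)]

urls = [
    "https://www.bbc.co.uk/news/world",                  # section page, dropped
    "https://www.bbc.co.uk/news/world-europe-12345678",  # story, kept
]
print(filter_bbc_articles(urls))
```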

BRNMan avatar Apr 03 '25 15:04 BRNMan

One more thing: try disabling memoization. If you've scraped the site once before, memoization causes previously seen articles to be skipped instead of downloaded again.

import newspaper

config = newspaper.Config()
config.disable_category_cache = True
config.memorize_articles = False

bbc = newspaper.build("https://www.bbc.com/news/world", number_threads=3, config=config)

print(bbc.article_urls())
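To see why a second run can come back empty, here is a toy model of what memorize_articles does; this is illustrative only, not newspaper4k's actual implementation:

```python
def memoized_filter(urls, seen):
    """Return only URLs not seen on previous runs; record everything passed in."""
    fresh = [u for u in urls if u not in seen]
    seen.update(urls)
    return fresh

seen = set()
first = memoized_filter(["u1", "u2"], seen)   # both URLs are new
second = memoized_filter(["u1", "u2"], seen)  # everything filtered out: []
print(first, second)
```

With memorize_articles = False, the equivalent of the second call would return the full list every time.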

BRNMan avatar Apr 03 '25 16:04 BRNMan