[site] bbc.co.uk
I am trying to scrape bbc.co.uk/news/world, hoping to get 20-30 of the articles from the front page of the site:
import newspaper

bbc_papers = newspaper.build("https://www.bbc.co.uk/news/world", number_threads=3)
article_urls = [article.url for article in bbc_papers.articles]
print(article_urls[10])
This always fails with "list index out of range", or article_urls comes back as an empty list []. I'm guessing this is because the request was blocked. Does anyone know why it won't return article_urls?
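A minimal way to test the blocked-request guess, using plain requests rather than newspaper itself (the User-Agent string here is just an example):

import requests

# If this already returns a non-200 status or a tiny error page,
# the request itself is being blocked, not the article parsing.
resp = requests.get(
    "https://www.bbc.co.uk/news/world",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
print(resp.status_code, len(resp.text))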
Related to https://github.com/AndyTheFactory/newspaper4k/issues/625
I think my relative-links fix addresses part of this as well, at least the part about finding enough articles.
Also, try something like the code below. Feeds are generally a more reliable source of articles than categories. And if you only want the articles from one category, you can set it the same way the newspaper.build method does with the only_homepage flag.
import newspaper
from newspaper.source import Category

bbc = newspaper.Source('https://www.bbc.co.uk/news/world')
bbc.download()
bbc.parse()
# Treat the page itself as the only category, the same way build() does with only_homepage
bbc.categories = [Category(url=bbc.url, html=bbc.html, doc=bbc.doc)]
bbc.set_feeds()
bbc.download_feeds()  # multithreaded download of the discovered feeds
bbc.parse_categories()
bbc.generate_articles(limit=40)
# articles = list(bbc.download_articles())
print(bbc.article_urls())
print(bbc.feed_urls())
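Once that prints article URLs, the articles themselves still need to be fetched. A minimal follow-up sketch, assuming Source's download_articles() and parse_articles() methods from newspaper3k carry over unchanged:

# Fetch and parse the generated articles, then spot-check a few titles
bbc.download_articles()
bbc.parse_articles()
for article in bbc.articles[:5]:
    print(article.title)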
One more thing: try disabling article memoization. If you've scraped the site once before, memoization causes previously seen articles to be skipped, so build() can come back empty.
import newspaper

config = newspaper.Config()
config.disable_category_cache = True
config.memoize_articles = False  # re-download articles already seen in a previous run
bbc = newspaper.build("https://www.bbc.com/news/world", number_threads=3, config=config)
print(bbc.article_urls())
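If the empty list turns out to be a blocked request rather than caching, the same Config object can also carry a browser-like user agent. A rough sketch; browser_user_agent and request_timeout are standard Config fields, but the UA string below is only an example:

import newspaper

config = newspaper.Config()
config.browser_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # example browser UA, substitute your own
config.request_timeout = 10  # seconds; the default is lower
bbc = newspaper.build("https://www.bbc.co.uk/news/world", number_threads=3, config=config)
print(len(bbc.articles))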