Some article texts are not fully downloaded.

Open Jimchoo91 opened this issue 1 year ago • 2 comments

Hi, I have only found this on one website so far, but when I try to download the full text from an article on the BBC, it only returns a snippet.

Here is an example website:

https://www.bbc.co.uk/news/world-48810070

Any idea why? Thanks.

Jimchoo91, Aug 31 '22 15:08

While it's more than a snippet, the full text of articles from Politico doesn't get pulled either.

I believe the underlying issue is that the code used to parse these websites is so old (the last commit to the main code is 4+ years old) that it no longer handles the HTML source properly after website updates. Big-name websites change their layouts far more often than once every 4 years.

I am sure this library was great in its heyday, but it's nearly unusable now except on smaller websites that haven't changed a thing in the last 5 years. That doesn't leave many, given that even WordPress-based websites have changed quite a bit.

bstivers, Sep 17 '22 08:09

The library has a lot of limitations because the code base is old. You can still parse the BBC site text with some additional code. Here is a document that I wrote on using the library. I will update it in the coming days with the code to extract the BBC text.

johnbumgarner, Dec 30 '22 18:12
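
As a rough illustration of the "additional code" idea from the last comment, here is a minimal sketch of a fallback extractor that uses only the Python standard library. It is not the code from the document mentioned above, and it rests on an assumption that may not hold for any given site: that the body text sits in `<p>` elements nested inside an `<article>` element. The names `ArticleParagraphParser` and `extract_article_text` are made up for this example.

```python
from html.parser import HTMLParser


class ArticleParagraphParser(HTMLParser):
    """Collect the text of every <p> nested inside an <article> element.

    Assumption: the page wraps its body text in <article> ... <p> ... </p>.
    Real pages may use a different layout, so treat this as a starting point.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.article_depth = 0    # greater than 0 while inside <article>
        self.in_paragraph = False
        self.current = []         # text pieces of the paragraph being read
        self.paragraphs = []      # finished, whitespace-normalised paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth > 0:
            self.in_paragraph = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        # Inline tags such as <b> or <a> are ignored; only their text is kept.
        if self.in_paragraph:
            self.current.append(data)


def extract_article_text(html):
    """Return the article paragraphs joined by blank lines."""
    parser = ArticleParagraphParser()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


# Tiny hand-written sample, not real BBC markup.
sample = (
    "<html><body><p>Navigation</p>"
    "<article><h1>Title</h1>"
    "<p>First <b>paragraph</b>.</p><p>Second paragraph.</p>"
    "</article></body></html>"
)
print(extract_article_text(sample))
```

If the library still downloads the page but `article.text` comes back truncated, the raw markup is available as `article.html` after `article.download()`, and could be passed to a function like this one. Whether the result is complete depends entirely on the site's current markup, so check it against the page by hand.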