newspaper4k

Incorrect article text extraction (hidden text from one of the sidebar blocks is used instead of the main content)

Open AndyTheFactory opened this issue 2 years ago • 1 comment

Issue by hus787 Fri Aug 26 09:56:11 2016 Originally opened as https://github.com/codelucas/newspaper/issues/281


https://www.upmbiofore.fi/solut-kasvavat-nanosellussa/ (html: solut-kasvavat-nanosellussa.txt)
https://www.upmbiofore.fi/eu-rahoitusta-biokemikaalien-tutkimiseen/ (html: eu-rahoitusta-biokemikaalien-tutkimiseen.txt)
https://www.upmbiofore.fi/kohti-kestavaa-taloutta/ (html: kohti-kestavaa-taloutt.txt)

For all of the articles above (HTML attached for future reference), newspaper extracts the wrong text. After download and parse, the article's .text comes from a sidebar block that is not even visible ("display: none") rather than from the main content. The summary produced after running nlp has the same problem.

AndyTheFactory, Oct 24 '23 10:10

Comment by ppawiggers Fri Oct 26 08:44:55 2018


Old issue, but still valid. I am seeing the same problem; newspaper3k should not treat hidden (display: none) elements as candidates for article content.

I'll see if I can find the time to submit a pull request with this fix.
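
To make the suggested filtering concrete, here is a minimal standard-library sketch of the idea. It is not newspaper's actual implementation and not the proposed pull request. It assumes the hidden block is marked with an inline `style="display: none"` attribute; elements hidden through a CSS class or an external stylesheet would not be caught. The names `VisibleTextExtractor` and `visible_text` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Assumption: hiding is done with an inline style, e.g. style="display: none".
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collect text, skipping inline-hidden subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        # Once inside a hidden element, every nested element deepens the count.
        if self.hidden_depth or HIDDEN_STYLE.search(style):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside any hidden subtree.
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of `html` with inline-hidden elements left out."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The depth counter relies on hidden elements being properly closed, so badly nested markup may need a more tolerant parser. A real fix would more likely drop these nodes from the parsed tree before the content candidates are scored.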

AndyTheFactory, Oct 24 '23 10:10