Incorrect article text extraction (using hidden text from one of the blocks in the sidebar instead of the main content)
Issue by hus787
Fri Aug 26 09:56:11 2016
Originally opened as https://github.com/codelucas/newspaper/issues/281
https://www.upmbiofore.fi/solut-kasvavat-nanosellussa/ (html: solut-kasvavat-nanosellussa.txt)
https://www.upmbiofore.fi/eu-rahoitusta-biokemikaalien-tutkimiseen/ (html: eu-rahoitusta-biokemikaalien-tutkimiseen.txt)
https://www.upmbiofore.fi/kohti-kestavaa-taloutta/ (html: kohti-kestavaa-taloutt.txt)
In all of the articles above (HTML included for future reference), newspaper extracts the wrong text. After download and parse, the article's .text comes from a sidebar block that is not even visible ("display: none"), and the summary produced after running nlp is wrong in the same way.
Comment by ppawiggers
Fri Oct 26 08:44:55 2018
Old issue, but still valid. I experience the same issue; newspaper3k should exclude hidden (display: none) elements as potential article content.
I'll see if I can find the time to submit a pull request with this fix.
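Until such a fix exists, one possible workaround is to strip hidden elements from the HTML before handing it to the extractor. The following is a minimal sketch of that idea using only the Python standard library; it is not newspaper's own code and not the proposed pull request. The names `HiddenElementStripper` and `strip_hidden` are hypothetical. It assumes the element is hidden through an inline `style` attribute containing `display: none` (rules applied through a stylesheet or a CSS class are not detected) and that the markup inside the hidden block is reasonably well nested.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Matches an inline "display: none" declaration (whitespace and case tolerant).
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class HiddenElementStripper(HTMLParser):
    """Re-emits HTML while dropping subtrees hidden by an inline style.

    Hypothetical helper: comments and doctype declarations are dropped too,
    and unclosed tags inside a hidden block can confuse the depth counter.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.hidden_depth = 0  # > 0 while inside a hidden subtree

    def _is_hidden(self, attrs):
        style = dict(attrs).get("style") or ""
        return bool(HIDDEN_STYLE.search(style))

    def handle_starttag(self, tag, attrs):
        if self.hidden_depth or self._is_hidden(attrs):
            if tag not in VOID_TAGS:
                self.hidden_depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a subtree.
        if self.hidden_depth or self._is_hidden(attrs):
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.hidden_depth:
            if tag not in VOID_TAGS:
                self.hidden_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.hidden_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.hidden_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.hidden_depth:
            self.out.append(f"&#{name};")


def strip_hidden(html):
    """Return html with inline-hidden subtrees removed."""
    parser = HiddenElementStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The cleaned string could then be parsed instead of the raw page. As far as I know, newspaper3k's `Article.download()` accepts pre-fetched markup through its `input_html` argument, but treat that as an assumption and check it against the installed version.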