newspaper4k
Text Cleanup
Issue by roniemartinez
Tue Mar 26 13:46:18 2019
Originally opened as https://github.com/codelucas/newspaper/issues/692
I like how newspaper3k extracts articles and performs summarization. Great work!! 👍
Just a minor issue, though: the extracted text includes irrelevant content that we could call "unreadable". For example:
- when parsing Wikipedia articles, the text contains markers like [Edit], [1], [2], etc.
- when parsing GitHub repositories (README page), code samples are still included.
I like how goose3 performs text cleanup (though it has no summarization support). Perhaps you could take a look at their implementation.
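A minimal sketch of the Wikipedia case, assuming the cleanup runs as a regex post-processing step on the already extracted text. `strip_wiki_markers` is a hypothetical helper name, not part of newspaper or goose3, and it only covers the [Edit] and numeric reference markers mentioned above:

```python
import re

# Hypothetical helper for illustration; not an existing newspaper/goose3 API.
# Matches "[Edit]" (any letter case) and numeric reference markers like "[12]".
_WIKI_MARKERS = re.compile(r"\[(?:edit|\d+)\]", re.IGNORECASE)


def strip_wiki_markers(text: str) -> str:
    """Remove Wikipedia-style edit links and footnote markers from text."""
    cleaned = _WIKI_MARKERS.sub("", text)
    # Collapse runs of spaces or tabs left behind by removed markers.
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()


if __name__ == "__main__":
    sample = "History[Edit] Python was created in 1991.[1][2]"
    print(strip_wiki_markers(sample))
```

Other bracketed text, such as "[this]", is left alone because the pattern only matches the word "edit" or digits. The GitHub README case would likely need handling at the HTML level (for example, dropping code elements before text extraction) rather than a regex on plain text.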