Text Cleanup

Open AndyTheFactory opened this issue 2 years ago • 0 comments

Issue by roniemartinez, Tue Mar 26 13:46:18 2019. Originally opened as https://github.com/codelucas/newspaper/issues/692


I like how newspaper3k extracts articles and performs summarization. Great work!! 👍

Just a minor issue, though: the extracted text contains some irrelevant content that we could call "unreadable". For example:

  • when parsing Wikipedia articles, the text includes markers like [Edit], [1], [2], etc.
  • when parsing GitHub repositories (README page), code samples are still included in the text.
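
As a rough illustration of the first case, a post-processing step along these lines could strip the Wikipedia markers from text that has already been extracted. This is only a sketch under assumptions: `strip_wiki_markers` is a hypothetical helper, not part of newspaper3k or goose3, and it relies only on Python's standard `re` module. It does not address the code-sample case, which would probably need to be handled at the HTML level before extraction.

```python
import re

# Hypothetical helper for illustration; not an API of newspaper3k or goose3.
# Matches bracketed numeric citations such as [1] or [23], and [Edit]/[edit] links.
_WIKI_MARKERS = re.compile(r"\[(?:\d+|edit)\]", re.IGNORECASE)


def strip_wiki_markers(text: str) -> str:
    """Remove Wikipedia-style [Edit] and [n] markers from extracted text."""
    cleaned = _WIKI_MARKERS.sub("", text)
    # Collapse runs of spaces or tabs left behind by the removal; keep newlines.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)
    return cleaned.strip()


if __name__ == "__main__":
    sample = "History[Edit]\nPython was created in 1991.[1][2] It is popular."
    print(strip_wiki_markers(sample))
```

The pattern is deliberately narrow (digits or the word "edit" inside square brackets), so ordinary bracketed text such as "[sic]" is left alone.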

I like how goose3 performs text cleanup (though it has no summarization support). Perhaps you could take a look at their implementation.

AndyTheFactory, Oct 24 '23 14:10