Replace Boilerpipe functionality with more modern Readability clone
Problem Class ExtractBoilerpipeText doesn't fully do what it purports to; i.e., it sometimes leaves portions (often large ones) of, e.g., header and comment-thread text in the output. Boilerpipe was last updated 10 years ago.
Preferred solution Boilerplate removal more consistent with Readability.js, based on a more modern Java/Scala library.
Alternatives considered There are two libraries available:
- Readability4J: more recent, more active
- readability4s: pure Scala, but moribund
Additional context Advice on selecting a library would be much appreciated; I suspect the main consideration will be maintainability.
I'm happy to write up a test and PR for this issue once there's a decision. I can provide failing examples if that's helpful.
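As a rough idea of what such a test could check, here is a minimal, library-agnostic sketch. Everything in it is hypothetical: the class name `BoilerplateLeakCheck`, the marker strings, and the sample HTML are made up for illustration, and `stripTags` is only a naive stand-in extractor, not Boilerpipe or Readability4J. The real test would pass in whichever extractor gets chosen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class BoilerplateLeakCheck {

    /** Returns the boilerplate markers that still appear in the extractor's output. */
    public static List<String> leakedMarkers(Function<String, String> extractor,
                                             String html,
                                             List<String> markers) {
        String text = extractor.apply(html);
        List<String> leaked = new ArrayList<>();
        for (String marker : markers) {
            if (text.contains(marker)) {
                leaked.add(marker);
            }
        }
        return leaked;
    }

    /** Naive stand-in extractor (assumption): strips tags only, so all boilerplate leaks through. */
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Made-up page with a header, a main article, and a comment thread.
        String html = "<html><body><header>SITE-HEADER</header>"
                + "<article>Main story text.</article>"
                + "<div class=\"comments\">COMMENT-THREAD</div></body></html>";

        List<String> leaked = leakedMarkers(
                BoilerplateLeakCheck::stripTags,
                html,
                List.of("SITE-HEADER", "COMMENT-THREAD"));

        // A good extractor should produce an empty list here; the stand-in does not.
        System.out.println("Leaked markers: " + leaked);
    }
}
```

Taking the extractor as a `Function<String, String>` keeps the check independent of the library decision, so the same failing examples could be run against the current implementation and each candidate replacement.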