
Replace Boilerpipe functionality with more modern Readability clone

Open • mjsuhonos opened this issue 6 months ago • 1 comment

Problem: The ExtractBoilerpipeText class doesn't fully do what it purports to do; i.e., it sometimes leaves portions (often large ones) of boilerplate such as header and comment-thread text in the output. Boilerpipe was last updated 10 years ago.

Preferred solution: Boilerplate removal that is more consistent with Readability.js, based on a more modern Java/Scala library.

Alternatives considered: There are two libraries available:

Additional context: Advice on selecting a library would be much appreciated; I suspect the main consideration will be maintainability.

I'm happy to write up a test and PR for this issue once there's a decision. I can provide failing examples if that's helpful.

mjsuhonos, Jul 06 '25 15:07