python-readability

Feature: Please add plain text output functionality

Open yevgenpapernyk opened this issue 5 years ago • 6 comments

Like .summary(), but returning plain text instead of HTML.

E.g. as a new method or as an argument for the .summary() method.

That would be very useful for Natural Language Processing.

yevgenpapernyk • Jul 24 '20 14:07

I highly recommend using the html2text library on the .summary() output for that.

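    from html2text import HTML2Text

    def html_to_text(html):
        # Render HTML as plain text, dropping link and emphasis markup.
        converter = HTML2Text()
        converter.ignore_links = True
        converter.ignore_emphasis = True
        converter.body_width = 0  # 0 disables line wrapping
        return converter.handle(html)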

Given that it's that easy, that different people need different rendering options, and that those options might change over time and would then have to be reflected in the library interface, I'd like to leave it as is. However, I might consider adding a simple version; for that you just need .text_content() in lxml.
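For anyone who wants the simple .text_content() route right now, here is a minimal sketch. It assumes you already have the HTML string returned by .summary(); the helper name summary_to_text and the whitespace collapsing are my own illustrative choices, not part of the library.

```python
import lxml.html


def summary_to_text(summary_html: str) -> str:
    """Return the text of an HTML string with runs of whitespace collapsed.

    Hypothetical helper for illustration only; it is not part of
    python-readability.
    """
    root = lxml.html.fromstring(summary_html)
    # text_content() concatenates all text nodes under the element.
    text = root.text_content()
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "<div><p>Hello <b>world</b>.</p>\n<p>Second paragraph.</p></div>"
    print(summary_to_text(sample))
```

Note that text_content() only joins text nodes, so adjacent block elements run together unless the markup has whitespace between them. If you need paragraph breaks or other rendering options, html2text is probably the better fit.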

buriy • Jul 24 '20 15:07

Shameless plug: trafilatura builds upon readability-lxml and can convert the output to TXT, XML, CSV and JSON.

adbar • Jul 29 '20 17:07

> However, I might consider adding a simple version, for that you need just .text_content() in lxml.

So I'll leave the issue open until you decide whether you want to add it, right?

yevgenpapernyk • Aug 24 '20 09:08

Is there a plan to support textContent, like the JS module has (https://github.com/mozilla/readability#parse)?

Yes, if many people want an easy way to get text output, I'll add it.

buriy • Aug 17 '21 04:08

Could you please add support for getting plain text content?

Seshu77 • Aug 20 '21 05:08