
Option to toggle the usage of Readability?

Open andreekeberg opened this issue 10 months ago • 2 comments

As I was experimenting with your API I noticed that it was a bit "too aggressive" on some pages, removing sections that I would want to keep in the final Markdown.

So I looked through the project code and also set up an isolated test that used only turndown directly, and I finally found that the "culprit" was @mozilla/readability.

While this seems to do a great job in most cases at removing "irrelevant" content before it's converted to Markdown, I can definitely see how it might be a bit too greedy/aggressive in its cleanup strategy (i.e. not only in my specific case). I couldn't find any combination of config options for Readability that kept the specific "hero" section on the page I was testing with, so I wanted to suggest that you add the ability to simply enable/disable Readability completely.

As I'm not hosting the code locally or on my own server, but just using your public API, the ideal scenario would be for this toggle to exist as e.g. an extra parameter or an alternative API endpoint.

Of course turndown should still be configured to remove things like <script> and <style> when not using Readability (if you don't already explicitly do this), but other than that I really think this alternative parsing option could be a very valuable addition!
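To make the <script>/<style> point concrete, here is a minimal sketch of that kind of pre-cleaning step. It is written in Python with the standard library's html.parser rather than turndown, purely as an illustration; the class and function names are made up for this example and are not part of the project.

```python
from html.parser import HTMLParser


class ScriptStyleStripper(HTMLParser):
    """Re-emits HTML while dropping <script> and <style> elements (hypothetical helper)."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # Keep character references as written so the output stays valid HTML.
        super().__init__(convert_charrefs=False)
        self._skip_depth = 0
        self._out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self._out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged.
        if self._skip_depth == 0 and tag not in self.SKIPPED:
            self._out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self._out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._out.append(data)

    def handle_entityref(self, name):
        if self._skip_depth == 0:
            self._out.append(f"&{name};")

    def handle_charref(self, name):
        if self._skip_depth == 0:
            self._out.append(f"&#{name};")

    def result(self):
        return "".join(self._out)


def strip_script_and_style(html):
    """Return the HTML with script and style elements (and comments) removed."""
    parser = ScriptStyleStripper()
    parser.feed(html)
    parser.close()
    return parser.result()
```

Everything else, including a "hero" section, passes through untouched, which is the behavior being asked for when Readability is switched off.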

andreekeberg avatar Apr 16 '24 14:04 andreekeberg

thanks for digging in ❤️

we are also exploring different combinations rn, definitely something we can improve

hanxiao avatar Apr 16 '24 18:04 hanxiao

About

Joelokon avatar Apr 17 '24 10:04 Joelokon

Yes, it now has this option. You can use request headers to control the fetching behavior. Read more here: https://github.com/jina-ai/reader?tab=readme-ov-file#using-request-headers
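As a rough illustration of the header-based approach, the sketch below builds (but does not send) a request to the public API using only the Python standard library. The base URL https://r.jina.ai/ and the x-respond-with: markdown header are assumptions based on a reading of the linked README, not something stated in this thread, so please check them against the README before relying on them. The function name is hypothetical.

```python
from urllib.request import Request

# Assumption: public Reader endpoint; the target URL is appended to it.
READER_BASE_URL = "https://r.jina.ai/"


def build_reader_request(target_url, bypass_readability=True):
    """Build a Reader API request, optionally asking it to skip Readability."""
    request = Request(READER_BASE_URL + target_url, method="GET")
    if bypass_readability:
        # Assumption: this header/value asks for Markdown without
        # Readability filtering; verify the name in the README.
        request.add_header("x-respond-with", "markdown")
    return request


if __name__ == "__main__":
    req = build_reader_request("https://example.com")
    print(req.full_url)
    print(req.header_items())
    # To actually fetch: urllib.request.urlopen(req).read().decode("utf-8")
```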

hanxiao avatar Apr 24 '24 15:04 hanxiao