Question regarding Markdown structure in Jina Reader API

Open medmabcf opened this issue 1 year ago • 1 comment

Hi,

I’m trying to understand the specific markdown structure used by the Jina Reader API when converting HTML to markdown. For instance, I’ve observed the following mappings:

<h1> tags are mapped to ==========
<h2> tags are mapped to ------

Is this the standard markdown structure followed by the Jina Reader API? Additionally, I’ve noticed that the output can sometimes vary. Is this due to the use of a heuristic method or some other factor?

Thanks!

Nov 06 '24 14:11 medmabcf

We use turndown for the HTML to Markdown transformation. Whether h1/h2 are rendered as ATX headings (# / ##) or as setext headings (underlined with === / ---) is controlled by turndown's headingStyle option. We have not customized this option, so the turndown default (setext) applies.
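
For anyone who needs a single heading style on the client side, here is a minimal sketch that rewrites underlined (setext) headings as `#` / `##` (ATX) headings. The function name `setext_to_atx` is hypothetical and not part of Reader or turndown, and the sketch only handles single-line heading text while skipping fenced code blocks.

```python
import re

# A setext underline is a line made only of '=' (h1) or '-' (h2).
SETEXT_UNDERLINE = re.compile(r"^ {0,3}(=+|-+)\s*$")
FENCES = ("```", "~~~")


def setext_to_atx(markdown: str) -> str:
    """Rewrite setext headings as ATX headings (hypothetical helper)."""
    out: list[str] = []
    in_fence = False
    for line in markdown.split("\n"):
        # Track fenced code blocks so their contents are left untouched.
        if line.lstrip().startswith(FENCES):
            in_fence = not in_fence
            out.append(line)
            continue
        match = None if in_fence else SETEXT_UNDERLINE.match(line)
        prev = out[-1] if out else ""
        prev_is_text = bool(prev.strip()) and not prev.lstrip().startswith(("#",) + FENCES)
        if match and prev_is_text:
            # The previous line is the heading text; replace it in place.
            level = "#" if match.group(1).startswith("=") else "##"
            out[-1] = f"{level} {prev.strip()}"
        else:
            # Blank line before '---' means a thematic break, so keep it.
            out.append(line)
    return "\n".join(out)
```

Running this over the Reader output should give the same heading style no matter which conversion path produced it; a `---` preceded by a blank line is kept as a horizontal rule.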

The default output sometimes changes because Reader automatically decides whether to use readability for some level of smart trimming. If readability apparently would not work for the page, we fall back to a rule-based approach known as markdown.

If you prefer the markdown format, you can send the request header x-respond-with: markdown or x-return-format: markdown to stabilize the return format.
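
As an illustration, a request that pins the format might be built like this in Python. This is a sketch under assumptions: the endpoint `https://r.jina.ai/` is not stated in this thread, and `build_reader_request` and `fetch_markdown` are hypothetical helper names.

```python
import urllib.request

# Assumption: the public Reader endpoint; it is not named in this thread.
READER_ENDPOINT = "https://r.jina.ai/"


def build_reader_request(target_url: str) -> urllib.request.Request:
    """Build a Reader request that asks for the markdown return format."""
    return urllib.request.Request(
        READER_ENDPOINT + target_url,
        headers={"X-Return-Format": "markdown"},
    )


def fetch_markdown(target_url: str, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_reader_request(target_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_markdown("https://example.com")` performs the network request; `build_reader_request` alone only constructs it, which makes the header easy to inspect.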

Nov 12 '24 06:11 nomagick