
Detect KaTeX math in HTML input and extract only LaTeX source

Open napaalm opened this issue 7 months ago • 0 comments

Right now, when converting an HTML page containing KaTeX math to a LaTeX file, pandoc emits a lot of garbage code alongside each formula.

Here's an example converting the page https://katex.org/ to a LaTeX document:

chromium-browser --headless=new --disable-gpu --dump-dom https://katex.org/ > katex.html
pandoc -s katex.html -o katex.tex

In katex.html there's the following formula:

<div class="example tex" data-expr="\displaystyle \frac{1}{\Bigl(\sqrt{\phi \sqrt{5}}-\phi\Bigr) e^{\frac25 \pi}} = 1+\frac{e^{-2\pi}} {1+\frac{e^{-4\pi}} {1+\frac{e^{-6\pi}} {1+\frac{e^{-8\pi}} {1+\cdots} } } }"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><!-- [...] --><annotation encoding="application/x-tex">\displaystyle \frac{1}{\Bigl(\sqrt{\phi \sqrt{5}}-\phi\Bigr) e^{\frac25 \pi}} = 1+\frac{e^{-2\pi}} {1+\frac{e^{-4\pi}} {1+\frac{e^{-6\pi}} {1+\frac{e^{-8\pi}} {1+\cdots} } } }</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><!-- [...] -->

And in katex.tex pandoc renders it as {{\(\frac{1}{\left( \sqrt{\phi\sqrt{5}} - \phi \right)e^{\frac{2}{5}\pi}} = 1 + \frac{e^{- 2\pi}}{1 + \frac{e^{- 4\pi}}{1 + \frac{e^{- 6\pi}}{1 + \frac{e^{- 8\pi}}{1 + \cdots}}}}\)}{{{}{{}{{{{{{}{{{(}}{{{{{{}{{ϕ}{{{{{{}{{5}}}{{}{}}}{\hspace{0pt}}}{{{}}}}}}}{{}{}}}{\hspace{0pt}}}{{{}}}}}{}{−}{}{ϕ}{{)}}{{e}{{{{{{}{{{{}{{{{{{}{{{5}}}}{{}{}}{{}{{{2}}}}}{\hspace{0pt}}}{{{}}}}}{}}{π}}}}}}}}}}}{{}{}}{{}{{1}}}}{\hspace{0pt}}}{{{}}}}}{}}{}{=}{}}{{}{1}{}{+}{}}{{}{{}{{{{{{}{{1}{}{+}{}{{}{{{{{{}{{{1}{+}{{}{{{{{{}{{{1}{+}{{}{{{{{{}{{1}{+}{⋯}}}{{}{}}{{}{{{e}{{{{{{}{{−}{8}{π}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}{{}{}}{{}{{{{e}{{{{{{}{{−}{6}{π}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}{{}{}}{{}{{{{e}{{{{{{}{{{−}{4}{π}}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}{{}{}}{{}{{{e}{{{{{{}{{{−}{2}{π}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}} The same applies for all other KaTeX formulas in the document.

Pandoc should be able to extract only the original LaTeX source and ignore all the other HTML tags. This is possible according to the answer at https://github.com/KaTeX/KaTeX/discussions/3729#discussioncomment-3769724, quoted here:

Any KaTeX output contains (1) MathML and (2) the original LaTeX, so you can get both. [...] The MathML is contained in the span with class "katex-mathml" and the original LaTeX is in the <annotation> node, with encoding "application/x-tex".
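
To illustrate the extraction described in that answer, here is a minimal sketch in Python using only the standard library. It is not how pandoc's HTML reader works and is not a proposed patch; the names `KatexAnnotationExtractor` and `extract_katex_sources` are made up for this example. It assumes the KaTeX output keeps its `<annotation encoding="application/x-tex">` node, as in the katex.html snippet above.

```python
from html.parser import HTMLParser


class KatexAnnotationExtractor(HTMLParser):
    """Collect the text of every <annotation encoding="application/x-tex"> node.

    Hypothetical helper for illustration only; not part of pandoc or KaTeX.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sources = []    # LaTeX strings found so far
        self._buffer = None  # text chunks while inside a matching <annotation>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "annotation" and dict(attrs).get("encoding") == "application/x-tex":
            self._buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate and join later.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "annotation" and self._buffer is not None:
            self.sources.append("".join(self._buffer))
            self._buffer = None


def extract_katex_sources(html):
    """Return the original LaTeX source of each KaTeX formula in `html`."""
    parser = KatexAnnotationExtractor()
    parser.feed(html)
    parser.close()
    return parser.sources
```

Running `extract_katex_sources(open("katex.html", encoding="utf-8").read())` on the dumped page should then yield one LaTeX string per formula, with everything inside the `katex-html` spans ignored.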

napaalm, Jul 11 '24 13:07