Detect KaTeX math in HTML input and extract only LaTeX source
Currently, pandoc emits a large amount of spurious markup when converting an HTML page that contains KaTeX math to a LaTeX file.
Here's an example converting the page https://katex.org/ to a LaTeX document:
chromium-browser --headless=new --disable-gpu --dump-dom https://katex.org/ > katex.html
pandoc -s katex.html -o katex.tex
katex.html contains the following formula:
<div class="example tex" data-expr="\displaystyle \frac{1}{\Bigl(\sqrt{\phi \sqrt{5}}-\phi\Bigr) e^{\frac25 \pi}} = 1+\frac{e^{-2\pi}} {1+\frac{e^{-4\pi}} {1+\frac{e^{-6\pi}} {1+\frac{e^{-8\pi}} {1+\cdots} } } }"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><!-- [...] --><annotation encoding="application/x-tex">\displaystyle \frac{1}{\Bigl(\sqrt{\phi \sqrt{5}}-\phi\Bigr) e^{\frac25 \pi}} = 1+\frac{e^{-2\pi}} {1+\frac{e^{-4\pi}} {1+\frac{e^{-6\pi}} {1+\frac{e^{-8\pi}} {1+\cdots} } } }</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><!-- [...] -->
And in katex.tex
pandoc renders it as {{\(\frac{1}{\left( \sqrt{\phi\sqrt{5}} - \phi \right)e^{\frac{2}{5}\pi}} = 1 + \frac{e^{- 2\pi}}{1 + \frac{e^{- 4\pi}}{1 + \frac{e^{- 6\pi}}{1 + \frac{e^{- 8\pi}}{1 + \cdots}}}}\)}{{{}{{}{{{{{{}{{{(}}{{{{{{}{{ϕ}{{{{{{}{{5}}}{{}{}}}{\hspace{0pt}}}{{{}}}}}}}{{}{}}}{\hspace{0pt}}}{{{}}}}}{}{−}{}{ϕ}{{)}}{{e}{{{{{{}{{{{}{{{{{{}{{{5}}}}{{}{}}{{}{{{2}}}}}{\hspace{0pt}}}{{{}}}}}{}}{π}}}}}}}}}}}{{}{}}{{}{{1}}}}{\hspace{0pt}}}{{{}}}}}{}}{}{=}{}}{{}{1}{}{+}{}}{{}{{}{{{{{{}{{1}{}{+}{}{{}{{{{{{}{{{1}{+}{{}{{{{{{}{{{1}{+}{{}{{{{{{}{{1}{+}{⋯}}}{{}{}}{{}{{{e}{{{{{{}{{−}{8}{π}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}{{}{}}{{}{{{{e}{{{{{{}{{−}{6}{π}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}{{}{}}{{}{{{{e}{{{{{{}{{{−}{4}{π}}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}{{}{}}{{}{{{e}{{{{{{}{{{−}{2}{π}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}
The same applies to every other KaTeX formula in the document.
Pandoc should extract only the original LaTeX source and ignore the rest of the KaTeX markup. The source is always available, as explained in this answer (https://github.com/KaTeX/KaTeX/discussions/3729#discussioncomment-3769724):
Any KaTeX output contains (1) MathML and (2) the original LaTeX, so you can get both. [...] The MathML is contained in the span with class "katex-mathml" and the original LaTeX is in the <annotation> node, with encoding "application/x-tex".
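To illustrate what the quoted answer describes, here is a minimal sketch of pulling the original LaTeX out of KaTeX output. It is not how pandoc works internally and not a proposed patch; it only uses Python's standard `html.parser`, and the names `TexAnnotationExtractor` and `extract_katex_sources` are hypothetical, made up for this example.

```python
from html.parser import HTMLParser


class TexAnnotationExtractor(HTMLParser):
    """Collect the text of every <annotation encoding="application/x-tex"> node.

    Hypothetical helper for illustration only; not part of pandoc or KaTeX.
    """

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; back into "&"
        super().__init__(convert_charrefs=True)
        self.sources = []     # LaTeX strings found so far
        self._buffer = None   # text chunks while inside a matching node

    def handle_starttag(self, tag, attrs):
        # Only the annotation that carries the TeX encoding is of interest
        if tag == "annotation" and dict(attrs).get("encoding") == "application/x-tex":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "annotation" and self._buffer is not None:
            self.sources.append("".join(self._buffer))
            self._buffer = None


def extract_katex_sources(html):
    """Return the original LaTeX source of each KaTeX formula in `html`."""
    parser = TexAnnotationExtractor()
    parser.feed(html)
    parser.close()
    return parser.sources
```

Calling `extract_katex_sources(open("katex.html", encoding="utf-8").read())` on the file dumped above would return one string per formula, ignoring both the MathML tree and the "katex-html" span that cause the spurious output.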