html-to-markdown
🐛 Bug: Support MathJax custom tags
Describe the bug
MathJax is a JavaScript library that allows adding "custom tags" such as $...$ to HTML, which are then turned into e.g. MathML or whatever the browser supports.
Depending on the Markdown implementation, math is either not supported at all or supported directly through the same syntax. Either way, it would probably make the most sense to simply keep $...$ expressions intact and not escape the strings contained in them. A simple filter for that would certainly work, but MathJax allows configuring delimiters other than $...$ for inline math and $$...$$ for display math, e.g. in the article https://math.andrej.com/2007/09/28/seemingly-impossible-functional-programs/:
<script>
window.MathJax = {
tex: {
tags: "ams", inlineMath: [ ['$','$'], ['\\(', '\\)'] ],
displayMath: [ ['$$','$$'] ],
processEscapes: true,
},
options: {
skipHtmlTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'code']
},
loader: {
load: ['[tex]/amscd'] }
};
</script>
This would necessitate parsing JavaScript though ...
HTML Input
some formula: $\lambda$
Generated Markdown
some formula: $\\lambda$
Expected Markdown
some formula: $\lambda$
Additional context
This filter (or "unfilter") could be activated only if MathJax is detected, and disabled otherwise. Further, as mentioned earlier, a more sophisticated parsing of the HTML could be used to detect the precise math delimiters in use, or they could at least be made configurable.
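To make the "keep math intact" idea concrete, here is a minimal sketch in Go. Everything in it is an assumption for illustration: the `delimiter` type, `escapeOutsideMath`, and the simplified `escapeMarkdown` stand-in are hypothetical names and not part of html-to-markdown. The delimiter list is passed in by the caller rather than read from the MathJax config.

```go
package main

import (
	"fmt"
	"strings"
)

// delimiter is a hypothetical pair of opening/closing math markers,
// mirroring the inlineMath/displayMath entries of a MathJax config.
type delimiter struct{ open, close string }

// escapeMarkdown is a stand-in for the converter's real escaping step;
// here it only escapes backslashes, asterisks and underscores.
var escapeMarkdown = strings.NewReplacer(`\`, `\\`, `*`, `\*`, `_`, `\_`).Replace

// escapeOutsideMath escapes text but copies math spans through untouched.
// Longer delimiters must be listed first so that "$$" wins over "$".
func escapeOutsideMath(text string, delims []delimiter) string {
	var out strings.Builder
	for len(text) > 0 {
		// Find the earliest opening delimiter that also has a closing partner.
		start, end := -1, -1
		for _, d := range delims {
			i := strings.Index(text, d.open)
			if i < 0 || (start >= 0 && i >= start) {
				continue
			}
			j := strings.Index(text[i+len(d.open):], d.close)
			if j < 0 {
				continue
			}
			start, end = i, i+len(d.open)+j+len(d.close)
		}
		if start < 0 {
			break // no complete math span left
		}
		out.WriteString(escapeMarkdown(text[:start])) // plain text: escape
		out.WriteString(text[start:end])              // math span: keep verbatim
		text = text[end:]
	}
	out.WriteString(escapeMarkdown(text))
	return out.String()
}

func main() {
	delims := []delimiter{{"$$", "$$"}, {"$", "$"}, {`\(`, `\)`}}
	fmt.Println(escapeOutsideMath(`some formula: $\lambda$ and a_b`, delims))
}
```

Note that this naive matching would also treat text like "$5 and $10" as math, which is one reason to enable it only when MathJax is detected.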
I don't think getting the content between the $ signs will always work, as the math can also be server-side-rendered. Luckily, it seems like both MathJax and KaTeX (also) support the <math> tag.
So a math plugin would need to support both methods:
it will typically have a $\lambda$-expression as argument.
<mjx-assistive-mml unselectable="on" display="inline">
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mi>λ</mi>
</math>
</mjx-assistive-mml>
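For the server-side-rendered case, a rough sketch of locating `<math>` content could look like the following. It uses Go's `encoding/xml` and therefore assumes the fragment is well-formed XML, which real pages often are not; the function name `mathText` is hypothetical. It only collects the text inside `<math>` elements and is not a MathML-to-TeX conversion.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// mathText collects the character data found inside <math> elements.
// Assumption: the input is well-formed XML.
func mathText(fragment string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(fragment))
	var out strings.Builder
	depth := 0 // > 0 while inside a <math> element
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out.String(), nil
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "math" || depth > 0 {
				depth++
			}
		case xml.EndElement:
			if depth > 0 {
				depth--
			}
		case xml.CharData:
			if depth > 0 {
				// Drop the indentation between MathML tags.
				out.WriteString(strings.TrimSpace(string(t)))
			}
		}
	}
}

func main() {
	fragment := `<mjx-assistive-mml unselectable="on" display="inline">
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mi>λ</mi>
</math>
</mjx-assistive-mml>`
	text, err := mathText(fragment)
	fmt.Println(text, err)
}
```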
I won't add this plugin anytime soon, as it would be a lot of work. But this plugin should exist! Ideally maintained by someone better at math than me 😅
I'm planning a v2 of the library. Maybe I will add it then...
You could already help by collecting various snippets from websites you encounter. This should cover a variety of uses (e.g. client-side-rendering, server-side-rendering, different libraries, content that looks like math but is NOT, ...)
See this file as an example. It follows this pattern:
<!-- https://example.com/page1 -->
<div>snippet 1</div>
<hr />
<!-- https://example.com/page1 -->
<p>snippet 2</p>
<hr />
...
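A collection in this format is easy to consume from a test. As a sketch (the `snippet` type and `parseSnippets` are hypothetical names, and the splitting on a literal `<hr />` is an assumption based on the pattern above):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// snippet is one entry of a snippet collection file.
type snippet struct {
	Source string // URL from the leading HTML comment, if any
	HTML   string // the snippet markup itself
}

// sourceComment matches a leading comment such as <!-- https://example.com/page1 -->.
var sourceComment = regexp.MustCompile(`^<!--\s*(\S+)\s*-->`)

// parseSnippets splits the file on "<hr />" and pulls the source URL
// out of each part's leading comment.
func parseSnippets(file string) []snippet {
	var out []snippet
	for _, part := range strings.Split(file, "<hr />") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue // trailing separator or blank section
		}
		var s snippet
		if m := sourceComment.FindStringSubmatch(part); m != nil {
			s.Source = m[1]
			part = strings.TrimSpace(part[len(m[0]):])
		}
		s.HTML = part
		out = append(out, s)
	}
	return out
}

func main() {
	file := "<!-- https://example.com/page1 -->\n<div>snippet 1</div>\n<hr />\n"
	for _, s := range parseSnippets(file) {
		fmt.Printf("%s: %s\n", s.Source, s.HTML)
	}
}
```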
Thanks for implementing #49 so quickly!
Yeah, MathJax supports LaTeX-style input, MathML, as well as AsciiMath. Converting MathML to Markdown, however, is probably quite a lot of work. Simply "passing through" dollar signs, if so configured in the scripts, may work "well enough" for most use cases though?
I've just noticed that pandoc can do just the thing:
pandoc --from=html+tex_math_dollars+tex_math_single_backslash+tex_math_double_backslash \
--to=markdown \
--output=foo.md \
input.html
You can also choose --to=html to convert e.g. $\lambda. \dots$ to:
<span class="math inline"><em>λ</em><em>i</em>.…</span>
That works well enough for my use cases for now. Adding real $ support is quite tricky, especially when it comes to finding the closing delimiter etc.
Regardless, I will collect examples I stumble upon :)