Use Rmodepdf and LuaXML to display block HTML elements
As discussed with @michal-h21 before and after their TUG 2024 talk (slides, preprint), we may want to look into using the LuaXML library with the default transformation rules from rmodepdf to display block HTML elements.
For inline HTML elements, this does not seem applicable, because inline HTML elements produce renderers that do not necessarily represent complete HTML fragments that can be represented in a DOM:
$ docker run --rm -it witiko/markdown markdown-cli html=true <<< 'Hello <i>world</i>!'
\markdownRendererDocumentBegin
Hello \markdownRendererInlineHtmlTag{<i>}world\markdownRendererInlineHtmlTag{</i>}!\markdownRendererDocumentEnd
We can't easily change this, since the CommonMark standard allows Markdown markup within inline HTML elements.
You can wrap HTML fragments in some dummy element to prevent parsing issues. I also think that you can process the text nodes for Markdown, so it should be possible to use it here.
This is a proof of concept:
kpse.set_program_name "luatex"
local domobject = require("luaxml-domobject")
local transform = require("luaxml-transform")
local function parse(block)
  -- wrap the text in a container element, so it doesn't matter that the HTML markup can be incomplete
  -- <body> is a good candidate
  local dom = domobject.html_parse("<body>" .. block .. "</body>")
  return dom
end
local function should_expand(element)
  -- test if we should expand markdown in this element
  local element_name = element:get_element_name()
  -- do some tests with the element name
  -- ...
  -- for now, just return true
  return true
end
local function process_markdown(text)
  -- this is just an example. the real function would need to be much more complex
  text = text:gsub("%*(..-)%*", "\\textit{%1}")
  return text
end
local function expand_markdown(element)
  -- recursively loop over child elements and expand markdown in text nodes
  for i, child in ipairs(element:get_children()) do
    if child:is_element() then
      -- recurse for child elements
      expand_markdown(child)
    elseif child:is_text() and should_expand(element) then
      -- run this only on text nodes in elements that should be processed
      child._text = process_markdown(child._text)
    end
  end
end
local transformer = transform.new()
-- disable escaping of TeX commands and braces
transformer.unicodes = {
  [92] = nil,
  [123] = nil,
  [125] = nil,
}
-- actions for HTML elements
transformer:add_action("i", "\\textit{%s}")
transformer:add_action("b", "\\textbf{%s}")
local test = "Hello <i>world</i>! Another text <b>with *markdown*</b>"
local dom = parse(test)
expand_markdown(dom:root_node())
-- debugging print of the processed DOM
print(dom:serialize())
-- and now convert to TeX
print(transformer:process_dom(dom))
For this test string: "Hello <i>world</i>! Another text <b>with *markdown*</b>" it produces the following output:
<body>Hello <i>world</i>! Another text <b>with \textit{markdown}</b></body>
Hello \textit{world}! Another text \textbf{with \textit{markdown}}
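To illustrate why the comment in process_markdown says the real function would need to be more complex, here is a small plain-Lua check of the same pattern. It needs neither LuaXML nor TeX; the only change from the proof of concept is the extra pair of parentheses, which drops the match count that gsub also returns.

```lua
-- Same toy substitution as process_markdown in the proof of concept.
local function process_markdown(text)
  -- the outer parentheses discard gsub's second return value (the match count)
  return (text:gsub("%*(..-)%*", "\\textit{%1}"))
end

-- Single asterisks behave as intended.
print(process_markdown("with *markdown*"))  --> with \textit{markdown}

-- Strong emphasis is mangled: the second asterisk is captured as text.
print(process_markdown("**bold**"))         --> \textit{*bold}*
```

The second call shows that even basic CommonMark constructs such as strong emphasis are outside what a single pattern can handle.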
That's a compelling approach: First, parse the Markdown document as an HTML document, construct a DOM and only then convert the text nodes from Markdown to LaTeX. However, it seems incompatible with the current approach of CommonMark in general and the Markdown package in particular, where we first parse the whole document as a Markdown document and then we identify HTML code within the document.
An alternative would be to redefine \markdownRendererInlineHtmlTag to scan ahead for all other \markdownRendererInlineHtmlTags within the same paragraph/block, replace the intervening texts with unique identifiers, process the combined text with LuaXML, and replace the identifiers with the intervening texts. For example:
- Take Hello \markdownRendererInlineHtmlTag{<i>}world\markdownRendererInlineHtmlTag{</i>}!.
- Combine all inline tags to a single string: <i>world</i>.
- Replace the intervening texts with unique identifiers: <i>TEXT1</i>.
- Process with LuaXML: \textit{TEXT1}
- Replace the unique identifiers with intervening texts: \textit{world}.
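The steps above can be sketched in plain Lua, leaving aside the TeX-side scanning that would collect the tags and texts. This is only a sketch under stated assumptions: convert_html is a stand-in for the LuaXML conversion so that the example runs on its own, and process_fragment and its alternating tag/text input are hypothetical.

```lua
-- Stand-in for the LuaXML conversion (an assumption for this sketch):
-- it only knows <i> and <b>, so the example runs without LuaXML.
local function convert_html(html)
  html = html:gsub("<i>(.-)</i>", "\\textit{%1}")
  html = html:gsub("<b>(.-)</b>", "\\textbf{%1}")
  return html
end

-- Hypothetical helper. `parts` alternates inline tags and intervening
-- texts, e.g. { "<i>", "world", "</i>" }.
local function process_fragment(parts)
  local texts, combined = {}, {}
  for i, part in ipairs(parts) do
    if i % 2 == 1 then
      -- a tag: keep it as-is
      combined[#combined + 1] = part
    else
      -- a text: remember it and leave a unique identifier in its place
      texts[#texts + 1] = part
      combined[#combined + 1] = "TEXT" .. #texts
    end
  end
  local converted = convert_html(table.concat(combined))
  -- put the intervening texts back in place of the identifiers
  return (converted:gsub("TEXT(%d+)", function(n)
    return texts[tonumber(n)]
  end))
end

print(process_fragment({ "<i>", "world", "</i>" }))  --> \textit{world}
```

Because the intervening texts never pass through the HTML conversion, they reach the output unchanged, whatever TeX markup they contain.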
However, this seems like a lot of plumbing in TeX, which runs the risk of breaking commands that change catcodes such as \verb in hybrid mode. A better option would be to introduce a renderer \markdownRendererInlineHtmlFragment instead, which would receive a variable number of parameters:
\markdownRendererInlineHtmlFragment{2}{<i>}{world}{</i>}
However, we can't just do that without breaking compatibility, since users may already rely on \markdownRendererInlineHtmlTag. Perhaps we can have a backwards-compatibility definition of \markdownRendererInlineHtmlFragment that would expand to \markdownRendererInlineHtmlTag{<i>}world\markdownRendererInlineHtmlTag{</i>} in my example. This definition would be used when the user has redefined \markdownRendererInlineHtmlTag or \markdownRendererInlineHtmlTagPrototype. This would allow us to use LuaXML with both block and inline HTML elements.
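To make the intended equivalence concrete, here is a small Lua model of the two serializations. It is an illustration only: the function name, the legacy flag, and the reading of the leading {2} as the number of tags are assumptions, and in the package the fallback would presumably be a TeX-level definition rather than Lua code.

```lua
-- Hypothetical helper. `parts` alternates tags and intervening texts,
-- e.g. { "<i>", "world", "</i>" }. `legacy` stands for "the user has
-- redefined \markdownRendererInlineHtmlTag or its prototype".
local function render_inline_html(parts, legacy)
  local out, tag_count = {}, 0
  for i, part in ipairs(parts) do
    if i % 2 == 1 then
      tag_count = tag_count + 1
      if legacy then
        out[#out + 1] = "\\markdownRendererInlineHtmlTag{" .. part .. "}"
      else
        out[#out + 1] = "{" .. part .. "}"
      end
    else
      -- texts stay bare in legacy mode and become arguments otherwise
      out[#out + 1] = legacy and part or ("{" .. part .. "}")
    end
  end
  if legacy then
    return table.concat(out)
  end
  -- assumption: the first argument is the number of tags in the fragment
  return "\\markdownRendererInlineHtmlFragment{" .. tag_count .. "}"
    .. table.concat(out)
end

print(render_inline_html({ "<i>", "world", "</i>" }, false))
print(render_inline_html({ "<i>", "world", "</i>" }, true))
```

The first call prints the proposed fragment form and the second prints the tag-by-tag form that existing redefinitions of \markdownRendererInlineHtmlTag expect.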
Well, I don't know much about CommonMark and also how the Markdown package processes the document, so I am not sure what the best way is, so I cannot comment on this :( I can only help on the LuaXML end, I am afraid.
That's OK, few people do! I am happy to put in the work on the Markdown side of things.
A better option would be to introduce a renderer \markdownRendererInlineHtmlFragment instead, which would receive a variable number of parameters: \markdownRendererInlineHtmlFragment{2}{<i>}{world}{</i>}
However, things would still break if, instead of "world", there were some brittle content that needs to appear at the top level of a file. We can still fix this by putting "world" into a separate file.
[...] we may want to look into using the LuaXML library with the default transformation rules from rmodepdf to display block HTML elements.
Come to think of it, in CommonMark, block HTML elements do not necessarily represent complete HTML fragments that can be represented in a DOM either. Therefore, we would need to do something similar to the \markdownRendererInlineHtmlFragment command on the level of blocks.
Both changes seem significant and possibly breaking for some users. Let's do something simpler instead and only use Rmodepdf and LuaXML for raw HTML blocks and HTML file transclusion, as these are both very likely to contain complete HTML fragments.
@michal-h21 Here is a demo of the new experimental LaTeX defaults for the Markdown package for TeX that use LuaXML:
\documentclass{article}
\usepackage[experimental, raw_attribute, content_blocks]{markdown}
\begin{filecontents}[overwrite, nosearch, noheader]{example.html}
<b>foo</b> <i>bar</i>
\end{filecontents}
\begin{document}
\begin{markdown}
Raw text span: `<b>foo</b> <i>bar</i>`{=html}
Raw code block:
``` {=html}
<b>foo</b> <i>bar</i>
```
Content block:
/example.html
\end{markdown}
\end{document}
Here is the complete code that reproduces the above output with the current version of the Markdown package:
\documentclass{article}
\usepackage{markdown}
\begin{filecontents}[overwrite, nosearch, noheader]{example.html}
<b>foo</b> <i>bar</i>
\end{filecontents}
\directlua{
  function convert_html(input)
    local dom = require("luaxml-domobject").html_parse("<body>" .. input .. "</body>")
    local output = require("rmodepdf-htmltemplates"):process_dom(dom)
    return output
  end
  function print_html(input_filename)
    local input_file = assert(io.open(input_filename, "r"))
    local input = assert(input_file:read("*a"))
    assert(input_file:close())
    local output = convert_html(input)
    tex.print(output)
  end
}
\ExplSyntaxOn
\markdownSetup
  {
    raw_attribute,
    content_blocks,
    renderers = {
      inputRaw(Inline|Block) = {
        \str_case:nn
          { #2 }
          {
            { html }
              {
                \lua_now:e
                  { print_html("\lua_escape:n { #1 }") }
              }
          }
      },
      contentBlock = {
        \str_case:nn
          { #1 }
          {
            { html }
              {
                \lua_now:e
                  { print_html("\lua_escape:n { #3 }") }
              }
          }
      },
    },
  }
\ExplSyntaxOff
\begin{document}
\begin{markdown}
Raw text span: `<b>foo</b> <i>bar</i>`{=html}
Raw code block:
``` {=html}
<b>foo</b> <i>bar</i>
```
Content block:
/example.html
\end{markdown}
\end{document}
The only issue I currently have is with Rmodepdf:
- It's not on CTAN, which means that users would need to manually download rmodepdf-htmltemplates.lua.
- The file rmodepdf-htmltemplates.lua seems incompatible with the current version of LuaXML and needs to be modified as follows before the above example document works with it:
diff --git a/rmodepdf-htmltemplates.lua b/rmodepdf-htmltemplates.lua
index 1d71d94..f1e06e7 100644
--- a/rmodepdf-htmltemplates.lua
+++ b/rmodepdf-htmltemplates.lua
@@ -1,10 +1,10 @@
-local xmltransform = require "luaxml-transform"
+local xmltransform = require("luaxml-transform").new()
-- this trick is used to print @{} in TeX: @@{}{}
-xmltransform.add_action("head", [[
+xmltransform:add_action("head", [[
\tableofcontents
]])
--- xmltransform.add_action("head", [[
+-- xmltransform:add_action("head", [[
-- \noindent\begin{tabular}{@@{}{}p{.2\textwidth}p{.75\textwidth}@@{}{}}
-- %s
-- \end{tabular}\par\bigskip
@@ -12,36 +12,36 @@ xmltransform.add_action("head", [[
-- \tableofcontents
-- ]])
--- xmltransform.add_action("meta", [[\textbf{@{name}} & @{content}\\ ]])
--- xmltransform.add_action("meta[name='author']", [[\textbf{@{name}} & \textbf{@{content}}\\ ]])
--- xmltransform.add_action("title", [[\textbf{title} & %s\\ ]])
-xmltransform.add_action("img", [[\noindent\includegraphics[max width=\textwidth]{@{src}}]])
+-- xmltransform:add_action("meta", [[\textbf{@{name}} & @{content}\\ ]])
+-- xmltransform:add_action("meta[name='author']", [[\textbf{@{name}} & \textbf{@{content}}\\ ]])
+-- xmltransform:add_action("title", [[\textbf{title} & %s\\ ]])
+xmltransform:add_action("img", [[\noindent\includegraphics[max width=\textwidth]{@{src}}]])
-xmltransform.add_action("h1", [[\addcontentsline{toc}{section}{%s}\section*{%s}
+xmltransform:add_action("h1", [[\addcontentsline{toc}{section}{%s}\section*{%s}
]])
-xmltransform.add_action("h2", [[\addcontentsline{toc}{subsection}{%s}\subsection*{%s}
+xmltransform:add_action("h2", [[\addcontentsline{toc}{subsection}{%s}\subsection*{%s}
]])
-- don't add lower sectioning level than subsection
-xmltransform.add_action("h3", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
+xmltransform:add_action("h3", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
]])
-xmltransform.add_action("h4", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
+xmltransform:add_action("h4", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
]])
-xmltransform.add_action("h5", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
+xmltransform:add_action("h5", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
]])
-xmltransform.add_action("h6", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
+xmltransform:add_action("h6", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
]])
-xmltransform.add_action("i", [[\textit{%s}]])
-xmltransform.add_action("em", [[\emph{%s}]])
-xmltransform.add_action("b", [[\textbf{%s}]])
-xmltransform.add_action("strong", [[\textbf{%s}]])
-xmltransform.add_action("tt", [[\texttt{%s}]])
-xmltransform.add_action("samp", [[\texttt{%s}]])
-xmltransform.add_action("kbd", [[\texttt{%s}]])
-xmltransform.add_action("var", [[\textit{%s}]])
-xmltransform.add_action("dfn", [[\texttt{%s}]])
-xmltransform.add_action("code", [[\texttt{%s}]])
-xmltransform.add_action("a[href]", [[\textit{%s}\protect\footnote{\texttt{@{href}}}]])
+xmltransform:add_action("i", [[\textit{%s}]])
+xmltransform:add_action("em", [[\emph{%s}]])
+xmltransform:add_action("b", [[\textbf{%s}]])
+xmltransform:add_action("strong", [[\textbf{%s}]])
+xmltransform:add_action("tt", [[\texttt{%s}]])
+xmltransform:add_action("samp", [[\texttt{%s}]])
+xmltransform:add_action("kbd", [[\texttt{%s}]])
+xmltransform:add_action("var", [[\textit{%s}]])
+xmltransform:add_action("dfn", [[\texttt{%s}]])
+xmltransform:add_action("code", [[\texttt{%s}]])
+xmltransform:add_action("a[href]", [[\textit{%s}\protect\footnote{\texttt{@{href}}}]])
local itemize = [[
@@ -49,23 +49,23 @@ local itemize = [[
%s
\end{itemize}
]]
-xmltransform.add_action("ul", itemize)
-xmltransform.add_action("menu", itemize)
-xmltransform.add_action("ol", [[
+xmltransform:add_action("ul", itemize)
+xmltransform:add_action("menu", itemize)
+xmltransform:add_action("ol", [[
\begin{enumerate}
%s
\end{enumerate}
]])
-xmltransform.add_action("dl", [[
+xmltransform:add_action("dl", [[
\begin{description}
%s
\end{description}
]])
-xmltransform.add_action("li", "\\item %s\n")
-xmltransform.add_action("dt", "\\item[%s] ")
+xmltransform:add_action("li", "\\item %s\n")
+xmltransform:add_action("dt", "\\item[%s] ")
local quote = [[
\begin{quotation}
@@ -73,40 +73,40 @@ local quote = [[
\end{quotation}
]]
-xmltransform.add_action("blockquote", quote)
-xmltransform.add_action("q", "\\enquote{%s}")
-xmltransform.add_action("abbr", "%s\\protect\\footnote{@{title}}")
-xmltransform.add_action("sup", "\\textsuperscript{%s}")
-xmltransform.add_action("sub", "\\textsubscript{%s}")
+xmltransform:add_action("blockquote", quote)
+xmltransform:add_action("q", "\\enquote{%s}")
+xmltransform:add_action("abbr", "%s\\protect\\footnote{@{title}}")
+xmltransform:add_action("sup", "\\textsuperscript{%s}")
+xmltransform:add_action("sub", "\\textsubscript{%s}")
-xmltransform.add_action("table", [[
+xmltransform:add_action("table", [[
\begin{calstable}
%s
\end{calstable}
]])
-xmltransform.add_action("tr", "\\brow %s \\erow")
-xmltransform.add_action("td", "\\cell{%s}")
-xmltransform.add_action("th", "\\cell{%s}")
+xmltransform:add_action("tr", "\\brow %s \\erow")
+xmltransform:add_action("td", "\\cell{%s}")
+xmltransform:add_action("th", "\\cell{%s}")
-- this is the original code for verbatim, but I changed LuaXML to not escape characters in verbatim,
-- so we can use the verbatim environment
-xmltransform.add_action("pre", [[{\parindent=0pt\obeylines\ttfamily\catcode`\ =\active\def {\ }\catcode`\#=11%%
+xmltransform:add_action("pre", [[{\parindent=0pt\obeylines\ttfamily\catcode`\ =\active\def {\ }\catcode`\#=11%%
%s}
]], {verbatim=true})
-xmltransform.add_action("pre *", [[%s]])
+xmltransform:add_action("pre *", [[%s]])
--
-xmltransform.add_action("pre", [[
+xmltransform:add_action("pre", [[
\begin{verbatim}%s\end{verbatim}
]], {verbatim=true})
-xmltransform.add_action("details", [[%s
+xmltransform:add_action("details", [[%s
]])
-xmltransform.add_action("details summary", [[
+xmltransform:add_action("details summary", [[
\medskip
\noindent %s
@@ -114,7 +114,7 @@ xmltransform.add_action("details summary", [[
\noindent
]])
-xmltransform.add_action("figure", [[
+xmltransform:add_action("figure", [[
\begin{figure}[hbt!]
\centering
@@ -123,26 +123,26 @@ xmltransform.add_action("figure", [[
\end{figure}
]])
-xmltransform.add_action("figcaption", [[\caption{%s}]])
+xmltransform:add_action("figcaption", [[\caption{%s}]])
-xmltransform.add_action("p", [[
+xmltransform:add_action("p", [[
%s
]])
-xmltransform.add_action("br", [[\\]])
+xmltransform:add_action("br", [[\\]])
-- some fixes for weird web pages
-xmltransform.add_action("a p", [[%s]])
-xmltransform.add_action("h1 a[href], h2 a[href], h3 a[href], h4 a[href], h5 a[href], h6 a[href]", "%s")
+xmltransform:add_action("a p", [[%s]])
+xmltransform:add_action("h1 a[href], h2 a[href], h3 a[href], h4 a[href], h5 a[href], h6 a[href]", "%s")
-- mathjax is special element added by rmodepdf around LaTeX math
-xmltransform.add_action("mathjax",[[%s]], {verbatim=true,collapse_newlines=false})
+xmltransform:add_action("mathjax",[[%s]], {verbatim=true,collapse_newlines=false})
-xmltransform.add_action("hyperlink", "\\hyperlink{@{href}}{%s}")
-xmltransform.add_action("hypertarget", "\\hypertarget{@{id}}{%s}")
+xmltransform:add_action("hyperlink", "\\hyperlink{@{href}}{%s}")
+xmltransform:add_action("hypertarget", "\\hypertarget{@{id}}{%s}")
return xmltransform
I need Rmodepdf, so that there are sane default transformation rules that we can rely on. However, I would prefer not to maintain a modified copy of rmodepdf-htmltemplates.lua in the Markdown package. Would you consider releasing Rmodepdf on CTAN and making it compatible with the current version of LuaXML as in the above patch?
The problem is that this syntax is already included in TUG and CSTUG articles and also in my YouTube presentation from TUG.
I've changed the code of rmodepdf-htmltemplates.lua to allow this:
local output = require("rmodepdf-htmltemplates").process_dom(dom)
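For readers less familiar with Lua, the difference between the dot and colon call styles at issue here can be shown with a self-contained miniature. The Transformer class and the wrapper module M are hypothetical and are not LuaXML's or Rmodepdf's actual code; the wrapper is only one possible way to accept dot calls.

```lua
-- Hypothetical miniature of a transformer object; not LuaXML's actual code.
local Transformer = {}
Transformer.__index = Transformer

function Transformer.new()
  return setmetatable({ actions = {} }, Transformer)
end

-- Declared with a colon, so the object arrives as the implicit `self`.
function Transformer:add_action(selector, template)
  self.actions[selector] = template
end

local t = Transformer.new()

-- Colon call: sugar for t.add_action(t, "i", ...).
t:add_action("i", "\\textit{%s}")

-- Dot call on the same object: "i" is taken as `self`, so the call fails.
local ok = pcall(t.add_action, "i", "\\textit{%s}")
print(ok)  --> false

-- One possible way to accept dot calls: wrappers bound to a shared instance.
local default = Transformer.new()
local M = {}
function M.add_action(selector, template)
  return default:add_action(selector, template)
end
M.add_action("b", "\\textbf{%s}")
print(default.actions["b"])  --> \textbf{%s}
```

This is why code written against a module of plain functions breaks once the module returns an object created with new(), as in the patch above.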
Another problem is that the documentation is still not finished, so I need to finish it before submitting to CTAN.
Anyway, I think the best thing would be to include HTML templates in LuaXML. This works with the development version:
local output = require("luaxml-htmltemplates"):process_dom(dom)
Thanks, this seems perfect. After you have published the development version to CTAN, I will update the experimental defaults to use LuaXML.
Before I publish it, do you have any additional ideas for transformation rules that I could add?
Not at this moment.
OK, I've uploaded a new LuaXML version to CTAN.