Use Rmodepdf and LuaXML to display block HTML elements

As discussed with @michal-h21 before and after their TUG 2024 talk (slides, preprint), we may want to look into using the LuaXML library with the default transformation rules from rmodepdf to display block HTML elements.

For inline HTML elements, this does not seem applicable, because the renderers they produce do not necessarily form complete HTML fragments that can be represented as a DOM:

$ docker run --rm -it witiko/markdown markdown-cli html=true <<< 'Hello <i>world</i>!'

\markdownRendererDocumentBegin
Hello \markdownRendererInlineHtmlTag{<i>}world\markdownRendererInlineHtmlTag{</i>}!\markdownRendererDocumentEnd

We can't easily change this, since the CommonMark standard allows Markdown markup within inline HTML elements.

Jul 22 '24 08:07 Witiko

You can wrap HTML fragments in some dummy element to prevent parsing issues. I also think that you can process the text nodes for Markdown, so it should be possible to use it here.

This is a proof of concept:

kpse.set_program_name "luatex"
local domobject = require("luaxml-domobject")
local transform = require("luaxml-transform")

local function parse(block)
  -- wrap the text in a container element, so it doesn't matter that the HTML markup can be incomplete
  -- <body> is a good candidate
  local dom = domobject.html_parse("<body>" .. block .. "</body>")
  return dom
end



local function should_expand(element)
  -- test if we should expand markdown in this element
  local element_name = element:get_element_name()
  -- do some tests with the element name
  -- ...
  -- for now, just return true
  return true
end


local function process_markdown(text)
  -- this is just an example. the real function would need to be much more complex
  text = text:gsub("%*(..-)%*", "\\textit{%1}")
  return text
end

local function expand_markdown(element)
  -- recursively loop over child elements and expand markdown in text nodes
  for i, child in ipairs(element:get_children()) do
    if child:is_element() then
      -- recurse for child elements
      expand_markdown(child)
    elseif child:is_text() and should_expand(element) then
      -- run this only on text nodes in elements that should be processed
      child._text = process_markdown(child._text)
    end
  end
end

local transformer = transform.new()

-- disable escaping of TeX commands and braces
transformer.unicodes = {
  [92] = nil,
  [123] = nil,
  [125] = nil,
}

-- actions for HTML elements
transformer:add_action("i", "\\textit{%s}")
transformer:add_action("b", "\\textbf{%s}")

local test = "Hello <i>world</i>! Another text <b>with *markdown*</b>"
local dom = parse(test)
expand_markdown(dom:root_node())

-- debugging print of the processed DOM
print(dom:serialize())

-- and now convert to TeX
print(transformer:process_dom(dom))

For this test string: "Hello <i>world</i>! Another text <b>with *markdown*</b>", it produces the following output:

<body>Hello <i>world</i>! Another text <b>with \textit{markdown}</b></body>
Hello \textit{world}! Another text \textbf{with \textit{markdown}}
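
As the comment in `process_markdown` says, the single `gsub` is only a stand-in. A small check in plain Lua (no LuaXML needed) illustrates where the pattern stops being adequate; the sample inputs are made up for illustration:

```lua
-- Same stand-in as in the proof of concept above.
local function process_markdown(text)
  -- gsub returns two values; the parentheses keep only the string
  return (text:gsub("%*(..-)%*", "\\textit{%1}"))
end

-- Simple emphasis is handled as intended.
print(process_markdown("with *markdown*"))  --> with \textit{markdown}

-- Strong emphasis is misread as emphasis around "*bold".
print(process_markdown("**bold**"))         --> \textit{*bold}*

-- Asterisks surrounded by spaces are still treated as delimiters.
print(process_markdown("2 * 3 * 4"))        --> 2 \textit{ 3 } 4
```

A real implementation would presumably pass each text node to a proper Markdown parser instead of a pattern.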

Jul 22 '24 09:07 michal-h21

That's a compelling approach: First, parse the Markdown document as an HTML document, construct a DOM, and only then convert the text nodes from Markdown to LaTeX. However, it seems incompatible with the current approach of CommonMark in general and the Markdown package in particular, where we first parse the whole document as a Markdown document and then identify HTML code within the document.

An alternative would be to redefine \markdownRendererInlineHtmlTag to scan ahead for all other \markdownRendererInlineHtmlTags within the same paragraph/block, replace the intervening texts with unique identifiers, process the combined text with LuaXML, and replace the identifiers with the intervening texts. For example:

1. Take Hello \markdownRendererInlineHtmlTag{<i>}world\markdownRendererInlineHtmlTag{</i>}!.
2. Combine all inline tags into a single string: <i>world</i>.
3. Replace the intervening texts with unique identifiers: <i>TEXT1</i>.
4. Process with LuaXML: \textit{TEXT1}.
5. Replace the unique identifiers with the intervening texts: \textit{world}.
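
The steps above can be sketched in plain Lua. This is only an illustration under stated assumptions: `render_fragment` and the `segments` layout are hypothetical, and `toy_convert` is a pattern-based stand-in for the LuaXML step, not the LuaXML API. The sketch also assumes that the placeholder `TEXT<n>` never occurs in the converter's own output.

```lua
-- Stand-in for the LuaXML step; it only knows <i> and <b>.
local function toy_convert(html)
  html = html:gsub("<i>(.-)</i>", "\\textit{%1}")
  html = html:gsub("<b>(.-)</b>", "\\textbf{%1}")
  return html
end

-- `segments` lists the inline tags and the texts between them in order.
local function render_fragment(segments, convert_html)
  local html, texts = {}, {}
  for _, segment in ipairs(segments) do
    if segment.tag then
      html[#html + 1] = segment.tag
    else
      -- keep the text aside and leave a placeholder in its place
      texts[#texts + 1] = segment.text
      html[#html + 1] = "TEXT" .. #texts
    end
  end
  local converted = convert_html(table.concat(html))
  -- a function replacement inserts the text literally, so the
  -- intervening texts never pass through the HTML converter
  return (converted:gsub("TEXT(%d+)", function(index)
    return texts[tonumber(index)]
  end))
end

print(render_fragment({
  { tag = "<i>" }, { text = "world" }, { tag = "</i>" },
}, toy_convert))  --> \textit{world}
```

Doing the substitution in Lua rather than in TeX macros would avoid some of the plumbing, but the intervening texts would still have to be collected on the TeX side.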

However, this seems like a lot of plumbing in TeX, which runs the risk of breaking commands that change catcodes such as \verb in hybrid mode. A better option would be to introduce a renderer \markdownRendererInlineHtmlFragment instead, which would receive a variable number of parameters:

\markdownRendererInlineHtmlFragment{2}{<i>}{world}{</i>}

However, we can't just do that without breaking compatibility, since users may already rely on \markdownRendererInlineHtmlTag. Perhaps we can have a backwards-compatibility definition of \markdownRendererInlineHtmlFragment that would expand to \markdownRendererInlineHtmlTag{<i>}world\markdownRendererInlineHtmlTag{</i>} in my example. This definition would be used when the user has redefined \markdownRendererInlineHtmlTag or \markdownRendererInlineHtmlTagPrototype. This would allow us to use LuaXML with both block and inline HTML elements.
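
To make the two forms concrete, here is a rough Lua sketch of how a fragment could be serialized either way. The helper names `fragment_call` and `legacy_expansion` are hypothetical, and reading the first parameter as the number of tags is an assumption based on the `{2}` in the example above:

```lua
-- Proposed form: one renderer call that carries the whole fragment.
-- Assumption: the first argument is the number of tags.
local function fragment_call(tags, texts)
  local parts = { "\\markdownRendererInlineHtmlFragment{" .. #tags .. "}" }
  for i, tag in ipairs(tags) do
    parts[#parts + 1] = "{" .. tag .. "}"
    if texts[i] then
      parts[#parts + 1] = "{" .. texts[i] .. "}"
    end
  end
  return table.concat(parts)
end

-- Backwards-compatible form: one \markdownRendererInlineHtmlTag per tag,
-- with the intervening texts left in the normal token stream.
local function legacy_expansion(tags, texts)
  local parts = {}
  for i, tag in ipairs(tags) do
    parts[#parts + 1] = "\\markdownRendererInlineHtmlTag{" .. tag .. "}"
    if texts[i] then
      parts[#parts + 1] = texts[i]
    end
  end
  return table.concat(parts)
end

local tags, texts = { "<i>", "</i>" }, { "world" }
print(fragment_call(tags, texts))
print(legacy_expansion(tags, texts))
```

Both functions consume the same data, so choosing between them at run time would only depend on whether the user has redefined the old renderer.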

Jul 22 '24 10:07 Witiko

Well, I don't know much about CommonMark or about how the Markdown package processes the document, so I am not sure what the best way is and cannot comment on this :( I can only help on the LuaXML end, I am afraid.

Jul 22 '24 17:07 michal-h21

That's OK, few people do! I am happy to put in the work on the Markdown side of things.

Jul 22 '24 19:07 Witiko

A better option would be to introduce a renderer \markdownRendererInlineHtmlFragment instead, which would receive a variable number of parameters:
\markdownRendererInlineHtmlFragment{2}{<i>}{world}{</i>}

However, things would still break if, instead of "world", there were some brittle content that needs to appear at the top level of a file. We can still fix this by putting "world" into a separate file.
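
A minimal sketch of the separate-file idea in Lua, assuming a hypothetical helper `stash_text` and assuming that the renderer would then receive an `\input` command in place of the text itself:

```lua
-- Write the intervening text to its own file and return a TeX command
-- that reads it back, so that the text is tokenized only when it is input.
local function stash_text(text, filename)
  local file = assert(io.open(filename, "w"))
  assert(file:write(text))
  assert(file:close())
  return "\\input{" .. filename .. "}"
end

-- Usage with a temporary file; a real implementation would presumably
-- pick a name inside the package's cache directory instead.
local filename = os.tmpname()
print(stash_text("\\verb|%|", filename))
os.remove(filename)
```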

[...] we may want to look into using the LuaXML library with the default transformation rules from rmodepdf to display block HTML elements.

Come to think of it, in CommonMark, block HTML elements do not necessarily represent complete HTML fragments that can be represented in a DOM either. Therefore, we would need to do something similar to the command \markdownRendererInlineHtmlFragment on the level of blocks.

Both changes seem significant and possibly breaking for some users. Let's do something simpler instead and only use Rmodepdf and LuaXML for raw HTML blocks and HTML file transclusion, as these are both very likely to contain complete HTML fragments.

Jul 24 '24 12:07 Witiko

Both changes seem significant and possibly breaking for some users. Let's do something simpler instead and only use Rmodepdf and LuaXML for raw HTML blocks and HTML file transclusion, as these are both very likely to contain complete HTML fragments.

@michal-h21 Here is a demo of the new experimental LaTeX defaults for the Markdown package for TeX that use LuaXML:

\documentclass{article}
\usepackage[experimental, raw_attribute, content_blocks]{markdown}
\begin{filecontents}[overwrite, nosearch, noheader]{example.html}
<b>foo</b> <i>bar</i>
\end{filecontents}
\begin{document}
\begin{markdown}

Raw text span: `<b>foo</b> <i>bar</i>`{=html}

Raw code block:

``` {=html}
<b>foo</b> <i>bar</i>
```

Content block:

 /example.html

\end{markdown}
\end{document}

Here is the complete code that reproduces the above output with the current version of the Markdown package:

\documentclass{article}
\usepackage{markdown}
\begin{filecontents}[overwrite, nosearch, noheader]{example.html}
<b>foo</b> <i>bar</i>
\end{filecontents}
\directlua{
  function convert_html(input)
    local dom = require("luaxml-domobject").html_parse("<body>" .. input .. "</body>")
    local output = require("rmodepdf-htmltemplates"):process_dom(dom)
    return output
  end
  function print_html(input_filename)
    local input_file = assert(io.open(input_filename, "r"))
    local input = assert(input_file:read("*a"))
    assert(input_file:close())
    local output = convert_html(input)
    tex.print(output)
  end
}
\ExplSyntaxOn
\markdownSetup
  {
    raw_attribute,
    content_blocks,
    renderers = {
      inputRaw(Inline|Block) = {
        \str_case:nn
          { #2 }
          {
            { html }
              {
                \lua_now:e
                  { print_html("\lua_escape:n { #1 }") }
              }
          }
      },
      contentBlock = {
        \str_case:nn
          { #1 }
          {
            { html }
              {
                \lua_now:e
                  { print_html("\lua_escape:n { #3 }") }
              }
          }
      },
    },
  }
\ExplSyntaxOff
\begin{document}
\begin{markdown}

Raw text span: `<b>foo</b> <i>bar</i>`{=html}

Raw code block:

``` {=html}
<b>foo</b> <i>bar</i>
```

Content block:

 /example.html

\end{markdown}
\end{document}

The only issue I currently have is with Rmodepdf:

It's not on CTAN, which means that users would need to manually download rmodepdf-htmltemplates.lua.
The file rmodepdf-htmltemplates.lua seems incompatible with the current version of LuaXML and needs to be modified as follows before the above example document works with it:

diff --git a/rmodepdf-htmltemplates.lua b/rmodepdf-htmltemplates.lua
index 1d71d94..f1e06e7 100644
--- a/rmodepdf-htmltemplates.lua
+++ b/rmodepdf-htmltemplates.lua
@@ -1,10 +1,10 @@
-local xmltransform = require "luaxml-transform"
+local xmltransform = require("luaxml-transform").new()
 
 -- this trick is used to print @{} in TeX: @@{}{}
-xmltransform.add_action("head", [[
+xmltransform:add_action("head", [[
 \tableofcontents
 ]])
--- xmltransform.add_action("head", [[
+-- xmltransform:add_action("head", [[
 -- \noindent\begin{tabular}{@@{}{}p{.2\textwidth}p{.75\textwidth}@@{}{}}
 -- %s
 -- \end{tabular}\par\bigskip
@@ -12,36 +12,36 @@ xmltransform.add_action("head", [[
 -- \tableofcontents
 -- ]])
 
--- xmltransform.add_action("meta", [[\textbf{@{name}} & @{content}\\ ]])
--- xmltransform.add_action("meta[name='author']", [[\textbf{@{name}} & \textbf{@{content}}\\ ]])
--- xmltransform.add_action("title", [[\textbf{title} & %s\\ ]])
-xmltransform.add_action("img", [[\noindent\includegraphics[max width=\textwidth]{@{src}}]])
+-- xmltransform:add_action("meta", [[\textbf{@{name}} & @{content}\\ ]])
+-- xmltransform:add_action("meta[name='author']", [[\textbf{@{name}} & \textbf{@{content}}\\ ]])
+-- xmltransform:add_action("title", [[\textbf{title} & %s\\ ]])
+xmltransform:add_action("img", [[\noindent\includegraphics[max width=\textwidth]{@{src}}]])
 
-xmltransform.add_action("h1", [[\addcontentsline{toc}{section}{%s}\section*{%s}
+xmltransform:add_action("h1", [[\addcontentsline{toc}{section}{%s}\section*{%s}
 ]])
-xmltransform.add_action("h2", [[\addcontentsline{toc}{subsection}{%s}\subsection*{%s}
+xmltransform:add_action("h2", [[\addcontentsline{toc}{subsection}{%s}\subsection*{%s}
 ]])
 -- don't add lower sectioning level than subsection
-xmltransform.add_action("h3", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
+xmltransform:add_action("h3", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
 ]])
-xmltransform.add_action("h4", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
+xmltransform:add_action("h4", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
 ]])
-xmltransform.add_action("h5", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
+xmltransform:add_action("h5", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
 ]])
-xmltransform.add_action("h6", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
+xmltransform:add_action("h6", [[\addcontentsline{toc}{subsubsection}{%s}\subsubsection*{%s}
 ]])
 
-xmltransform.add_action("i", [[\textit{%s}]])
-xmltransform.add_action("em", [[\emph{%s}]])
-xmltransform.add_action("b", [[\textbf{%s}]])
-xmltransform.add_action("strong", [[\textbf{%s}]])
-xmltransform.add_action("tt", [[\texttt{%s}]])
-xmltransform.add_action("samp", [[\texttt{%s}]])
-xmltransform.add_action("kbd", [[\texttt{%s}]])
-xmltransform.add_action("var", [[\textit{%s}]])
-xmltransform.add_action("dfn", [[\texttt{%s}]])
-xmltransform.add_action("code", [[\texttt{%s}]])
-xmltransform.add_action("a[href]", [[\textit{%s}\protect\footnote{\texttt{@{href}}}]])
+xmltransform:add_action("i", [[\textit{%s}]])
+xmltransform:add_action("em", [[\emph{%s}]])
+xmltransform:add_action("b", [[\textbf{%s}]])
+xmltransform:add_action("strong", [[\textbf{%s}]])
+xmltransform:add_action("tt", [[\texttt{%s}]])
+xmltransform:add_action("samp", [[\texttt{%s}]])
+xmltransform:add_action("kbd", [[\texttt{%s}]])
+xmltransform:add_action("var", [[\textit{%s}]])
+xmltransform:add_action("dfn", [[\texttt{%s}]])
+xmltransform:add_action("code", [[\texttt{%s}]])
+xmltransform:add_action("a[href]", [[\textit{%s}\protect\footnote{\texttt{@{href}}}]])
 
 
 local itemize = [[
@@ -49,23 +49,23 @@ local itemize = [[
 %s
 \end{itemize}
 ]]
-xmltransform.add_action("ul", itemize)
-xmltransform.add_action("menu", itemize)
-xmltransform.add_action("ol", [[
+xmltransform:add_action("ul", itemize)
+xmltransform:add_action("menu", itemize)
+xmltransform:add_action("ol", [[
 \begin{enumerate}
 %s
 \end{enumerate}
 ]])
 
-xmltransform.add_action("dl", [[
+xmltransform:add_action("dl", [[
 \begin{description}
 %s
 \end{description}
 ]])
 
 
-xmltransform.add_action("li", "\\item %s\n")
-xmltransform.add_action("dt", "\\item[%s] ")
+xmltransform:add_action("li", "\\item %s\n")
+xmltransform:add_action("dt", "\\item[%s] ")
 
 local quote = [[
 \begin{quotation}
@@ -73,40 +73,40 @@ local quote = [[
 \end{quotation}
 ]]
 
-xmltransform.add_action("blockquote", quote)
-xmltransform.add_action("q", "\\enquote{%s}")
-xmltransform.add_action("abbr", "%s\\protect\\footnote{@{title}}")
-xmltransform.add_action("sup", "\\textsuperscript{%s}")
-xmltransform.add_action("sub", "\\textsubscript{%s}")
+xmltransform:add_action("blockquote", quote)
+xmltransform:add_action("q", "\\enquote{%s}")
+xmltransform:add_action("abbr", "%s\\protect\\footnote{@{title}}")
+xmltransform:add_action("sup", "\\textsuperscript{%s}")
+xmltransform:add_action("sub", "\\textsubscript{%s}")
 
-xmltransform.add_action("table", [[
+xmltransform:add_action("table", [[
 \begin{calstable}
 %s
 \end{calstable}
 ]])
 
-xmltransform.add_action("tr", "\\brow %s \\erow")
-xmltransform.add_action("td", "\\cell{%s}")
-xmltransform.add_action("th", "\\cell{%s}")
+xmltransform:add_action("tr", "\\brow %s \\erow")
+xmltransform:add_action("td", "\\cell{%s}")
+xmltransform:add_action("th", "\\cell{%s}")
 
 
 -- this is the original code for verbatim, but I changed LuaXML to not escape characters in verbatim,
 -- so we can use the verbatim environment
-xmltransform.add_action("pre", [[{\parindent=0pt\obeylines\ttfamily\catcode`\ =\active\def {\ }\catcode`\#=11%%
+xmltransform:add_action("pre", [[{\parindent=0pt\obeylines\ttfamily\catcode`\ =\active\def {\ }\catcode`\#=11%%
 %s}
 
 ]], {verbatim=true})
-xmltransform.add_action("pre *", [[%s]])
+xmltransform:add_action("pre *", [[%s]])
 
 -- 
-xmltransform.add_action("pre", [[
+xmltransform:add_action("pre", [[
 \begin{verbatim}%s\end{verbatim}
 ]], {verbatim=true})
 
-xmltransform.add_action("details", [[%s
+xmltransform:add_action("details", [[%s
 ]])
 
-xmltransform.add_action("details summary", [[
+xmltransform:add_action("details summary", [[
 \medskip
 \noindent %s
 
@@ -114,7 +114,7 @@ xmltransform.add_action("details summary", [[
 \noindent
 ]])
 
-xmltransform.add_action("figure", [[
+xmltransform:add_action("figure", [[
 \begin{figure}[hbt!]
 \centering
 
@@ -123,26 +123,26 @@ xmltransform.add_action("figure", [[
 \end{figure}
 ]])
 
-xmltransform.add_action("figcaption", [[\caption{%s}]])
+xmltransform:add_action("figcaption", [[\caption{%s}]])
 
 
-xmltransform.add_action("p", [[
+xmltransform:add_action("p", [[
 
 %s
 
 ]])
 
-xmltransform.add_action("br", [[\\]])
+xmltransform:add_action("br", [[\\]])
 
 -- some fixes for weird web pages
-xmltransform.add_action("a p", [[%s]])
-xmltransform.add_action("h1 a[href], h2 a[href], h3 a[href], h4 a[href], h5 a[href], h6 a[href]", "%s")
+xmltransform:add_action("a p", [[%s]])
+xmltransform:add_action("h1 a[href], h2 a[href], h3 a[href], h4 a[href], h5 a[href], h6 a[href]", "%s")
 
 
 -- mathjax is special element added by rmodepdf around LaTeX math
-xmltransform.add_action("mathjax",[[%s]], {verbatim=true,collapse_newlines=false})
+xmltransform:add_action("mathjax",[[%s]], {verbatim=true,collapse_newlines=false})
 
-xmltransform.add_action("hyperlink", "\\hyperlink{@{href}}{%s}")
-xmltransform.add_action("hypertarget", "\\hypertarget{@{id}}{%s}")
+xmltransform:add_action("hyperlink", "\\hyperlink{@{href}}{%s}")
+xmltransform:add_action("hypertarget", "\\hypertarget{@{id}}{%s}")
 
 return xmltransform
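
For readers less familiar with Lua, the patch boils down to two changes: the module now creates a transformer object with `new()`, and every `add_action` call uses the colon syntax, which passes that object as the implicit first argument. The toy `Transformer` below is a stand-in written for illustration, not LuaXML's actual implementation:

```lua
local Transformer = {}
Transformer.__index = Transformer

-- Each call to new() returns an object with its own set of templates.
local function new()
  return setmetatable({ templates = {} }, Transformer)
end

-- Declared with a colon, so the method expects `self` as its first argument.
function Transformer:add_action(selector, template)
  self.templates[selector] = template
end

local transformer = new()

-- The colon call passes `transformer` as `self` ...
transformer:add_action("i", "\\textit{%s}")
-- ... and is shorthand for this explicit dot call.
transformer.add_action(transformer, "b", "\\textbf{%s}")

-- A dot call without the object binds the selector to `self` and fails.
local ok = pcall(transformer.add_action, "em", "\\emph{%s}")
print(ok)  --> false
```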

I need Rmodepdf, so that there are sane default transformation rules that we can rely on. However, I would prefer not to maintain a modified copy of rmodepdf-htmltemplates.lua in the Markdown package. Would you consider releasing Rmodepdf on CTAN and making it compatible with the current version of LuaXML as in the above patch?

Nov 13 '24 17:11 Witiko

The problem is that this syntax is already included in the TUG and CSTUG articles and also in my YouTube presentation from TUG.

I've changed the code of rmodepdf-htmltemplates.lua to allow this:

local output = require("rmodepdf-htmltemplates").process_dom(dom)

Another problem is that the documentation is still not finished, so I need to finish it before submitting to CTAN.

Anyway, I think the best thing would be to include HTML templates in LuaXML. This works with the development version:

    local output = require("luaxml-htmltemplates"):process_dom(dom)

Nov 14 '24 14:11 michal-h21

Anyway, I think the best thing would be to include HTML templates in LuaXML. This works with the development version:

Thanks, this seems perfect. After you have published the development version to CTAN, I will update the experimental defaults to use LuaXML.

Nov 14 '24 19:11 Witiko

Before I publish it, do you have any additional ideas for transformation rules that I could add?

Nov 14 '24 19:11 michal-h21

Not at this moment.

Nov 14 '24 23:11 Witiko

OK, I've uploaded a new LuaXML version to CTAN.

Nov 15 '24 10:11 michal-h21
