confirming input format restriction

Open exl022 opened this issue 4 years ago • 10 comments

I wanted to confirm that this works only from .md to other files like .docx? It won't work from .tex to .docx? I was trying this, since I have .tex files, but it doesn't seem to work. I see from some examples of users that they can use this from .tex, but I think that it doesn't work.

exl022 avatar Mar 07 '20 02:03 exl022

The principal aim is to add cross-references to Markdown, yes. Some other input formats happen to work also. No, TeX in general is not recognized, although under some special conditions it can kind of work (with pandoc 2.9.2 or later specifically, it should work at least for figures and tables, if labels are in pandoc-crossref format, i.e. fig:... for figures, etc.).

Side note, converting TeX to docx in general is nigh impossible, that was one of the motivations for me to switch to Markdown for my writing and making pandoc-crossref in the first place.
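
To check whether a given .tex file already uses labels in that format, a small script can list the labels that lack a recognized prefix. This is only a sketch: the function name unprefixed_labels and the sample input are made up for illustration, the prefix list (fig:, tbl:, eq:, sec:, lst:) is assumed to match pandoc-crossref's defaults, and commented-out \label commands are not skipped.

```python
import re

# Prefixes assumed to be the pandoc-crossref defaults.
KNOWN_PREFIXES = ("fig:", "tbl:", "eq:", "sec:", "lst:")

# Captures the argument of every \label{...} command.
LABEL_RE = re.compile(r"\\label\{([^}]+)\}")


def unprefixed_labels(tex_source):
    """Return the labels in tex_source that lack a known prefix."""
    labels = LABEL_RE.findall(tex_source)
    return [lab for lab in labels if not lab.startswith(KNOWN_PREFIXES)]


sample = r"""
\begin{figure}\label{fig:setup}\end{figure}
\begin{align}\label{ySOL} y(t)=2\,x \end{align}
"""
print(unprefixed_labels(sample))  # ['ySOL']
```

Labels reported by such a check would probably need renaming in the .tex source before pandoc-crossref can pick them up.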

lierdakil avatar Mar 07 '20 05:03 lierdakil

Okay, I understand, thanks for the clarification. Then one route I want to explore is converting from .tex to .text (a markdown file). I realize you pointed out that the starting point is supposed to be markdown, not tex, but I'm still trying to figure out how this works with cleveref, since this is a highlighted feature.

A lot of the content comes over just fine. However, the numbered references (table numbers, figure numbers, equation numbers) are dropped, and the \ref, \pageref and \label relationships are dropped as well. I have used the cleveref package in the past, but I can change my workflow to avoid it if that works better. I see some syntax in the markdown file that looks foreign to me (I'm not that familiar with markdown yet), but when I move that to Word, the ref labels are printed in the text, and the \ref and \pageref pointers don't work. Any guidance on getting from .tex to .text (markdown) while preserving reference numbers?

exl022 avatar Mar 07 '20 15:03 exl022

pandoc-crossref supports generating cleveref-flavored LaTeX as output, not reading it as input. You know, in case you want to get a reasonable LaTeX document to send to your publisher or something.

So, yeah, I forgot to mention one detail: when reading LaTeX, pandoc doesn't really handle \refs the way pandoc-crossref expects it to. It does try to put numbers in for plain old \refs, but it doesn't fare that well with \crefs, not in general anyway. So anyway, I've slapped together a lua filter that tries to convert references to the format pandoc-crossref expects (as a bonus, it will also try to add labels to display math):

function Link(el)
  if el.attributes["reference-type"]=="ref" then
    local citations = {}
    for cit in el.attributes["reference"]:gmatch('[^,]+') do
      citations[#citations+1] = pandoc.Citation(cit, "NormalCitation")
    end
    return pandoc.Cite("", citations)
  end
end

function Math(el)
  if el.mathtype == "DisplayMath" then
    local label = nil
    el.text = el.text:gsub("\\label{[^}]+}", function(w) label=w:sub(8,-2); return ""; end)
    if label ~= nil then
      return pandoc.Span(el, {id=label})
    end
  end
end

Save this as texref.lua (or another name if you so desire), then run pandoc with

pandoc -f latex -t markdown --lua-filter texref.lua <your-input-file>.tex -o <your-output-file>.markdown

I'm assuming texref.lua is in the current working directory, otherwise you'll need to use the full path to it. Then you can convert the output markdown to, say, docx, using pandoc-crossref. Or if you're not interested in the markdown output itself, you might just skip a step and use something like

pandoc -f latex -t docx --lua-filter texref.lua --filter pandoc-crossref <your-input-file>.tex -o <your-output-file>.docx

This is far from perfect, but should produce reasonable results at least on some cleveref documents (well, actually, I've checked one, and it works, but YMMV). Bear in mind it's code I've literally thrown together in less than half an hour, so don't expect a miracle.

For plain old \refs you might get better results if you use SuppressAuthor instead of NormalCitation in texref.lua. In a nutshell, this SuppressAuthor will make "dumb" references, i.e. no "fig."/"figs." prefix, etc, only number, which should be more or less consistent with how \ref works.
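
To make the difference concrete, here is a rough Python model of the markdown that the two citation modes end up as. The function name to_citation_markdown is hypothetical and this is not part of the filter; it only mimics how the comma-separated reference list is split and how pandoc's markdown syntax marks a suppressed-author citation with a leading minus. Unlike the Lua loop, it also strips whitespace around each key.

```python
def to_citation_markdown(reference, suppress_author=False):
    """Render a comma-separated reference list as a pandoc citation group."""
    # Split on commas, as the Lua gmatch('[^,]+') loop does, dropping blanks.
    keys = [key.strip() for key in reference.split(",") if key.strip()]
    # NormalCitation is written @key, SuppressAuthor is written -@key.
    marker = "-@" if suppress_author else "@"
    return "[" + "; ".join(marker + key for key in keys) + "]"


print(to_citation_markdown("fig:a,fig:b"))                  # [@fig:a; @fig:b]
print(to_citation_markdown("eq:e1", suppress_author=True))  # [-@eq:e1]
```

With the first form pandoc-crossref adds its prefix ("fig.", "eq.", and so on); with the second it should print the bare number only.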

lierdakil avatar Mar 07 '20 17:03 lierdakil

Also, yeah, pandoc-crossref doesn't do \pagerefs, because pandoc's document model doesn't do pages. Which is expected, because in most output formats there isn't such a thing as a "page" anyway, and where there is, it's not well-defined.

lierdakil avatar Mar 07 '20 17:03 lierdakil

I'm not using the Lua filter above. Oddly, when I convert from LaTeX (with amsmath) to markdown, I get the labels of subequations environments correctly displayed, with working anchors in the HTML generated from this markdown, but never the equation labels themselves. Specifically,

\begin{subequations}
  \label{xyODE}
  \begin{align}
    \label{xODE}
    \frac{dx}{dt}=&\,y\\
    \label{yODE}
    \frac{dy}{dt}=&-100\,(y-2\, x).
  \end{align}
\end{subequations}

converts to

[xyODE]{#xyODE label="xyODE"} $$\begin{aligned}
    \label{xODE}
    \frac{dx}{dt}=&\,y\\
    \label{yODE}
    \frac{dy}{dt}=&-100\,(y-2\, x).
  \end{aligned}$$

but this

\begin{align}
  \label{ySOL}
  y(t)=2\,x+C\,\exp(-100 t),
\end{align}

becomes

$$\begin{aligned}
  \label{ySOL}
  y(t)=2\,x+C\,\exp(-100 t),\end{aligned}$$ 

So, the trick might be to understand why it works for subequations to then make it work for other displayed maths as well.

pipapu avatar Aug 11 '20 09:08 pipapu

@pipapu, apparently, this is something that pandoc does on its own, so this is definitely the wrong repo to discuss this.

Also, this looks like a defect in pandoc's LaTeX parser honestly, the output is at best questionable and at worst completely wrong. The reason seems to be the complete lack of support for subequations environment in pandoc's LaTeX reader, which is ignored, and questionable handling of standalone \labels. For instance, consider:

$ pandoc -f latex -t markdown <<< '\label{test}'
[\[test\]]{#test label="test"}

lierdakil avatar Aug 13 '20 08:08 lierdakil

I developed a slightly enhanced version of the filter texref.lua suggested by @lierdakil above:

function Link(el)
  if el.attributes["reference-type"]=="ref" or el.attributes["reference-type"]=="eqref" then
    local citations = {}
    for cit in el.attributes["reference"]:gmatch('[^,]+') do
      -- cit = cit:gsub(":","-")        -- for quarto
      citations[#citations+1] = pandoc.Citation(cit, "NormalCitation")
    end
    return pandoc.Cite("", citations)
  end
end

function Math(el)
  if el.mathtype == "DisplayMath" then
    local label = nil
    -- strip \label{...} from the math and remember its argument
    el.text = el.text:gsub("\\label{[^}]+}", function(w) label=w:sub(8,-2); return ""; end)
    local raw_text = '$$\n' .. el.text ..'\n$$' -- for quarto move $$ delimiter to new lines
    if label ~= nil then
      -- label = label:gsub(":","-")        -- for quarto
      raw_text = raw_text .. '{#' .. label .. '}'
    end
    return pandoc.RawInline('markdown', raw_text)
  end
end

In the function Link we also look for the reference type "eqref", because \eqref is a commonly used command for referring to equations in LaTeX. In the function Math we return a RawInline instead of a Span, because the latter wraps the display math in unnecessary square brackets.

In case you want to use https://quarto.org/ you should uncomment the lines marked -- for quarto. These replace : with - in the reference labels, because quarto uses a slightly different prefix convention for labels than pandoc-crossref (see https://quarto.org/docs/authoring/cross-references.html ).
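
For illustration, the string that the Math function assembles can be modelled in a few lines of Python. The name display_math_markdown and the quarto switch are invented for this sketch; the switch stands in for uncommenting the quarto lines in the filter.

```python
def display_math_markdown(tex, label=None, quarto=False):
    """Build the raw markdown that the Math function above would emit."""
    if label is not None and quarto:
        # Same effect as label:gsub(":", "-") in the Lua filter.
        label = label.replace(":", "-")
    # Put the $$ delimiters on their own lines, as the filter does.
    raw_text = "$$\n" + tex + "\n$$"
    if label is not None:
        raw_text += "{#" + label + "}"
    return raw_text


print(display_math_markdown(r"\gamma^2+\theta^2=\omega^2", "eq:e1", quarto=True))
```

For the label eq:e1 this prints the equation between two $$ lines, followed directly by {#eq-e1}; with quarto=False the label stays eq:e1, which is the form pandoc-crossref looks for.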

asmaier avatar Jun 28 '22 20:06 asmaier

Tried the solution suggested by @lierdakil above. Command line is

pandoc -f latex -t markdown --lua-filter texref_orig.lua mwe.tex -o out.markdown

where the filter script texref_orig.lua is as given by @lierdakil above, and mwe.tex is

\documentclass{article}
\usepackage{amsmath}
\renewcommand{\eqref}[1]{(\ref{#1})}
\begin{document}
	\section{First Section}\label{sec:s1}
	Some text.
    \begin{equation}\label{eq:e1} % Creates an equation environment and is compiled as math
		\gamma^2+\theta^2=\omega^2
	\end{equation}
	Some more text.
    \begin{equation}\label{eq:e2} % Creates an equation environment and is compiled as math
    	\gamma^2+\theta^2=\omega^2
    \end{equation} 
    Let us refer to Eqs.~(\ref{eq:e1}), \eqref{eq:e2}.
    
	\section{Second section}\label{sec:s2}
    And now we refer to Sections~\ref{sec:s1} and \ref{sec:s2}.
\end{document}

The output has the math environments enclosed in square brackets, as shown below, which does not seem correct.

# First Section {#sec:s1}

Some text. [$$ % Creates an equation environment and is compiled as math
        \gamma^2+\theta^2=\omega^2$$]{#eq:e1} Some more text.
[$$ % Creates an equation environment and is compiled as math
        \gamma^2+\theta^2=\omega^2$$]{#eq:e2} Let us refer to
Eqs. ([@eq:e1]), ([@eq:e2]).

# Second section {#sec:s2}

And now we refer to Sections [@sec:s1] and [@sec:s2].

By contrast, the script supplied in a later comment in this thread works as expected (no brackets around math environments).

Windows 10, pandoc 3.1.8, pandoc-crossref 0.3.17.0

okanakov avatar Dec 13 '23 17:12 okanakov

Tried converting latex to docx in a single invocation of pandoc, as suggested by @lierdakil above, but unsuccessfully. The command line is

pandoc mwe.tex -f latex --lua-filter texref.lua --filter pandoc-crossref -M crossrefYaml=crossref_setup.yml -t docx+native_numbering -o crossref.docx

where the filter script texref.lua is taken from a post above instead of @lierdakil's original version, for the reasons I reported above, mwe.tex is the same as in my post above, and crossref_setup.yml is as follows:

numberSections: true        # include section numbers in headings
sectionsDepth: 2            # headings up to this level get numbered
tableEqns: true             # use tables to align equations and their numbers
autoEqnLabels: true         # number all equations
eqnPrefixTemplate: $$i$$    # references consist of numbers only (no prefix)
secPrefixTemplate: $$i$$
figPrefixTemplate: $$i$$
tblPrefixTemplate: $$i$$

The command produces the following messages, and the references to equations do not get resolved in the output docx file.

Undefined cross-reference: eq:e2
Undefined cross-reference: eq:e1

Apparently, for the lua filter to work properly, it is essential that markdown is used as an intermediate format. This can be achieved in a single command line containing two pandoc invocations chained via the OS shell pipe, thus avoiding the creation of an intermediate markdown file. The following command line WORKS AS EXPECTED, producing a correct docx output:

pandoc mwe.tex -f latex --lua-filter texref.lua -t markdown | pandoc -f markdown --filter pandoc-crossref -M crossrefYaml=crossref_setup.yml -t docx+native_numbering -o crossref.docx

As a side note: if markdown in the command line above is replaced by native, the command fails in exactly the same way as the first command line in this post (i.e. the single invocation of pandoc). This is to be expected, because the two seem to be exactly equivalent in terms of the actual processing workflow performed.
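
If the shell pipe is inconvenient (for instance when calling pandoc from a script), the same two-step chain can be driven from Python. This is only a sketch under the assumption that pandoc and pandoc-crossref are on the PATH; the function names build_commands and convert are made up, and the arguments simply mirror the working command line above.

```python
import subprocess


def build_commands(tex_file, lua_filter, crossref_yaml, docx_file):
    """Return the argument lists for the two chained pandoc invocations."""
    to_markdown = ["pandoc", tex_file, "-f", "latex",
                   "--lua-filter", lua_filter, "-t", "markdown"]
    to_docx = ["pandoc", "-f", "markdown", "--filter", "pandoc-crossref",
               "-M", "crossrefYaml=" + crossref_yaml,
               "-t", "docx+native_numbering", "-o", docx_file]
    return to_markdown, to_docx


def convert(tex_file, lua_filter, crossref_yaml, docx_file):
    """Run both steps, handing the intermediate markdown over in memory."""
    to_markdown, to_docx = build_commands(
        tex_file, lua_filter, crossref_yaml, docx_file)
    # Step 1: LaTeX -> markdown, with the Lua filter applied.
    markdown = subprocess.run(
        to_markdown, check=True, capture_output=True).stdout
    # Step 2: markdown -> docx, with pandoc-crossref applied.
    subprocess.run(to_docx, input=markdown, check=True)
```

Calling convert("mwe.tex", "texref.lua", "crossref_setup.yml", "crossref.docx") should then be equivalent to the piped command, without creating an intermediate file.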

Windows 10, pandoc 3.1.8, pandoc-crossref 0.3.17.0

okanakov avatar Dec 14 '23 11:12 okanakov

Notice that the Lua script suggested above replaces each \label{...} command in a math environment with an empty string, which may leave an empty line behind. This breaks the LaTeX code, because empty lines mean paragraph breaks and are forbidden in math mode. To prevent this, I suggest getting rid of all newline characters in a math environment before doing the label replacement. Apart from empty lines, I know of only one other special meaning of the newline character in LaTeX, which is to terminate a comment (starting with a percent sign). Therefore, I suggest performing two preliminary replacements in the filter script before the label replacement:

  1. replace all comments with an empty string;
  2. replace all newline characters with whitespace.
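
The effect of these two replacements, followed by the label extraction, can be tried out on a plain string. The Python below is only a model of the string handling, not a pandoc filter; the name clean_display_math and the sample body are assumptions. Like a simple pattern on the percent sign, it would also treat an escaped \% as the start of a comment.

```python
import re


def clean_display_math(tex):
    """Return (cleaned_tex, label) for the body of a display-math block."""
    tex = re.sub(r"%.*?\n", "", tex)  # 1. drop comments up to the newline
    tex = tex.replace("\n", " ")      # 2. turn remaining newlines into spaces
    labels = re.findall(r"\\label\{([^}]+)\}", tex)
    tex = re.sub(r"\\label\{[^}]+\}", "", tex)
    # If there are several labels, keep the last one.
    label = labels[-1] if labels else None
    return tex, label


body = "\\label{eq:e1} % a comment\n    \\gamma^2+\\theta^2=\\omega^2"
cleaned, label = clean_display_math(body)
print(cleaned.strip(), label)  # \gamma^2+\theta^2=\omega^2 eq:e1
```

The order matters: comments have to go first, because once newlines are replaced a comment would swallow the rest of the equation.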

Finally, my version of the lua filter script is as follows:

function Link(el)
  if el.attributes["reference-type"]=="ref" or el.attributes["reference-type"]=="eqref" then
    local citations = {}
    for cit in el.attributes["reference"]:gmatch('[^,]+') do
      -- cit = cit:gsub(":","-")        -- for quarto
      citations[#citations+1] = pandoc.Citation(cit, "NormalCitation")
    end
    return pandoc.Cite("", citations)
  end
end

function Math(el)
  if el.mathtype == "DisplayMath" then
    local label = nil
    -- remove comments (note: an escaped \% is also treated as a comment start)
    el.text = el.text:gsub("%%.-\n","")
    el.text = el.text:gsub('\n',' ')  -- replace line breaks with whitespace
    -- strip \label{...} from the math and remember its argument
    el.text = el.text:gsub("\\label{[^}]+}", function(w) label=w:sub(8,-2); return ""; end)
    local raw_text = '$$\n' .. el.text ..'\n$$' -- for quarto move $$ delimiter to new lines
    if label ~= nil then
      -- label = label:gsub(":","-")        -- for quarto
      raw_text = raw_text .. '{#' .. label .. '}'
    end
    return pandoc.RawInline('markdown', raw_text)
  end
end

okanakov avatar Dec 14 '23 15:12 okanakov