pandoc modifies tex environments

Open mtomassoli opened this issue 8 years ago • 13 comments

Consider this minimal tex file:

\documentclass[english]{article}
\begin{document}

\begin{align}
    x &= 3\\
    y &= 2
\end{align}

\end{document}

pandoc test.tex -o test.md produces a test.md file with the following content:

$$\begin{aligned}
    x &= 3\\
    y &= 2\end{aligned}$$

If I use a filter, the filter receives aligned rather than align so it's already too late.

The only workaround I found is to wrap the env in $$ but to do that programmatically I'd need to properly parse the tex file.

Is this a bug or what?

I forgot:

pandoc 1.19.2.1
Compiled with pandoc-types 1.17.0.4, texmath 0.9, skylighting 0.1.1.4

on Windows 10.

mtomassoli avatar Nov 28 '17 01:11 mtomassoli

This is intentional, actually, but maybe should be reconsidered.

We have a choice here between parsing the align environment as raw latex or as a math element. If we parse it as math, then we need to switch align to aligned (which can occur within display math).

The advantage of parsing it as math is that we get equations in all the output formats for which pandoc supports math (including e.g. docx). If we parsed it as raw latex, then the environment would simply not appear in docx output.
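
To make the rewrite described above concrete, here is a minimal string-level sketch in Python. It is not pandoc's implementation (the reader is written in Haskell); the names DOWNSHIFT and downshift are hypothetical, and the table covers only the align case discussed in this thread.

```python
import re

# Hypothetical table: outer environment -> math-mode equivalent.
# Only the case from this thread is listed.
DOWNSHIFT = {"align": "aligned"}

# Match \begin{env} ... \end{env} for the listed environments only.
ENV_RE = re.compile(
    r"\\begin\{(%s)\}(.*?)\\end\{\1\}" % "|".join(map(re.escape, DOWNSHIFT)),
    re.DOTALL,
)


def downshift(tex: str) -> str:
    """Rewrite each listed environment as display math around its math-mode form."""

    def repl(match):
        inner = DOWNSHIFT[match.group(1)]
        body = match.group(2).rstrip()  # drop the newline before \end{...}
        return "$$\\begin{%s}%s\\end{%s}$$" % (inner, body, inner)

    return ENV_RE.sub(repl, tex)
```

Applied to the align block from the original report, this yields the same text as the test.md shown there. After this step the name align is gone, which is why a filter has nothing left to inspect.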

jgm avatar Nov 28 '17 01:11 jgm

One good solution might be to modify mathEnvWith in the LaTeX reader so that, if the raw_tex extension is enabled, these environments are parsed as raw latex; otherwise, we do as before and parse as math with necessary modifications.

Since raw_tex is enabled by default in pandoc markdown, this might mean that some existing documents would break on conversion to Word, so that's a potential worry.

jgm avatar Nov 28 '17 02:11 jgm

Couldn't you enable this behavior just for mathjax?

mtomassoli avatar Nov 28 '17 02:11 mtomassoli

Remember that --mathjax affects writers. The issue here is how the input is parsed. Readers and writers are conceptually separate in pandoc and don't affect each other.

jgm avatar Nov 28 '17 05:11 jgm

Would it be possible to preserve information about the original math environments as some kind of "metadata" so that filters could recover them? That way you wouldn't break anything.

mtomassoli avatar Nov 28 '17 11:11 mtomassoli

One good solution might be to modify mathEnvWith in the LaTeX reader so that, if the raw_tex extension is enabled, these environments are parsed as raw latex; otherwise, we do as before and parse as math with necessary modifications.

Since raw_tex is enabled by default in pandoc markdown, this might mean that some existing documents would break on conversion to Word, so that's a potential worry.

I was wondering if there were any updates apropos your comment above. It seems that, as of pandoc 2.10, the raw_tex extension does less for the LaTeX reader than for the markdown reader. For example, given the simple LaTeX snippet

% emc2.tex
\begin{equation}
  E=mc^2
\end{equation}

I see that

pandoc --mathjax --from latex+raw_tex emc2.tex --to html

outputs

<p><span class="math display">\[E=mc^2\]</span></p>

This doesn't really integrate well with MathJax, e.g. if one wishes to use MathJax to process numbered equation references directly.

For context, I was hoping to use pandoc to produce a static site and/or a personal journal where the content documents were simple MathJax-compatible LaTeX instead of markdown (to leverage tex editor plugins), but the modification of basic LaTeX environments by the LaTeX reader seems to be a blocker for now. Is there any way around this for now that doesn't involve modifying the pandoc codebase itself?
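
One route that stays outside the pandoc codebase is a JSON filter. The sketch below, in Python with only the standard library, replaces every display Math element with a raw HTML inline that contains an equation environment. It rests on assumptions: MathJax is assumed to be configured to process bare environments and to number equations, and the function names are hypothetical. The filter cannot tell whether a formula was originally an equation environment or plain display math, so it wraps all of them.

```python
import html
import json
import sys


def walk(node, fn):
    """Apply fn to every dict in a pandoc JSON AST, children first."""
    if isinstance(node, list):
        return [walk(child, fn) for child in node]
    if isinstance(node, dict):
        node = {key: walk(value, fn) for key, value in node.items()}
        return fn(node)
    return node


def display_math_to_equation(node):
    """Turn display Math into a raw HTML inline holding an equation environment."""
    if node.get("t") == "Math" and node["c"][0]["t"] == "DisplayMath":
        tex = "\\begin{equation}%s\\end{equation}" % node["c"][1]
        # Escape &, < and > so the TeX survives as HTML text content.
        return {"t": "RawInline", "c": ["html", html.escape(tex, quote=False)]}
    return node


def main():
    """Read a pandoc JSON document on stdin, write the filtered one to stdout."""
    json.dump(walk(json.load(sys.stdin), display_math_to_equation), sys.stdout)

# When saving this as a filter script, end the file with a call to main().
```

Saved under a hypothetical name such as eqfilter.py, it could be passed to pandoc with --filter when converting emc2.tex to HTML. Inline math is left untouched.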

shawnohare avatar Oct 15 '20 01:10 shawnohare

Indeed, an equation environment is parsed as a Math element (rather than raw tex) even if raw_tex is set. We could try changing that, but it may have some unanticipated consequences.

jgm avatar Oct 15 '20 17:10 jgm

As a workaround, you could try using a custom environment (not equation). When compiling with latex you could simply define this as equivalent to equation.

jgm avatar Oct 15 '20 17:10 jgm

Would it be possible to move the special handling of align environments from the reader to the writer? TeXMath seems to parse the align environment ok. Are there other issues that would make such a change problematic?

tarleb avatar Jun 15 '22 08:06 tarleb

Would it be possible to move the special handling of align environments from the reader to the writer? TeXMath seems to parse the align environment ok. Are there other issues that would make such a change problematic?

That's an interesting question. The general expectation is that the contents of a Math element should be something that is valid in math mode in LaTeX (which the align environment is not).

However, if texmath can handle all of the special math environments we handle in the reader (not just align), then a case could be made that we don't need to enforce this expectation. The real question then would be about formats where we pass through the LaTeX math unchanged (as opposed to converting with texmath). One of those is LaTeX itself, and if that were the only one, we could move this code to the writer. But there are others. The one I'm thinking of at the moment is HTML. Oddly, though, I think mathjax actually does allow align environments inside math contexts - even though they're NOT allowed by LaTeX itself. So maybe it could work. There are other formats to consider too -- org, maybe rst? I'd be reluctant to make such a change without a lot of further research.

jgm avatar Jun 16 '22 03:06 jgm

@tarleb It would be good to hear about why you propose this. If it's because you'd like a writer or filter to know whether align or aligned was used in the original, we could perhaps address that by having the reader add a containing span with an attribute when an environment like align is downshifted to its math-mode equivalent.
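
As an illustration of how a filter might consume such a span, here is a minimal Python sketch. Everything about the wrapper is hypothetical: pandoc does not emit it today, and the attribute name original-env is invented for this example. The function takes a Span node from pandoc's JSON AST and, if it carries the attribute and wraps a single Math element, rebuilds the original environment as raw LaTeX.

```python
import re

# Hypothetical attribute name; pandoc does not currently emit such a span.
ENV_KEY = "original-env"
# Outer environment -> math-mode equivalent, limited to the case in this thread.
DOWNSHIFT = {"align": "aligned"}


def restore_env(node):
    """Rebuild the original environment from a Span marked with ENV_KEY."""
    if node.get("t") != "Span":
        return node
    (_ident, _classes, keyvals), inlines = node["c"]
    env = dict(keyvals).get(ENV_KEY)
    if env is None or len(inlines) != 1 or inlines[0].get("t") != "Math":
        return node
    tex = inlines[0]["c"][1]
    inner = DOWNSHIFT.get(env)
    if inner is not None:
        # Strip the math-mode wrapper that the downshift added.
        pattern = r"\\begin\{%s\}(.*)\\end\{%s\}" % (re.escape(inner), re.escape(inner))
        match = re.fullmatch(pattern, tex.strip(), re.DOTALL)
        if match:
            tex = match.group(1)
    return {"t": "RawInline", "c": ["latex", "\\begin{%s}%s\\end{%s}" % (env, tex, env)]}
```

A real filter would apply restore_env to every node while walking the document tree. Spans without the attribute pass through unchanged, so documents that never used align are unaffected.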

jgm avatar Jun 16 '22 17:06 jgm

This came up in a discussion about editor support for Quarto's Markdown. More specifically, the question was whether math should always be a dollar-delimited entity, or whether it can make more sense to use raw LaTeX for some cases. The respective issue is linked above.

The problems that came up could be solved with a filter. But there are issues, like #8122 and this one, that make me wonder whether a more fundamental change might be a better long-term solution. It might also make it easier to add support for align to Org (#6703).

I don't understand all the constraints well enough to have a fully formed opinion; it's more that I'm thinking out loud.

tarleb avatar Jun 16 '22 20:06 tarleb

The align environment is fundamental for any scientific text; I am very surprised this issue is still open after so many years. Any place to start looking, @jgm?

epignatelli avatar Aug 26 '22 09:08 epignatelli

Would it be possible to preserve information about the original math environments as some kind of "metadata" so that filters could recover them? That way you wouldn't break anything.

The solution mentioned above seems like a good one. Enabling attributes on the Math data type would make it possible to tell whether something was an align environment, and would also allow for equation labels and numbering (cf. this pull request jgm/pandoc-types#97). I'm assuming the downside is that it requires an API change?

ajdobner avatar Jun 22 '23 02:06 ajdobner