Automatic conversion to LaTeX source
For some research fields, like math or theoretical physics, one must submit the LaTeX source of the manuscript for publication in, e.g., APS journals. It would therefore be very convenient if Manubot could also have a LaTeX + BibTeX conversion mode for such cases. IIRC, pandoc supports conversion to TeX?
This concept makes sense for submitting to journals that prefer LaTeX over DOCX for submissions. It should be feasible because you are correct that pandoc supports conversion to TeX. We could add a BUILD_TEX option to the build script that generates the .tex output when requested, similar to the optional DOCX output.
https://github.com/manubot/manubot/issues/68 discussed some earlier attempts to generate tex. We would need to work on a stable way to get LaTeX working in the continuous integration environment or use a Docker environment for this step.
In the short term, we could also work on an example pandoc command to guide users who want to do this outside of the build script as a final step before journal submission.
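As a starting point for such a command, here is a minimal sketch rather than a tested recipe. It assumes pandoc 2.11 or newer (for the built-in --citeproc option), and the default file names output/manuscript.md and output/references.json are assumptions about what a Manubot build leaves behind. The function name manuscript_to_tex and the DRY_RUN switch are made up for illustration.

```shell
#!/bin/sh
# Sketch: convert a Manubot-processed Markdown file to standalone LaTeX.
# All default paths are assumptions; adjust them to your repository layout.

manuscript_to_tex() {
  input="${1:-output/manuscript.md}"
  bibliography="${2:-output/references.json}"
  output="${3:-output/manuscript.tex}"

  # Build the command in "$@" so that paths with spaces stay intact.
  set -- pandoc \
    --from=markdown \
    --to=latex \
    --standalone \
    --citeproc \
    --bibliography="$bibliography" \
    --output="$output" \
    "$input"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"   # show the command without running it
  else
    "$@"
  fi
}
```

Calling manuscript_to_tex with no arguments uses the assumed default paths. Because --citeproc resolves citations during conversion, the formatted reference list ends up inside the .tex file itself instead of in a separate .bib file.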
Perhaps Docker is the way to go. I see it is also used in the VS Code LaTeX extension.
There are two steps here I believe:
1. Convert the Markdown manuscript to a .tex file, for example by using pandoc --to=latex. Should we also output a .bib file with the reference metadata, or should this be included in the .tex?
2. Render the .tex file as a PDF. This is where using a Docker image probably makes sense.
It sounds like 1 is what is necessary for submission to journals, although 2 would be nice so the LaTeX compiler could detect errors and you could view the output PDF to make sure everything converted properly.
@slochower do you have an implementation of either of these steps? How much will users need to customize these steps? How does customization work / at what stage... isn't there some way to apply a template/style for a specific journal?
The big benefits of adding LaTeX support that I see are:
- journal submission via .tex.
- another route to create PDFs, using existing infrastructure for branded PDFs. This could help Manubot become the primary document generation system for journals which require stylized PDFs.
- enabling latexdiff to track changes between manuscript versions.
> @slochower do you have an implementation of either of these steps? How much will users need to customize these steps? How does customization work / at what stage... isn't there some way to apply a template/style for a specific journal?
Point 1 is (relatively) easy, as you say. We can just use pandoc. I used something like this (with a custom template file):
# Export a standalone .tex file when BUILD_LATEX is set to "true".
if [ "$BUILD_LATEX" = "true" ]; then
  echo "Exporting LATEX manuscript"
  pandoc \
    --from=markdown \
    --to=latex \
    --filter=pandoc-fignos \
    --filter=pandoc-eqnos \
    --filter=pandoc-tablenos \
    --filter=pandoc-img-glob \
    --bibliography="$BIBLIOGRAPHY_PATH" \
    --csl="$CSL_PATH" \
    --template=build/assets/nih4.tex \
    --metadata link-citations=true \
    --number-sections \
    --resource-path=.:content \
    --standalone \
    --output=output/manuscript.tex \
    "$INPUT_PATH"
fi
IIRC, --resource-path was necessary so that the image path embedded in manuscript.tex matched the image location in our folder structure.
Point 2, also as you point out, is a little more tricky. I implemented it this way:
# Build a PDF through xelatex when BUILD_PDF_VIA_LATEX is set to "true".
if [ "$BUILD_PDF_VIA_LATEX" = "true" ]; then
  echo "Exporting LATEX (PDF) manuscript"
  FONT="Helvetica"
  COLORLINKS="true"
  pandoc \
    --from=markdown \
    --filter=pandoc-eqnos \
    --filter=pandoc-tablenos \
    --filter=pandoc-img-glob \
    --filter=pandoc-chemfig \
    --filter=pandoc-fignos \
    --lua-filter=build/latex-color.lua \
    --bibliography="$BIBLIOGRAPHY_PATH" \
    --csl="$CSL_PATH" \
    --template=build/assets/nih4.tex \
    --metadata link-citations=true \
    --resource-path=.:content:../content \
    --pdf-engine=xelatex \
    --variable mainfont="${FONT}" \
    --variable sansfont="${FONT}" \
    --variable colorlinks="${COLORLINKS}" \
    --output=output/manuscript.pdf \
    "$INPUT_PATH"
fi
But I did not have this running via CI (only locally). Here I used pandoc-img-glob to move the images to a temporary directory with the .tex for compilation and changed the --resource-path accordingly. Getting something like this to work would probably require Docker or waiting a long time for an apt-get install texlive (or similar) to run on Travis. I used xelatex because I wanted the grant application to be in Helvetica, FWIW.
Regarding latexdiff, I implemented a quick-and-dirty solution that might be useful in the future. You can see it here.
Another benefit of supporting LaTeX could be enabling Manubot-based writing of documents that have precise formatting requirements, like grant applications or university dissertations. I haven't tested this so it's unclear to me how much the pandoc tex template helps with that or whether the final formatting steps would have to be manual after the content is finalized. Perhaps this is the same idea as "branded PDFs" above.
@agitter agreed, although I found the pandoc template system cumbersome. See this existing list: https://github.com/jgm/pandoc/wiki/User-contributed-templates. There are many $if$-$endif$ blocks.
@slochower for the BUILD_PDF_VIA_LATEX step, would it be possible to take the output .tex file from the earlier pandoc --to=latex command and pass it directly to the latex compiler? Is there any benefit to running pandoc twice? I was envisioning that once we had a .tex and .bib file, we would no longer need to use pandoc.
I think using the output of pandoc --to=latex should work, modulo the figure paths. I think when I did this earlier, I specified the figures in the current path, e.g., ![Caption.](figure.png) in the Markdown (as usual). If you convert this to LaTeX (without having pandoc make the PDF for you), I think you'll need to either symlink the figures to the .tex directory or vice versa.
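The symlink idea could look roughly like the sketch below. It is untested against a real Manubot repository: the directory names content/images and output, and the helper names link_figures and compile_tex, are assumptions for illustration. Only xelatex and its -interaction and -halt-on-error options come from the standard TeX tooling.

```shell
#!/bin/sh
# Sketch: compile a pandoc-generated .tex without running pandoc again.
# Directory names are assumptions; adjust them to your repository layout.

# Symlink every file in the figure directory next to the .tex file so that
# relative paths such as \includegraphics{figure.png} resolve.
link_figures() {
  figure_dir="$1"
  tex_dir="$2"
  # Resolve to an absolute path so the links work from inside tex_dir.
  abs_figure_dir="$(cd "$figure_dir" && pwd)"
  for figure in "$abs_figure_dir"/*; do
    [ -e "$figure" ] || continue   # skip when the directory is empty
    ln -sf "$figure" "$tex_dir/"
  done
}

# Run the LaTeX engine from inside the directory that holds the .tex file.
compile_tex() {
  tex_dir="$1"
  tex_file="$2"
  (cd "$tex_dir" && xelatex -interaction=nonstopmode -halt-on-error "$tex_file")
}

# Example with the assumed layout:
#   link_figures content/images output
#   compile_tex output manuscript.tex
```

A second compile_tex pass may be needed before cross-references settle, as is usual with LaTeX.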
Here's a useful resource on different ways to install LaTeX on Travis CI. It mentions tectonic, which seems to be a more user-friendly xelatex (although I don't have a good understanding of how all the LaTeX infrastructure fits together).
I've got the pandoc --to=pdf --pdf-engine=xelatex workflow to produce a PDF. However, I'd like to see what Pandoc does to generate the LaTeX, so that we can potentially replicate it in an output/latex directory that could contain a standalone LaTeX source. Pandoc creates a temporary directory as part of the LaTeX processing, but there is no builtin way to retain that directory (see https://github.com/jgm/pandoc/issues/2288).
Setting --pdf-engine-opt=-output-directory=output/latex did write some files, including the PDF, to output/latex/input.*, but then Pandoc failed with "Error producing PDF." Possibly related discussion at https://github.com/jgm/pandoc/issues/4721.
> Should we also output a .bib file with the reference metadata, or should this be included in the .tex?
Usually, from what I've seen, submission systems that can ingest .tex files require the bibliography to be included in the file, so I'd vote for the latter.
I'd also really like to have a LaTeX file as output, so I don't have to fiddle with pandoc myself.
I'm happy to do some more manual work with it (e.g. applying a template myself), but having a .tex file with bibliography would speed up the process greatly.
If manubot generates a .tex file somewhere, then https://github.com/xu-cheng/latex-action might be of help, which I've used to compile such a .tex to a PDF with GitHub Actions.
> having a .tex file with bibliography would speed up the process greatly.
@habi I propose the simplest possible LaTeX export in https://github.com/manubot/rootstock/pull/384. In https://github.com/manubot/rootstock/pull/256, I tried to get the LaTeX to compile and render as PDF, which proved challenging. But perhaps having a .tex file will help you to a sufficient extent. So please check out #384 and let us know whether it works for your application.
I'm sharing some notes from our Manubot manuscript that we exported to LaTeX for a conference submission. There are more details at https://github.com/greenelab/covid19-review/pull/943.
We customized a LaTeX template for the conference style. Like @slochower, we also found the templating system cumbersome. The Manubot metadata didn't perfectly fit the template expectations so we had a bidirectional process of modifying the template and the metadata. Getting the authors to show up correctly was the trickiest part. We created a new metadata.yaml file using a Python script, in part because we were already modifying metadata.yaml programmatically because this conference submission was one piece of a larger project.
It's very helpful that newer versions of pandoc can convert the CSL JSON file Manubot produces into a .bib file. I typically submit a .bib file to a conference or journal instead of embedding references in a .tex file. We added this conversion step to the build script. We used a regex to strip out the note fields from the .bib file. We also used custom pandoc settings in a new yaml file, including cite-method: natbib for the references.
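For readers who want to try the same conversion, here is a rough sketch of the two steps. It is not the script from the linked pull request. It assumes pandoc 2.11 or newer (which reads csljson and writes bibtex), the function names are made up, and the sed pattern only handles note fields that fit on a single line.

```shell
#!/bin/sh
# Sketch: turn Manubot's CSL JSON references into a .bib file without notes.
# File names passed to these functions are up to the caller.

# Convert CSL JSON to BibTeX with pandoc (assumes pandoc >= 2.11).
csl_json_to_bib() {
  pandoc --from=csljson --to=bibtex --output="$2" "$1"
}

# Drop single-line "note = {...}" fields; reads stdin, writes stdout.
# Assumption: a note never spans several lines.
strip_bib_notes() {
  sed '/^[[:space:]]*note[[:space:]]*=/d'
}

# Example with assumed paths:
#   csl_json_to_bib output/references.json output/references-raw.bib
#   strip_bib_notes < output/references-raw.bib > output/references.bib
```

Removing a note that was the last field of an entry leaves a trailing comma on the previous field, which BibTeX accepts.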
We didn't try to build a PDF with continuous integration. We used Manubot to get 95% of the way to submission automatically and then fine-tuned LaTeX issues in Overleaf before submitting.
Here's an alternative way, though possibly a buggier one:
If you use the markdown package for LuaLaTeX (which is easily enabled in Overleaf), you can operate in "dual" mode by having an alternative document.tex that uses \markdownInput{../content/10.introduction.md} etc.
There will be some fights over things like figures and internal references.
See https://github.com/stain/ro-crate-paper/blob/master/latex/ro-crate.tex and workarounds for manubot in https://github.com/stain/ro-crate-paper/blob/master/build/build.sh#L14
This allowed us to edit the manuscript in Overleaf while Manubot still rendered it, using the Overleaf-GitHub sync.
You have been warned - this approach will let you conform to the journal style - but will also come with lots of new caveats.