Reproducible Markdown to PDF conversion
I see that running the pandoc gfm converter multiple times over the same input produces a new binary PDF every time. The PDF probably contains the date when the document was generated. However, I'd like to avoid generating a new PDF if the source doesn't change.
The use case is that I maintain my docs in a repository, and I don't want to commit PDFs just because they were regenerated as part of a total rebuild.
This isn't really Pandoc's problem. Pandoc itself doesn't write PDFs at all, it outsources that to other engines and virtually no PDF engines produce reproducible PDF builds. Fixing that would be up to the engines, not Pandoc.
I use Pandoc in workflows for a book publishing company and ran into this as well. I don't care about commits because I'm not keeping generated binary artifacts in source repositories (and you probably shouldn't either, there are other ways to release artifacts), but blindly regenerating them does take lots of CPU time. The solution is to use a build system that keeps track of what source files are used to generate what products and how to update them if something changes. My solution uses GNU Make for this, but there are lots of less esoteric build systems as well that accomplish the same thing.
I don't want to introduce a stateful build system, because I want to run a CI pipeline that checks that all version-controlled PDFs are up to date. The CI does everything from scratch.
If I am using --pdf-engine xelatex and the engine already supports reproducible builds, is there a way to pass the required parameters into it?
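A from-scratch check of this kind can be sketched as a byte comparison between the committed PDF and a fresh rebuild. This is only a sketch: the function name check_pdf_up_to_date and any paths passed to it are hypothetical, and the comparison is only meaningful once the build itself is byte-for-byte reproducible.

```shell
#!/bin/sh
# Sketch: fail CI when a committed PDF no longer matches a fresh rebuild.
# Assumption: the PDF build is byte-for-byte reproducible.

# check_pdf_up_to_date COMMITTED REBUILT
# Returns 0 when both files are byte-identical, 1 otherwise
# (including when either file cannot be read).
check_pdf_up_to_date() {
  committed="$1"
  rebuilt="$2"
  if cmp -s "$committed" "$rebuilt"; then
    return 0
  fi
  echo "stale or unreadable: $committed differs from fresh build $rebuilt" >&2
  return 1
}
```

In CI this might be called once per document after rebuilding into a scratch directory, with the job failing on the first non-zero return.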
I am looking at the https://tex.stackexchange.com/questions/229605/reproducible-latex-builds-compile-to-a-file-which-always-hashes-to-the-same-va right now. That SOURCE_DATE_EPOCH may do the trick.
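For reference, SOURCE_DATE_EPOCH is a Unix timestamp (seconds since 1970-01-01 UTC) that supporting tools use in place of the current time. A minimal sketch of one way to choose it, assuming the sources live in git: take the committer time of the last commit, so the value only changes when the repository does. The helper name pick_source_date_epoch is hypothetical, and using the commit time is a convention, not something pandoc does itself.

```shell
#!/bin/sh
# Sketch: pick a stable SOURCE_DATE_EPOCH (seconds since the Unix epoch).
# Assumption: the last commit time is an acceptable "build date".

# pick_source_date_epoch [FALLBACK]
# Prints the committer timestamp of HEAD when run inside a git work tree,
# otherwise prints FALLBACK (default 0).
pick_source_date_epoch() {
  fallback="${1:-0}"
  if ts=$(git log -1 --format=%ct 2>/dev/null) && [ -n "$ts" ]; then
    printf '%s\n' "$ts"
  else
    printf '%s\n' "$fallback"
  fi
}
```

A build script could then run `SOURCE_DATE_EPOCH=$(pick_source_date_epoch)` followed by `export SOURCE_DATE_EPOCH` before invoking pandoc.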
One last tip: while PDF files are usually not binary reproducible, they are often visually deterministic. You can use a PDF diff tool such as diff-pdf to check whether anything visually changed between PDF rendering passes. Armed with that knowledge you can choose to discard the newer one as unchanged if you like. That is not quite as good as knowing when you need to rebuild in the first place, since the build time is still spent, but it is a useful trick.
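The discard-if-unchanged idea can be sketched as below. The function name keep_old_if_visually_same and the PDF_DIFF override are hypothetical; the sketch relies on diff-pdf exiting with status 0 when it finds no differences and non-zero otherwise, and it treats a failed comparison as "changed".

```shell
#!/bin/sh
# Sketch: keep the old PDF when a fresh render looks identical.
# Assumption: $PDF_DIFF exits 0 when the two PDFs match, non-zero otherwise.
PDF_DIFF="${PDF_DIFF:-diff-pdf}"

# keep_old_if_visually_same OLD NEW
# Discards NEW when the comparison reports no differences,
# otherwise replaces OLD with NEW.
keep_old_if_visually_same() {
  old="$1"
  new="$2"
  if "$PDF_DIFF" "$old" "$new" >/dev/null 2>&1; then
    rm -f "$new"
    echo "unchanged"
  else
    mv -f "$new" "$old"
    echo "updated"
  fi
}
```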
I don't want to introduce a stateful build system, because I want to run a CI pipeline that checks that all version-controlled PDFs are up to date. The CI does everything from scratch.
You don't need "a stateful build system" for this (at least not in the sense you mean), but you aren't working totally from scratch either, because you have the last generated artifact plus the current sources. Since you mentioned these being part of a VCS, you also have the history and can determine whether any of the sources have been updated more recently than the PDF build.
The https://diffoscope.org/ shows that only the date is different. Going to try that.
$ podman run --rm -t -w $(pwd) -v $(pwd):$(pwd):Z,ro \
registry.salsa.debian.org/reproducible-builds/diffoscope 001cv.pdf anatoli.cv.pdf --text-color=always
--- 001cv.pdf
+++ anatoli.cv.pdf
│ --- 001cv.pdf
├── +++ anatoli.cv.pdf
│┄ Document info
│ @@ -1,3 +1,3 @@
│ -CreationDate: "D:20200716112355+03'00'"
│ +CreationDate: "D:20200716120342+03'00'"
│ Creator: 'LaTeX via pandoc'
│ Producer: 'xdvipdfmx (20190225)'
@alerque repository timestamps don't prove that the PDF was generated from the sources that were committed. Only build state can determine that, such as file hashes in SCons etc.
it outsources that to other engines and virtually no PDF engines produce reproducible PDF builds
FYI, I know that ReportLab, for one, provides this as an option. You can see the option in the code here:
https://github.com/MrBitBucket/reportlab-mirror/blob/67281aea11a81a7768c386d353334e328840b129/src/reportlab/rl_settings.py#L83
Setting SOURCE_DATE_EPOCH helped for xelatex.
export SOURCE_DATE_EPOCH=2461633620
But it is not enough.
--- 001cv.pdf
+++ anatoli.cv.pdf
├── dumppdf -adt {}
│ @@ -357,16 +357,16 @@
│ <value><literal>XRef</literal></value>
│ <key>Root</key>
│ <value><ref id="1" /></value>
│ <key>Info</key>
│ <value><ref id="2" /></value>
│ <key>ID</key>
│ <value><list size="2">
│ -<string size="16">ž$ð}‡FïsV]	Ñ2·©</string>
│ -<string size="16">ž$ð}‡FïsV]	Ñ2·©</string>
│ +<string size="16">'M%óôÚçÃ1¡Ôõ¦</string>
│ +<string size="16">'M%óôÚçÃ1¡Ôõ¦</string>
│ </list></value>
...
is there a way to pass the required parameters into it?
yes, see https://pandoc.org/MANUAL.html#option--pdf-engine-opt
Hey @mb21 if I am reading that last diff posted by @abitrolly correctly then my initial comment was wrong and this is at least partially Pandoc's problem. That looks like cross reference IDs are changing between successive runs. If that's actually the case (and the testing isn't flawed) that's something that should be fixed. XRef content should be deterministic. I'd keep this open at least until it's determined whether Pandoc's output is deterministic. How to make PDF engines follow suit is another story of course.
I managed to do this. It is not very user friendly, because it relies on the shell, needs an external file, and makes the file specific to the used engine. There is no command line option to easily wrap these things.
The working recipe for the xelatex engine is the following.
- [x] Set the environment variable SOURCE_DATE_EPOCH to some fixed value in build scripts, like SOURCE_DATE_EPOCH=2461633620
- [x] Create a separate tex file with the xelatex option for reproducible builds (I named mine anatoli.head.tex)
\special {pdf:trailerid [
<00112233445566778899aabbccddeeff>
<00112233445566778899aabbccddeeff>
]}
- [x] Reference the variable, the engine and the file in the pandoc command line. Mine is a bash-specific command.
SOURCE_DATE_EPOCH=2461633620 pandoc \
--from markdown_github+yaml_metadata_block \
--pdf-engine xelatex \
--include-in-header anatoli.head.tex \
anatoli.cover.md -o anatoli.cover.pdf
So the \special{pdf:trailerid [ ... ]} thing solved the XRef related string changing?
I suppose Pandoc could be extended with a flag that then translates all the things that need to happen and passes various engines their respective arguments.
makes the file specific to the used engine
Of course. The internals of the PDF file format are such that you're never going to get different PDF engines to match how they actually put the file together. Any reproducibility will only be reproducible on the same engine (and likely with a lot of other factors being involved as well, such as same-versions of system libraries such as the text shaper, the same versions of font files, the same versions of templates or classes, and so on).
So the \special{pdf:trailerid [ ... ]} thing solved the XRef related string changing?
Only for xelatex as described in https://tex.stackexchange.com/questions/229605/reproducible-latex-builds-compile-to-a-file-which-always-hashes-to-the-same-va/313605#313605
It's great to have these instructions and perhaps we should include them in the FAQ on the website. I'm not sure how pandoc could be modified to make this easier, though.
I suppose we could add a variable pdf_trailer_id which, if set, adds the relevant bits in the default latex template. Maybe also pdf_creation_date and pdf_modification_date? But this only works for xelatex?
I tested only with xelatex. The TeX Stack Exchange answer lists other engines as well. I would expect a flag like --reproducible, which could ideally come with some way to get debug output describing what pandoc does.
Are there other tools or implementations that allow for the independent verification of the reproducible PDFs that are generated by pandoc?
@jgm, what are your thoughts on making reproducible PDFs the default option for pandoc? What would block the switch to reproducible PDFs as the default option?
See above. At best we can do this with xelatex. So it can't be a default. We can try to make it easier.
Also, many users will want the PDFs to reflect the actual production date and will not want to specify an identifier manually.
See above. At best we can do this with xelatex.
That doesn't seem correct, if you're referring generally to making PDFs reproducibly using other engines - I believe it's quite possible to get reproducible builds with pdflatex, too, and the page at https://tex.stackexchange.com/questions/229605/reproducible-latex-builds-compile-to-a-file-which-always-hashes-to-the-same-va/313605#313605, already linked above here, gives advice on how to do so. I believe it's just the \special{pdf:trailerid [ ... ]} command that is xelatex-specific. (Apologies if I've misunderstood the previous comment.)
I actually use an approach based on the pdfprivacy package to get reproducible PDFs, but I believe using the SOURCE_DATE_EPOCH environment variable will work as well.
(For the pdfprivacy approach, I just use a command invocation like pandoc -V header-includes='\usepackage[nopdftrailerid=true]{pdfprivacy}' -o out.pdf -t latex my-input-file.md.)
I can't recall now why I said that it would only work with xelatex. It seems that we could include material in the default template that allows reproducible builds with all engines. This wouldn't be the default; it would have to be triggered by a variable, perhaps pdf_trailer_id, pdf_modification_date, pdf_creation_date as suggested above?
Yes, I think that is so - and those all sound like reasonable variable names. (I assume they would all be capable of taking an empty string, if desired.)
I tested the workaround for lualatex (taken from the excellent TeX Stack Exchange answer above) by adding the following YAML to my pandoc defaults:
header-includes:
- \pdfvariable suppressoptionalinfo \numexpr32+64+512\relax
The generated PDF checksums were then identical on independent runs.
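A check like that can be scripted as a sketch: run the same build command twice and compare the SHA-256 digest of the output after each run. The function name reproducible_check is hypothetical, and sha256sum is assumed to be available (it is part of GNU coreutils).

```shell
#!/bin/sh
# Sketch: verify that two independent runs produce an identical file.
# Assumption: sha256sum (GNU coreutils) is on PATH.

# reproducible_check OUTPUT CMD [ARGS...]
# Runs CMD twice, prints the digest of OUTPUT after each run, and
# returns 0 only if both digests match (2 if a build fails).
reproducible_check() {
  out="$1"
  shift
  "$@" || return 2
  first=$(sha256sum "$out" | cut -d' ' -f1)
  sleep 1   # let the clock advance so embedded timestamps would differ
  "$@" || return 2
  second=$(sha256sum "$out" | cut -d' ' -f1)
  echo "run 1: $first"
  echo "run 2: $second"
  [ -n "$first" ] && [ "$first" = "$second" ]
}
```

The first argument is the output file, and everything after it is the full pandoc command line that writes to that file.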
So let's summarize what is needed for each engine here. Please let me know if things need adjusting here:
xelatex
Set SOURCE_DATE_EPOCH to the number of seconds since 1970-01-01 00:00:00 UTC. Then add:
\special {pdf:trailerid [
<00112233445566778899aabbccddeeff>
<00112233445566778899aabbccddeeff>
]}
Q: Any way to do this without setting source date and trailer id explicitly? (I.e., can they just be omitted entirely?)
lualatex
\pdfvariable suppressoptionalinfo \numexpr32+64+512\relax
Q: Will this suppress the date? What if we want to set it via SOURCE_DATE_EPOCH?
pdflatex
\pdfinfoomitdate=1
\pdftrailerid{}
Q: What if we want to set the date via SOURCE_DATE_EPOCH?
or
\usepackage[nopdftrailerid=true]{pdfprivacy}
Q: What is the difference between these approaches?
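To make the per-engine summary above easier to try out, here is a small sketch that prints the header lines for a given engine, suitable for writing to a file and passing to --include-in-header. The function name repro_header is hypothetical, the trailer ID is the placeholder value used earlier in this thread, and the open questions above still apply (for xelatex, SOURCE_DATE_EPOCH must also be set).

```shell
#!/bin/sh
# Sketch: emit the reproducibility header lines collected in this thread.
# repro_header ENGINE   (ENGINE is xelatex, lualatex or pdflatex)
# Returns 1 for engines with no known recipe.
repro_header() {
  case "$1" in
    xelatex)
      # Fixed trailer ID; placeholder value from the thread.
      cat <<'EOF'
\special {pdf:trailerid [
<00112233445566778899aabbccddeeff>
<00112233445566778899aabbccddeeff>
]}
EOF
      ;;
    lualatex)
      cat <<'EOF'
\pdfvariable suppressoptionalinfo \numexpr32+64+512\relax
EOF
      ;;
    pdflatex)
      cat <<'EOF'
\pdfinfoomitdate=1
\pdftrailerid{}
EOF
      ;;
    *)
      echo "repro_header: no known recipe for engine '$1'" >&2
      return 1
      ;;
  esac
}
```

For example, `repro_header lualatex > repro.head.tex` writes a file that can be passed to pandoc together with --pdf-engine lualatex.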
Aside: I did a quick test with lualatex (version 1.15.0, TeX Live 2022). The suppressoptionalinfo setting works unless a source file contains \DTMcurrenttime (and, I imagine, other time-sensitive fields). This might be worth noting/documenting.
My test PDF containing \DTMcurrenttime became reproducible again after setting SOURCE_DATE_EPOCH to the same value for each run.
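A quick way to spot such fields before building is to search the sources for macros that expand to the current date or time. This is only a sketch: the function name find_time_macros is hypothetical and the macro list is illustrative, not exhaustive (\today is core LaTeX; \DTMnow, \DTMtoday and \DTMcurrenttime come from the datetime2 package).

```shell
#!/bin/sh
# Sketch: list source lines that use time-dependent LaTeX macros.
# The macro list is illustrative and not exhaustive.

# find_time_macros FILE...
# Prints matching lines with line numbers; returns 0 if any were found,
# 1 if none were found.
find_time_macros() {
  grep -nE '\\(today|DTMnow|DTMtoday|DTMcurrenttime)' "$@"
}
```

If this finds anything, setting SOURCE_DATE_EPOCH as described above is probably needed in addition to the header lines.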
@jgm any update on including this as a pandoc feature? The --reproducible flag that someone mentioned would be so convenient.
If people would answer the questions I pose above, we could make some progress. I don't have enough data yet.