LaTeXML
[WiP] Support for dumping plain.tex, latex.ltx
This PR provides support for reading and dumping TeX format files (currently only plain.tex and latex.ltx) and then using those dumps for processing. This can be much faster for recent (2023+) latex.ltx releases, particularly when latex3 (expl3) code is used. It also makes available all the format's internal macros, which may improve processing raw style files (as in --includestyles).
The basic strategy is to split the current format pools (TeX.pool.ltxml, LaTeX.pool.ltxml) into three components: bootstrap, main & constructs. The bootstrap component contains the special LaTeXML bindings needed to process the raw format file, such as register allocation. The constructs component contains LaTeXML bindings for all commands that imply non-trivial semantics or document structure, particularly for XML. The main component is simply what's left of the original format pools.
The dump is then created by reading in the format file and writing out all data & definitions that have changed. To run, we simply load all three components in order, using either the original main or the dump file for the middle component.
Managing the dump files still needs some thought! Currently, to create & use dumps, you need to do:
latexml --init=plain.tex
latexml --init=latex.ltx
which will create dump files for plain & latex from whatever TeX system you have installed, and place them in /tmp/ (!). With 2024 latex, this may take 5-10 minutes! To use those dump files, you must set the environment variable LATEXML_USEDUMP=1 (e.g. in front of latexml, make test, or whatever).
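As a rough illustration of that workflow, here is a small dry-run sketch. It only prints the commands described above rather than running them, so it is safe to try without LaTeXML installed. The helper names dump_commands and with_dumps are hypothetical and not part of LaTeXML; only --init and LATEXML_USEDUMP come from this PR.

```shell
#!/bin/sh
# Dry-run sketch: prints commands instead of executing them.

# The two formats this PR currently supports.
FORMATS="plain.tex latex.ltx"

# Emit one dump-creation command per format.
# (dump_commands is a hypothetical helper name.)
dump_commands() {
  for fmt in $FORMATS; do
    printf 'latexml --init=%s\n' "$fmt"
  done
}

# Prefix any command line with the opt-in environment variable.
# (with_dumps is a hypothetical helper name.)
with_dumps() {
  printf 'LATEXML_USEDUMP=1 %s\n' "$*"
}

dump_commands
with_dumps make test
```

To actually run the printed commands, pipe the output to sh. Setting the variable per command, rather than exporting it, keeps the dumps opt-in for everything else in the session.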
Note that if your TeX installation is from 2022 or earlier, you'll likely get 2 test errors (lettercase, textcase), since those tests now conform to latex's more recent case-changing behavior.
Hmm, a Windows problem. Maybe I need to change the hash key to discard the cache?
This is getting close to being testable. latexml now respects the --destination option along with --init. Even more convenient, use make formats (between make and make test). See more details in the updated initial comment.
Yeah, might as well. At least the scary stuff is still optional.
Recording a nice trick for workflows using the cpanminus installation approach. One can opt in to the formats target by adding it as a build argument. Because it depends on "all", it will also run make beforehand. So this is all it takes:
cd LaTeXML;
cpanm --build-args formats .
Does the current dump include installation-dependent details beyond the LaTeX version itself, like hyphenation patterns and default paper size? I was wondering if dumps can be distributed separately.
@xworld21 I think not yet. I don't even think we have figured out how far we want to take the approach - the current "precompilation" started off to tackle two big problems: 1) a performance regression with expl3 and 2) getting more coverage from the updated latex kernels.
Turning it into cls-like profiles may be out of scope for now. But there is a related discussion we keep mulling over (no plan yet) about raw interpretation of .cls files. We may want a new issue for more discussion?
I'm having trouble with this. I run perl Makefile.PL, make, and then make formats. I think the --init=plain.tex step completes successfully (Conversion complete: 4 warnings), but the --init=latex.ltx step has
Loading /usr/local/texlive/2025/texmf-dist/tex/latex/base/latex.ltx
LaTeX must be made using an initex with no format preloaded:
!! No syntax for the current directory could be found
\@currdir set to:
Assuming \openin and \input
Defining generic filename parser.
Error:undefined:\@@ The token T_CS[\@@] is not defined. at latex.ltx; line 8624 col 23
before giving 100 more errors and the fatal error. Am I missing a step?
@teepeemm Valid question. This was the last topic we discussed with Bruce before the unfortunate pause of work at NIST. He had identified a regression in make formats caused by the changes made in #2646. Likely the last commit that was healthy for this target is e64d7f5f5139b2a57cb0bc6585400b403c94c95c
I can also relay that Bruce was close to a patch for the regression just before the shutdown took hold. Sadly, almost everything "official" is paused in LaTeXML land until normal operations resume at NIST.