[WiP] Support for dumping plain.tex, latex.ltx

Open brucemiller opened this issue 8 months ago • 1 comments

This PR provides support for reading and dumping TeX format files (currently only plain.tex and latex.ltx) and then using those dumps for processing. This can be much faster for recent (2023+) latex.ltx releases, particularly when latex3 (expl3) code is used. It also makes available all the format's internal macros, which may improve processing raw style files (as in --includestyles).

The basic strategy is to split the current format pools (TeX.pool.ltxml, LaTeX.pool.ltxml) into three components: bootstrap, main, and constructs. The bootstrap component contains the special LaTeXML bindings needed to process the raw format file, such as register allocation; the constructs component contains LaTeXML bindings for all commands that imply non-trivial semantics or document structure, particularly for XML; the main component is simply what's left of the original format pools.

The dump is then created by reading in the format file and writing out all data & definitions that have changed. To run, we simply load all three components in order, using either the original main or the dump file for the middle component.

Managing the dump files still needs some thought! Currently, to create & use dumps, you need to do:

latexml --init=plain.tex
latexml --init=latex.ltx

which will create dump files for plain & latex from whatever TeX system you have installed, and place them in /tmp/ (!). With 2024 latex, this may take 5-10 minutes! To use those dump files, you must set the environment variable LATEXML_USEDUMP=1 (e.g. in front of latexml or make test or whatever).
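As a rough sketch of the "in front of latexml" usage, a small wrapper can set the variable for a single command only, so it does not leak into the rest of the shell session. The helper name with_dump and the file names paper.tex and paper.xml are made up for illustration; only LATEXML_USEDUMP=1 and the --destination option come from latexml itself.

```shell
# Hypothetical helper: run one command with dump loading enabled.
# The assignment prefix applies to that command only; the calling shell is unchanged.
with_dump() {
  LATEXML_USEDUMP=1 "$@"
}

# Example invocations (not executed here; paper.tex is a placeholder input):
#   with_dump latexml --destination=paper.xml paper.tex
#   with_dump make test
```

Scoping the variable per command makes it easy to compare a run with the dump against a run with the original main pool.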

Note that if your TeX installation is 2022 or before, you'll likely get 2 test errors (lettercase,textcase), since they now conform to latex's more recent case-changing behavior.

brucemiller avatar Mar 04 '25 23:03 brucemiller

Hmm, Windows problem. Maybe I need to change the hash key to discard the cache?

brucemiller avatar Mar 04 '25 23:03 brucemiller

This is getting close to being testable. latexml now respects the --destination option along with --init. Even more convenient, use make formats (between make and make test). See more details in the updated initial comment.
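For reference, the build order described here might look like the sketch below when run from a LaTeXML checkout. The run helper and its DRY_RUN switch are assumptions added for this sketch (they print the steps instead of executing them); the make targets and LATEXML_USEDUMP are the ones named in this thread.

```shell
# Hypothetical helper: print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

build_with_formats() {
  run perl Makefile.PL || return 1
  run make || return 1
  # Creates the plain.tex and latex.ltx dumps (this can take several minutes).
  run make formats || return 1
  # Per the initial comment, the dumps are only used when LATEXML_USEDUMP=1 is set.
  run env LATEXML_USEDUMP=1 make test
}

# Preview the steps without running them:
#   DRY_RUN=1 build_with_formats
```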

brucemiller avatar Aug 05 '25 21:08 brucemiller

Yeah, might as well. At least the scary stuff is still optional.

brucemiller avatar Aug 05 '25 22:08 brucemiller

Recording a nice trick with workflows using the cpanminus installation approach. One can opt-in to the formats target by adding it as a build argument. Courtesy of its dependence on "all", it will also run make prior. So this is all it takes:

cd LaTeXML;
cpanm --build-args formats .

dginev avatar Aug 11 '25 19:08 dginev

workflows using the cpanminus installation approach

Does the current dump include installation-dependent details beyond the LaTeX version itself, like hyphenation patterns and default paper size? I was wondering if dumps can be distributed separately.

xworld21 avatar Aug 12 '25 11:08 xworld21

@xworld21 I think not yet. I don't even think we have figured out how far we want to take the approach - the current "precompilation" started off to tackle two big problems: 1) a performance regression with expl3 and 2) getting more coverage from the updated latex kernels.

Turning it into cls-like profiles may be out of scope for now. But there is a related discussion we keep mulling over (no plan yet) about raw interpretation of .cls files. We may want a new issue for more discussion?

dginev avatar Aug 12 '25 15:08 dginev

I'm having trouble with this. I run perl Makefile.PL, make, and then make formats. I think the --init=plain.tex step completes successfully (Conversion complete: 4 warnings), but the --init=latex.ltx step has

Loading /usr/local/texlive/2025/texmf-dist/tex/latex/base/latex.ltx
LaTeX must be made using an initex with no format preloaded: 
!! No syntax for the current directory could be found
\@currdir set to:
Assuming \openin and \input
Defining generic filename parser.
Error:undefined:\@@ The token T_CS[\@@] is not defined. at latex.ltx; line 8624 col 23

before giving 100 more errors and the fatal error. Am I missing a step?

teepeemm avatar Oct 19 '25 00:10 teepeemm

@teepeemm Valid question. This was the last topic we discussed with Bruce before the unfortunate pause of work at NIST. He had identified a regression in make formats caused by the changes made in #2646. The last commit that was healthy for this target is likely e64d7f5f5139b2a57cb0bc6585400b403c94c95c

I can also relay that Bruce was close to a patch for the regression, just before the shutdown took hold. Sadly, almost everything "official" is paused in LaTeXML land until normal operations resume at NIST.
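Until that patch lands, one possible workaround is to build the formats from the commit mentioned above. This is an untested sketch: the function name is made up, it assumes you are inside a git clone of LaTeXML with a clean working tree, and the commit is only "likely" healthy for this target.

```shell
# Commit hash quoted in this thread as probably the last one where make formats worked.
LAST_GOOD_COMMIT=e64d7f5f5139b2a57cb0bc6585400b403c94c95c

# Hypothetical helper: check out that commit and rebuild the format dumps.
build_formats_at_last_good() {
  git checkout "$LAST_GOOD_COMMIT" || return 1
  perl Makefile.PL || return 1
  make || return 1
  make formats
}

# Run manually from the LaTeXML checkout:
#   build_formats_at_last_good
```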

dginev avatar Oct 19 '25 01:10 dginev