LaTeXML
Allow flat xml:id attributes for math
This issue requests a setting that changes the preference for math identifiers from hierarchical to global. I can implement the PR if there is interest.
A global id is very simple to realize in the base case: for a document with 1000 formulas, each with 1000 nodes, we would see the id fragments m1 to m1000000. We could make it mildly more sophisticated by having one counter for top-level formulas (m) and a second counter for inner formula nodes (maybe xm).
Motivation
I am currently bundling the newest arXMLiv dataset and inspecting the sources. Some bits are jarring even on a tenth encounter. I don't have enough space to paste the full formula on GitHub (it easily overflows a screen in the source view), but here is the presentation node for a single open parenthesis (from the second section of arXiv:1410.8088):
<mo id="S2.SS0.Ex5.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2" xref="S2.SS0.Ex5.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml">(</mo>
First, this is quite jarring when a developer or author first encounters it. Second, when multiplied by a billion formulas, it starts to tax the space allocated for arXMLiv.
Tools like tralics prefer a completely flat scheme, such as:
<mo id="cid4209" xref="cid4992">(</mo>
That seems a bit overboard, though it maximizes the savings in size. For HTML, one idea is to go flat only for inner math nodes, where the problem is most pronounced, i.e.
<mo id="S2.SS0.Ex5.m1.xm18" xref="S2.SS0.Ex5.m1.xm18.cmml">(</mo>
Edit: here is also a motivating example where the document context prefix gets rather long:
<mi id="S3.SS2.SSS2.p5.10.m10.1.1.2.2" xref="S3.SS2.SSS2.p5.10.m10.1.1.2.2.cmml">F</mi>
Personally, I would be OK with going one step further and discarding the document context from math element ids, using only a global counter for math nodes with a secondary counter for the internal nodes:
<mo id="m576.xm18" xref="m576.xm18.cmml">(</mo>
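To make the two variants concrete, here is a minimal Python sketch of the renaming as a post-processing pass. It is not LaTeXML code and not the proposed implementation. The names `ID_SHAPE`, `flatten_math_ids`, `rewrite_tree` and the `keep_context` flag are hypothetical, and the sketch assumes every math id has the shape `<context>.m<N>.<numeric path>` with an optional `.cmml` suffix, as in the examples above. Counters are assigned in first-seen order, and a content id shares its number with the presentation id that has the same path.

```python
import re
import xml.etree.ElementTree as ET

# Assumed id shape: [<context>.]m<N>[.<digits>...][.cmml]
ID_SHAPE = re.compile(
    r"^(?P<formula>(?:.*\.)?m\d+)(?P<inner>(?:\.\d+)*)(?P<cmml>\.cmml)?$"
)


def flatten_math_ids(ids, keep_context=False):
    """Map hierarchical math ids to flat ones, numbering in first-seen order."""
    formula_no = {}  # formula id -> global formula counter (the "m" counter)
    node_no = {}     # formula id -> {inner path -> node counter ("xm")}
    mapping = {}
    for old in ids:
        match = ID_SHAPE.match(old)
        if match is None:
            continue  # not a math id under the assumed shape; leave it alone
        formula, inner, cmml = match.group("formula", "inner", "cmml")
        number = formula_no.setdefault(formula, len(formula_no) + 1)
        # keep_context=True keeps the document prefix (S2.SS0.Ex5.m1.xm18 style);
        # otherwise only the global formula counter is used (m576.xm18 style).
        new = formula if keep_context else f"m{number}"
        if inner:
            nodes = node_no.setdefault(formula, {})
            new += f".xm{nodes.setdefault(inner, len(nodes) + 1)}"
        mapping[old] = new + (cmml or "")
    return mapping


def rewrite_tree(root, keep_context=False):
    """Rewrite the id and xref attributes of an element tree in place."""
    attrs = ("id", "xref")
    seen = [el.attrib[a] for el in root.iter() for a in attrs if a in el.attrib]
    mapping = flatten_math_ids(seen, keep_context)
    for el in root.iter():
        for a in attrs:
            if el.attrib.get(a) in mapping:
                el.set(a, mapping[el.attrib[a]])
    return root
```

A switch inside LaTeXML itself would more likely assign the counters while the ids are generated than rename them afterwards; the sketch only illustrates the resulting id shapes.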
To conclude, there are two problems I am motivated to address:
- make the XML of large formulas more readable, by avoiding long horizontal strings in the `id` attribute
- find some secondary wins in making the XML and HTML size more compact
I am aware that the primary behavior must be kept as-is long term, so that DLMF can continue to be regenerated with the math ids it has today. So this ought to be an optional switch, likely in latexml.sty.
There is also something very verbose happening to IDs for math elements that end up inside SVG diagrams. As an example, here is a screenshot from the browser inspector:
From a recent arXiv article: https://arxiv.org/html/2402.12530v1#S6.SS2.SSS0.Px2