Performance issues with semantic enrichment

Open rossberg opened this issue 5 months ago • 5 comments

Issue Summary

With semantic enrichment turned on, pages of the WebAssembly spec take 4x the time to render, with up to 10 or 20 s for some users; see issue WebAssembly/spec#1972.

MathJax 4 apparently turns on semantic enrichment by default. Given the performance hit, this should be considered a breaking change, as we had to turn it off manually.

Steps to Reproduce:

  1. Follow the link in WebAssembly/spec#1972 OP.
  2. Manually turn Semantic Enrichment on again through the MathJax context menu — I see the page freeze for approximately 10 s on my machine.

Technical details:

  • MathJax Version: 4.0
  • Client OS: (e.g., Mac OS X 15.6.1)
  • Browser: (e.g., Safari 18.6)
  • Hardware: MacBook Air M2, 16 GB RAM

I am using the following MathJax configuration:

    window.MathJax = {
        "tex": {
            "maxBuffer": 30720,
            "macros": {
                "multicolumn": ["", 2]
            }
        },
        "options": {
            "menuOptions": {
                "settings": {
                    "enrich": false
                }
            }
        }
    }

and loading MathJax via

    <script defer="defer" src="https://cdn.jsdelivr.net/npm/mathjax@4/tex-mml-chtml.js"></script>

Supporting information:

  • For live link please see above.

rossberg avatar Sep 19 '25 21:09 rossberg

Although MathJax can be used to typeset non-mathematical material, its mission is to handle mathematical layout, and its semantic enrichment is based around the expressions being actual mathematics, so it has to work hard to analyze things like the tables you are presenting.

MathJax converts the TeX input into MathML trees internally, and your tables generate very large trees. For example, the second one in the page you link to has over 1700 nodes in its tree, and that means the semantic analysis has a lot of work to do. And it is not even the largest table on the page, in a page that contains more than a dozen similar tables. You are pushing MathJax limits (as you probably know, since you had to increase its buffer size just to lay out these tables).

One of the things that is making the semantic analysis more difficult is the total number of nodes in the tables, and the way the table is coded contributes to that. For example, the table uses things like \mathtt{0x0A}, which is telling MathJax to treat 0x0A as mathematics, which is parsed as the number 0 followed by a variable x followed by a number 0 followed by a variable A. (MathJax doesn't know about hexadecimal numbers, though you could configure its number pattern to recognize them if you wished.) That means that \mathtt{0x0A} produces four nodes in the MathML tree. It would be better to use \texttt{0x0A}, which produces a single node instead. Similarly, \mathsf{catch\_all\_ref} produces catch followed by _ followed by all followed by _ followed by ref, so 5 nodes, while \textsf{catch\_all\_ref} produces only one node containing catch_all_ref.
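As a rough illustration of the number-pattern remark above, here is a minimal sketch of a pattern that accepts 0x-prefixed hexadecimal literals before falling back to ordinary decimals. The option name `numberPattern` is an assumption about the v4 TeX input configuration (v3 used `digits` for this), and `hexAwareNumber` and `mathJaxConfig` are names invented for the example, so check the option name against the documentation for the version you load.

```javascript
// Hypothetical pattern: try a 0x-prefixed hex literal first, then a plain
// decimal (digits, optional {,}-grouped thousands, optional fraction).
const hexAwareNumber =
  /^(?:0x[0-9A-Fa-f]+|[0-9]+(?:\{,\}[0-9]{3})*(?:\.[0-9]*)?|\.[0-9]+)/;

// Sketch of where the pattern might be placed. The key name is an
// assumption (v3 called this option `digits`).
const mathJaxConfig = {
  tex: {
    numberPattern: hexAwareNumber
  }
};

// In the page this object would be assigned to window.MathJax before
// the MathJax script loads.
```

If the option is honored, 0x0A would be read as one number token rather than four separate nodes. That only helps the hexadecimal literals; the identifier splitting described above is a separate matter.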

Changing from the \math macros to \text ones reduces the number of nodes in that table by over 360 nodes. Your original table takes about 445 milliseconds for the semantic enrichment on my computer, but with the \text macros rather than \math ones, that is reduced to about 290 ms, a savings of 155 ms, or roughly 1/3, from just that one change.

Another source of extra complication is actually the use of ~~ for spacing, as the semantic-enrichment has to try to figure out what the meaning of those two spaces is. It is better to use a single spacing command, like \hspace{.66em}, which would produce the same amount of space with only one node instead of two. That saves you an additional 20 ms.

There are a few other simplifications that can be made as well. For example, there are extra braces that are not needed, as in {\mathtt{0x0A}}, where the outer braces produce an extra nesting node that is not needed. Removing those gets you another 15 to 20 ms. There is also an extra column that is entirely blank. That can be removed, for a savings of an additional 25 to 30 ms.

Taken all together, the changes that I've listed above reduce the semantic processing for this table from 445 down to about 225 ms, a savings of 220 ms, or nearly 50%. That is strictly from improving the structure of the table.

Of course, I suspect you are generating this from an original LaTeX document using some layout package for the syntax tables, and you may not have control over the output. In that case you can still realize much of the savings by making definitions equivalent to

\def\mathtt#1{\texttt{#1}}
\def\mathsf#1{\textsf{#1}}
\def\mathit#1{\textit{#1}}

in your configuration, via

 window.MathJax = {
    "tex": {
        "maxBuffer": 30720,
        "macros": {
            "multicolumn": ["", 2],
            "mathtt": ["\\texttt{#1}", 1],
            "mathit": ["\\textit{#1}", 1],
            "mathsf": ["\\textsf{#1}", 1]
        }
    },
    "options": {
        "menuOptions": {
            "settings": {
                "enrich": false
            }
        }
    }
}

to get the initial 1/3 savings.

You could add a TeX input pre-filter to convert ~~ to \hspace{.66em}, and perhaps you could remove the unwanted column if it is in your original source. Otherwise the pre-filter might be able to remove it (and the array template entry for that column). That would get you most of the savings described above.
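A minimal sketch of such a pre-filter for the ~~ case follows. The helper name `collapseDoubleTies` and the `mathJaxConfig` object are invented for the example, and reaching the TeX input jax as `inputJax[0]` is the v3 pattern; how v4 exposes it is an assumption to verify. Only runs of exactly two ties are rewritten, so single ties and longer runs are left as they are.

```javascript
// Replace each run of exactly two ties (~~) with a single \hspace.
// Single ties and runs of three or more are returned unchanged.
function collapseDoubleTies(tex) {
  return tex.replace(/~+/g, run => (run.length === 2 ? '\\hspace{.66em}' : run));
}

// Sketch of the registration, assuming the TeX jax is the first input jax
// (the v3 pattern; check how v4 exposes it).
const mathJaxConfig = {
  startup: {
    ready() {
      MathJax.startup.defaultReady();
      const texJax = MathJax.startup.document.inputJax[0];
      texJax.preFilters.add(({math}) => {
        math.math = collapseDoubleTies(math.math);
      });
    }
  }
};

// In the page this would be merged into the existing window.MathJax object.
```

Removing the blank column would need a table-specific rewrite of both the rows and the array template, so it is not attempted here.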

The semantic enrichment is required for the speech-text generation (and is also useful for line breaking, which you aren't using here). We need the speech generation to be on by default in order to support users with assistive needs. Expecting those who already have special needs to (a) realize that they can turn on assistive tools in MathJax, (b) know how to do so, and (c) be willing to turn on a feature that can be used to identify them as having special needs (which many such users are rightly hesitant to do) is asking the community that needs the most support to do the most work. We want to make it as easy for them as possible.

In your case, you need to consider how you want your page to be presented for those with assistive needs. Having the entire table read out as a formula is probably not the best solution, so I think you are right to turn off the assistive tools, even if the speed were not an issue for you. These are the kinds of decisions that need to be considered when deciding how to display tables like these.

The real issue, of course, is that you are using MathJax basically for formatting a textual table that has a small amount of math in it. The table would be better handled as an HTML table with embedded math at the locations where it is actually used. I made a rough conversion of that one table to HTML with MathJax only for the math (and a few other things like arrows and the vertical bars and dots), which produced this output

Image

compared to your original of

Image

In this case even though there are over 100 math expressions within this HTML table, the semantic-enrichment only took 25 ms! This is an order of magnitude better than even the best time for your all-in-one-expression table (and your original is over 17 times longer).

This would be the best solution, should it be possible, as it makes the table part of the document (not the math), and allows MathJax to handle just the mathematical part of the table. This will work better with screen readers, as well as being more performant.

As for whether this is a "breaking change", one of the reasons we moved to v4.0 rather than calling it v3.3 is that there are a number of potentially breaking changes, and so you need to opt in to using it by switching to v4 explicitly. We did document that the assistive tools are on by default in v4, but not in the breaking changes section, as I don't consider it a breaking change. Your usage of MathJax to format tables is an outlier, and does push the semantic enrichment to the limit. Most expressions will not have quite the impact that yours have.

dpvc avatar Sep 26 '25 21:09 dpvc

@dpvc, thank you a lot for the extensive feedback. Lots of valuable insight!

So far we've told people to turn semantic enrichment back on manually if they need it. FWIW, one recent complaint with it being off by default has been that search doesn't properly work without it.

Yes, you are guessing correctly that our pages are generated, in fact through a rather complicated tool chain that has to produce both HTML and PDF (and goes through Sphinx, among other things). So there aren't many options for non-Latex means of producing these tabled formulas. This also is the source of the many redundant braces.

The problem with using \text vs \math is that the former is too sensitive to context. For example, it breaks formulas in section headers, where it would suddenly apply inherited text attributes like boldface to the argument, at least in the Latex/PDF output. So at least in proper Latex, it has the wrong semantics. Does MathJax differ on that?

As for ~~, that is good to know! I'll experiment with using hspace instead and see how much it bloats the document size.

Btw, at the top of our wishlist for MathJax is support for \multicolumn. Right now, the HTML output is severely worse than the PDF, because it has to work around the lack of multicolumn. With 4.0 out, are there plans for supporting it in the future?

Thanks again!

rossberg avatar Oct 12 '25 19:10 rossberg

Sorry to not get back to you earlier.

Does MathJax differ on that?

Yes, MathJax doesn't inherit the surrounding font weight or style, so if used in a bold heading, for example, it will not be bold.

I have an example configuration above that would redefine \mathtt to \texttt, etc., which would only affect MathJax and not LaTeX/PDF (since the configuration is only for MathJax). That might be able to improve things, but if you use \mathtt and the others in other settings, that might not be feasible.

Alternatively, you might be able to use a TeX input jax pre-filter that would adjust the TeX before processing it to convert the \math macros to \text ones, if you had a means of identifying the tables. If you can't tell structurally, then a last resort could be the use of a macro that produces no output but that could be used at the start of such tables that the pre-filter could identify. That is, you could define \syntaxTable to be an empty macro that is used at the beginning of these tables and have the pre-filter look for that.
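The marker idea could be sketched as below. The helper name `textifySyntaxTable` and the `MARKER` constant are invented for the example; `\syntaxTable` is the empty macro suggested above and would also need a no-op definition on the LaTeX side. The function is meant to be called from a TeX input pre-filter on the expression's TeX string.

```javascript
// The marker macro named in the text; it produces no output and only
// tags the expressions that should be rewritten.
const MARKER = '\\syntaxTable';

// Convert \mathtt, \mathsf and \mathit to their \text counterparts, but
// only in expressions that contain the marker.
function textifySyntaxTable(tex) {
  if (!tex.includes(MARKER)) return tex;
  return tex.replace(/\\math(tt|sf|it)(?![a-zA-Z])/g, '\\text$1');
}

// MathJax needs the marker defined as an empty macro so it typesets nothing.
const markerMacros = {
  syntaxTable: ''
};
```

Expressions without the marker pass through untouched, so \mathtt keeps its usual meaning elsewhere on the page.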

Another approach would be to post-process the internal MathML produced from the TeX to reduce redundant mrow elements, for example, and recombine the individual characters from \mathtt into a single <mtext mathvariant="monospace"> element. I haven't looked into how hard that would be, but I suspect it would be faster than running the speech-rule-engine on the original MathML.

Those are some options, as I see them.

the top of our wishlist for MathJax is support for \multicolumn

Yes, I know, it is an important feature that is on our wish-list, too. It will require implementing the MathML rowspan and columnspan attributes for mtable, which significantly complicate an already very complex layout process. The CHTML output for mtable will need to be completely rewritten to use something like a CSS grid layout instead of the inline-table that is currently in use, since that doesn't allow for column or row spanning. Trying to determine column widths (or row heights) along with the mtable options for columnwidth (fixed size, auto, fit, percentage width, etc.) and equalcolumns (and equalrows) becomes much more complex. Add in line breaking and the changes to sizes that it causes, and it is quite a scary prospect.

It is on the books to be done, but it is not a simple change.

dpvc avatar Oct 25 '25 23:10 dpvc

@dpvc, I tried your suggestion of macro-redefining \mathtt to \texttt and friends for the HTML. Unfortunately, that breaks some of the typesetting, where it's combined with style options like \scriptstyle. For example:

Image

This is the expected output:

Image

So a global replacement after the fact won't work, I'm afraid, I'll have to find a way of doing it in a context-dependent manner.

rossberg avatar Oct 30 '25 11:10 rossberg

Using \scriptsize rather than \scriptstyle would work in both \text and \math macros, so perhaps you can use that instead? Or use a MathJax TeX input jax pre-filter to convert \scriptstyle to \scriptsize (perhaps only inside \text)? It is also possible to make a text-only macros package that redefines \scriptstyle as \scriptsize.
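A minimal sketch of the "only inside \text" variant of that pre-filter is below. The function name `fixScriptStyleInText` is invented for the example, and the list of \text macros it recognizes is an assumption about what appears in the source. It does simple brace matching on the TeX string, so it would be called from a TeX input pre-filter in the same way as any other string rewrite.

```javascript
// Rewrite \scriptstyle to \scriptsize, but only inside the braced argument
// of \text, \texttt, \textsf, \textit, \textrm or \textbf.
function fixScriptStyleInText(tex) {
  const opener = /\\text(?:tt|sf|it|rm|bf)?\s*\{/g;
  let out = '';
  let pos = 0;
  let m;
  while ((m = opener.exec(tex)) !== null) {
    const start = opener.lastIndex;          // first character of the argument
    let depth = 1;
    let i = start;
    while (i < tex.length && depth > 0) {
      const c = tex[i];
      if (c === '\\') { i += 2; continue; }  // skip control symbols like \{ and \}
      if (c === '{') depth++;
      else if (c === '}') depth--;
      i++;
    }
    // Position of the matching close brace, or end of input if unbalanced.
    const end = depth === 0 ? i - 1 : tex.length;
    out += tex.slice(pos, start) +
      tex.slice(start, end).replace(/\\scriptstyle(?![a-zA-Z])/g, '\\scriptsize');
    pos = end;
    opener.lastIndex = end;                  // continue after this argument
  }
  return out + tex.slice(pos);
}
```

Occurrences of \scriptstyle outside a \text argument are left alone, so ordinary math-mode uses keep their original behavior.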

Out of curiosity, did you get a speed improvement for that change?

dpvc avatar Oct 30 '25 12:10 dpvc

I'm closing this, as I think we have identified the issues and work-arounds.

dpvc avatar Dec 20 '25 17:12 dpvc