Accented characters cause \text to generate <mtext> nodes for individual characters

Open Nemin32 opened this issue 9 months ago • 3 comments

I've been trying to use this tool to create some annotated equations for my website. Because I'm writing them in my native tongue, I need to use accented letters like á, é, ű, and ó. The output appeared fine in the Temml.org editor. However, when I pasted the resulting MathML into my site, the text became unreadable because the spaces disappeared.

I've investigated the MathML output. It turns out that when accented characters are used, the generator creates an <mtext> node for each individual character instead of grouping the text into one node, so spaces become empty <mtext> nodes instead of rendering properly as part of the text.

Steps to reproduce:

  1. Go to https://temml.org/
  2. Enter \text{Helló world}
  3. Switch to MathML display

Expected output:

<math display="block" class="tml-display" style="display:block math;">
  <mtext>Helló world</mtext>
</math>

Actual output:

<math display="block" class="tml-display" style="display:block math;">
  <mrow>
    <mtext>H</mtext>
    <mtext>e</mtext>
    <mtext>l</mtext>
    <mtext>l</mtext>
    <mover>
      <mtext>o</mtext>
      <mo stretchy="false" class="tml-xshift" style="math-style:normal;math-depth:0;">ˊ</mo>
    </mover>
    <mtext>
    </mtext>
    <mtext>w</mtext>
    <mtext>o</mtext>
    <mtext>r</mtext>
    <mtext>l</mtext>
    <mtext>d</mtext>
  </mrow>
</math>

Why is this an issue?

Beyond severely bloating the output, Temml generates empty <mtext> nodes for spaces, causing the text to become smushed together and unreadable. Using accent functions like \'{o} instead of the actual letter still reproduces the issue.

[Image: the three rendered rows listed below]

  1. Row generated using \text{Hello world}
  2. Row generated using \text{Hello world} and o swapped for ó manually in the MathML output.
  3. Row generated using \text{Helló world}

Nemin32, May 03 '24 07:05

I have been unable to reproduce the lack of a space, but I agree with the rest of your analysis. Temml ought to do better. I will work on it, but be advised that this is not a trivial issue. It will take some time. I like to do one release each month and I'll make this issue the focus of this month's work.

If you are interested in the inner workings, read on.

Temml is forked from KaTeX and the fix for this issue will involve improvements to the legacy code base. KaTeX is tightly bound to the KaTeX fonts, which do not contain glyphs for accented characters like ó. So the KaTeX parser breaks such characters apart into their Unicode normalized version. That way, it can use the KaTeX font's o glyph and an acute accent glyph.

Temml uses math fonts that contain many more glyphs, including some accented characters like ó. I will need to revise the Temml parser to avoid Unicode normalization when it is unnecessary, and possibly to avoid normalization altogether.
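To make the normalization point concrete, here is a minimal Python sketch. It is not Temml's or KaTeX's code (both are JavaScript); it assumes the parser's decomposition behaves like Unicode canonical decomposition (NFD), and the variable names are made up for illustration.

```python
import unicodedata

text = "Helló"

# Canonical decomposition: the precomposed 'ó' (U+00F3) is split into
# a base letter 'o' (U+006F) followed by a combining acute accent (U+0301).
decomposed = unicodedata.normalize("NFD", text)

# Canonical composition reverses the split when a precomposed form exists.
recomposed = unicodedata.normalize("NFC", decomposed)

# List the code points so the extra combining character is visible.
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(code_points)
```

Once the letter and its accent are separate characters, a parser that handles text one character at a time has to emit an accent construct for the pair, which is where the <mover> in the reported output comes from.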

Both LaTeX and KaTeX treat text on a character-by-character basis. In Temml, I've added code to consolidate \text{…} groups into a single mtext element. But Temml reverts to single-character elements if it encounters something in the group which does not qualify as plain text. It reacts badly to a <mover> element inside a text group. By avoiding Unicode normalization, we'll get the desired (consolidated) mtext element.
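The consolidation behavior described above can be sketched roughly as follows. This is a hypothetical Python illustration, not Temml's actual code: the function name `consolidate` and the modeling of nodes as `(tag, content)` pairs are assumptions made for the example.

```python
def consolidate(nodes):
    """Merge a text group into one mtext node when every child is plain text.

    `nodes` is a list of (tag, content) pairs, a simplified stand-in for
    MathML elements. If anything other than mtext is present, the group is
    returned unchanged, i.e. as single-character elements.
    """
    if nodes and all(tag == "mtext" for tag, _ in nodes):
        return [("mtext", "".join(content for _, content in nodes))]
    return nodes


# A group of plain characters collapses into a single node.
plain = [("mtext", ch) for ch in "Hello world"]
print(consolidate(plain))

# A single non-text child (here a stand-in for <mover>) blocks the merge.
mixed = [("mtext", "l"), ("mover", "o + accent"), ("mtext", " ")]
print(consolidate(mixed))
```

In this model, keeping ó as one character means the group never contains a non-text node, so the merge always goes through.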

ronkok, May 03 '24 16:05

> I like to do one release each month and I'll make this issue the focus of this month's work.

Thank you very much.

Temml is an extremely handy tool. Previously I've largely hand-wrangled MathML, but once that became severely unwieldy, I looked for another solution. Yours just worked and supports all my use-cases so far, so beyond this slight niggle, I'm very happy to use it.

> I have been unable to reproduce the lack of a space

The odd thing is that when I enter the accented version on your site, it previews just fine. However, when I take the MathML output, paste it into its own HTML file, and open that file in my browser (I've tested both Firefox and Chromium), the spaces are gone. Sorry if my initial report wasn't entirely clear on this detail. If you take the output and put it in its own file, can you reproduce the issue?

I suspect the reason why this happens on my end is because Temml generates the following:

<mtext>[a space literal]
</mtext>

Which I assume is subsequently considered throwaway whitespace by the HTML parser, just like how it would throw away the four spaces and linebreak in the paragraph element below:

<p>
    An example.
</p>

Meanwhile, in the consolidated output the space is surrounded by other characters, so the HTML parser considers it significant. One solution I could imagine is making an exception for spaces in the TeX parser so that it outputs &nbsp; instead, which gets rendered properly by browsers. Though, of course, I'm not sure if that wouldn't cause issues elsewhere; it's just a first guess.
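A minimal Python sketch of that suggestion follows. It is only an illustration of the idea, not Temml's code: the helper names `is_collapsible` and `mtext` are hypothetical, and it assumes the renderer discards token content made up solely of ASCII whitespace while leaving a no-break space (U+00A0) alone.

```python
# ASCII whitespace characters that are assumed to be trimmed or collapsed.
ASCII_WHITESPACE = " \t\n\r\f"


def is_collapsible(content):
    """Return True if the content consists only of ASCII whitespace."""
    return all(ch in ASCII_WHITESPACE for ch in content)


def mtext(content):
    """Serialize text as an mtext element, swapping spaces for U+00A0."""
    return "<mtext>" + content.replace(" ", "\u00a0") + "</mtext>"


# The reported node holds a space and a line break, so it can vanish.
print(is_collapsible(" \n"))

# A no-break space is not ASCII whitespace, so it would survive.
print(is_collapsible("\u00a0"))

print(mtext("Helló world"))
```

Note that Python's own `str.strip()` and `str.isspace()` treat U+00A0 as whitespace, which is why the sketch checks against an explicit character set instead.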

Thank you again for looking at the issue and if there's anything I can report back to help your work, please feel free to just ask.

Nemin32, May 03 '24 17:05

> instead output &nbsp;

That may help. It's worth a look.

ronkok, May 03 '24 17:05

I think you will find that release v0.10.27 has resolved this issue.

ronkok, May 14 '24 01:05

Yes, now it works perfectly. Thank you for your speedy work!

Nemin32, May 14 '24 07:05