
luatex + harfbuzz and the zero width joiner U+200D

Open ralessi opened this issue 5 years ago • 7 comments

In some cases, namely when commands are inserted between characters, luatex + harfbuzz do not seem to handle the zero width joiner character (U+200D) properly. Consider the following example, to be compiled with lualatex-dev:

\documentclass[12pt]{article}
\usepackage{fontspec}

\newfontfamily\arabicfont{Amiri}[Script=Arabic]
\newfontfamily\arabicfonthb{Amiri}[Script=Arabic,Renderer=Harfbuzz]

\usepackage{ulem}

\begin{document}

\textdir TRT\arabicfont
دَخَلَ مُب‍\uline{‍تَ‍}‍سِمًا

\medskip

\textdir TRT\arabicfonthb
دَخَلَ مُب‍\uline{‍تَ‍}‍سِمًا

\end{document}

[image: test-zwj]

ralessi avatar Feb 29 '20 08:02 ralessi

I don't get your output with the development version of luaotfload. With it, the result looks like this:

[image]

This is still not correct, but

  • I don't think it is a fontspec issue; it would be better reported on the luaotfload GitHub.
  • It is probably not even a luaotfload issue: you are inserting a rule between the characters, and HarfBuzz doesn't like this.
  • You should probably do the underlining with Lua code to avoid this side effect. See e.g. this code from @zauguin: https://tex.stackexchange.com/a/446488/2388

u-fischer avatar Feb 29 '20 12:02 u-fischer

Thank you for the references, which I will explore. I suspected that this might be unrelated to fontspec. Do you think it would be worth reporting this issue (which may again turn out to be unrelated) to the luaotfload bug tracker?

ralessi avatar Feb 29 '20 12:02 ralessi

FWIW, this seems to be a regression in luaotfload. Trying the following with harflatex and the old harf code:

\documentclass[12pt]{minimal}
\usepackage{harfload}
\usepackage{ulem}
\begin{document}

\font\arabicfont="[Amiri-Regular.ttf]:mode=harf"
\textdir TRT\arabicfont
مُب^^^^200d\uline{^^^^200dتَ^^^^200d}^^^^200dسِم

\end{document}

Gives:

khaledhosny avatar Mar 01 '20 21:03 khaledhosny

This was a luaotfload bug which is resolved in the latest dev branch.

zauguin avatar Mar 03 '20 21:03 zauguin

The behavior of HarfBuzz seems a bit odd here but I don't know enough about the script to say if it is a bug or expected behaviour:

The luaotfload bug was that in \hboxes the direction wasn't recognized correctly. So the \uline argument was set as TLT instead of TRT.

Now to the odd part: for some reason, HarfBuzz seems to reverse the cluster containing the Arabic characters and ignore the preceding ZWJ. This can be reproduced with hb-shape:

hb-shape --direction=rtl --font-file "$(kpsewhich Amiri-Regular.ttf)" --script=arab --unicodes=U+200D,U+062A,U+064E,U+200D

gives

[space=1+0|uni064E=1@-188,0+0|uni062A.medi=1+244|space=0+0]

as expected, but replacing --direction=rtl with --direction=ltr gives

[space=0+0|space=1+0|uni064E=1@-212,0+0|uni062A.init=1+190]

In particular, both space glyphs representing the ZWJs end up at the beginning, and the initial form is used instead of the medial one.

@khaledhosny Is this supposed to happen?

zauguin avatar Mar 03 '20 21:03 zauguin

Yes, sort of.

HarfBuzz wants to shape scripts in their native direction. So when a direction other than the script's native direction is set, HarfBuzz reverses the buffer before shaping. It also avoids breaking grapheme clusters, as one does not want, say, a mark to precede its base. ZWJ is a grapheme extender, so the first ZWJ is considered a grapheme cluster by itself (as it extends nothing), and the base+mark+ZWJ are considered another grapheme cluster.

<U+200D>,<U+062A,U+064E,U+200D>

After reversal:

<U+062A,U+064E,U+200D>,<U+200D>

After shaping the buffer will be reversed again since the native direction is RTL (a simple reversal this time with no grapheme clusters business).

U+062A,U+064E,U+200D,U+200D

After reversal:

U+200D,U+200D,U+064E,U+062A
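
The two reversals described above can be traced with a small Python sketch. It is only an illustration, not HarfBuzz's implementation: the helper names (`is_extender`, `split_clusters`, `reverse_by_cluster`) are made up for this example, the cluster rule is a simplification (nonspacing or enclosing marks and ZWJ attach to whatever precedes them), and the second reversal is applied to code points, whereas HarfBuzz reverses the shaped glyph buffer.

```python
import unicodedata

ZWJ = 0x200D


def is_extender(cp: int) -> bool:
    # Simplified stand-in for "grapheme extender": marks (Mn, Me) and ZWJ.
    return cp == ZWJ or unicodedata.category(chr(cp)) in ("Mn", "Me")


def split_clusters(cps):
    # An extender joins the previous cluster; anything else starts a new one.
    # An extender at the very start has nothing to extend, so it stands alone.
    clusters = []
    for cp in cps:
        if clusters and is_extender(cp):
            clusters[-1].append(cp)
        else:
            clusters.append([cp])
    return clusters


def reverse_by_cluster(cps):
    # Reverse the order of clusters but keep each cluster's internal order.
    return [cp for cluster in reversed(split_clusters(cps)) for cp in cluster]


def fmt(cps):
    return ",".join(f"U+{cp:04X}" for cp in cps)


text = [0x200D, 0x062A, 0x064E, 0x200D]

before_shaping = reverse_by_cluster(text)  # cluster-aware reversal
after_shaping = before_shaping[::-1]       # plain reversal, clusters ignored

print(fmt(before_shaping))  # U+062A,U+064E,U+200D,U+200D
print(fmt(after_shaping))   # U+200D,U+200D,U+064E,U+062A
```

After the first reversal the tāʾ sits at the start of the buffer with no joiner before it, which is consistent with the initial form (`uni062A.init`) seen in the `--direction=ltr` output.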

If you set the script to latn when the direction is ltr, no reversal will happen:

 $ hb-shape --direction=ltr --font-file "$(kpsewhich Amiri-Regular.ttf)" --script=latn --unicodes=U+200D,U+062A,U+064E,U+200D
 [space=0+0|uni062A=1+926|uni064E=1+0|space=1+0]

latn with rtl will do the initial reversal but not the last one:

$ hb-shape --direction=rtl --font-file "$(kpsewhich Amiri-Regular.ttf)" --script=latn --unicodes=U+200D,U+062A,U+064E,U+200D
[uni062A=1+926|uni064E=1+0|space=1+0|space=0+0]

Shaping a script in a direction other than its native direction is risky and unlikely to always give a meaningful result.

khaledhosny avatar Mar 03 '20 23:03 khaledhosny

@khaledhosny Thank you.

zauguin avatar Mar 04 '20 12:03 zauguin