luaotfload
harf: missed hyphenation points with kerns
With harf, almost no hyphenation points are found here. The reason is that harf mode creates one huge discretionary containing all those letters, instead of inserting small ones with just the kern in replace and the hyphen in pre (see https://github.com/latex3/babel/issues/68).
\documentclass{article}
\newlanguage\testl
\language\testl
\usepackage{fontspec}
\setmainfont
[Renderer=Harfbuzz] % Wrong with harf. Comment this line out and it's OK
{FreeSerif}
\patterns{á3é3}
\begin{document}
\hsize1pt
xxxáéáéáéáéáéáéxxx
\end{document}
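For illustration, the "small" discretionaries described above can be written by hand in plain TeX. This is only a sketch: the letters are ASCII stand-ins for the accented ones so that it runs with the default font, and the kern amounts are made-up placeholders, not values read from FreeSerif.

```tex
% Sketch only: hand-made discretionaries of the "small" kind.
% \discretionary{pre}{post}{replace}
%   pre     = hyphen (used when the line breaks here)
%   post    = empty
%   replace = only the kern (used when the line does not break here)
% The kern values below are illustrative placeholders.
\pretolerance=-1
\hsize=1pt
\noindent
xxxa\discretionary{-}{}{\kern-0.05pt}%
e\discretionary{-}{}{\kern-0.1pt}%
a\discretionary{-}{}{\kern-0.05pt}%
exxx
\bye
```

Because each discretionary holds nothing but the hyphen and one kern, every hyphenation point stays available to the line breaker as a separate break candidate.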
If you give me a pointer to the code handling this, I could take a look at it.
@jbezos It's here: The run, basically a sequence of glyphs with the same font, is shaped and then the discretionaries are analyzed. Every discretionary is extended until its boundaries are glyphs which HarfBuzz reports as "safe to break". (See https://github.com/harfbuzz/harfbuzz/issues/1463 for a description of what "safe to break" means, why it is not actually the right property, what a more useful property would look like, and how more correct linebreaking for OpenType could work in general.) This is the step which drops all other disc nodes in the affected segment. We can't skip it because otherwise characters would interact with characters from other lines. Also, HarfBuzz does not indicate why it considers a point "unsafe to break", so we can't easily treat kerning as a special case.
Thank you. Quite a challenge 😉.
There's a reason why I wasn't all that motivated to work on this directly: I thought about it quite a bit when we imported the harf code but didn't come up with a good solution yet.
I'm looking forward to your ideas :)
And there is a reason I'm motivated 😉 – I've continued doing tests, and a good many hyphenation points are definitely not found, particularly in heavily kerned fonts. For example, with FreeSerif and Spanish, I get en-tre-acto with harf instead of en-tre-ac-to (here harf is internally creating two long discretionaries followed by the letter o).
This is a severe typographical issue which could make Harfbuzz unusable in hyphenated languages, particularly in narrow texts.
I'll do my best to find a solution, because I firmly believe the future of luatex depends on the proper functioning of Harfbuzz. But first I have to read (and understand) the code 🧐.
I have dedicated some afternoons to explore a fix, and while I've learnt a little bit about Harfbuzz and how luaotfload uses it, I'll give up for now because babel is my priority and also because words are hyphenated correctly with the default font loader, so in many cases the solution is as simple as not using Harfbuzz in Latin/Greek/Cyrillic. I'll just add a note in the babel manual explaining this limitation.
That's sad, but I'll keep thinking about possible solutions.
Personally I think that it is quite good that we have two modes to handle different requirements: harfbuzz for complex scripts and node mode for complex hyphenation ;-)
> the solution is as simple as not using Harfbuzz in Latin/Greek/Cyrillic
I recall that Marcel explained to me recently that harfbuzz is not needed for such scripts, so there is no problem in not using it for them.
But I believe that harfbuzz is on/off depending on the loading of each font, so it is not clear what to do when loading a font that covers, say, arabic and latin scripts. ??
> But I believe that harfbuzz is on/off depending on the loading of each font, so it is not clear what to do when loading a font that covers, say, arabic and latin scripts. ??
You would almost certainly load the font twice, one with harfbuzz and one without …
W
So Javier's problem is solved. Good!
It is, yes, but it means, for example, that I couldn’t really switch over to use Harfbuzz by default in LuaTeX (at least at the moment). I’m glad this came up here because I didn’t realise it was a problem…
> So Javier's problem is solved. Good!
It is as long as you are happy with the ConTeXt fontloader. While HarfBuzz normally isn't needed for these scripts, there are still fonts which profit from using it, especially on macOS (traditional macOS fonts used AAT instead of OTL tables, so their ligatures/kerning only work with HarfBuzz, which implements AAT too).
So "solved" goes a bit far, but it can normally be avoided by using the right mode for the script.
> > But I believe that harfbuzz is on/off depending on the loading of each font, so it is not clear what to do when loading a font that covers, say, arabic and latin scripts. ??
> You would almost certainly load the font twice, one with harfbuzz and one without … W
At least twice: We normally expect the font to be loaded separately for every script, so to get optimal shaping such code should load the font twice already. (Using multiscript= you can then make the font appear as one for the frontend, but that works independently of the mode option.)
HarfBuzz handles all scripts it supports equally, and there have often been OpenType Latin fonts that were subtly broken with the ConTeXt font loader (off the top of my head, both Libertinus and EB Garamond had issues with it).
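At the fontspec level, loading the same font once per renderer might look like the sketch below. It assumes LuaLaTeX, an installed FreeSerif, and that FreeSerif's Devanagari coverage is sufficient for the text; the family name \hbfont is made up for this example.

```tex
% Sketch only: one font file, loaded twice with different renderers.
% Compile with lualatex. \hbfont is a hypothetical name chosen for this example.
\documentclass{article}
\usepackage{fontspec}
% Default renderer (node mode): hyphenation points are kept as usual.
\setmainfont{FreeSerif}
% Second instance of the same font, shaped by HarfBuzz for a complex script.
\newfontfamily\hbfont[Renderer=Harfbuzz,Script=Devanagari]{FreeSerif}
\begin{document}
Latin text set in node mode, then {\hbfont हिन्दी} set with HarfBuzz.
\end{document}
```

The cost is that the document has to switch families explicitly (or let a package such as babel do it per language), since the renderer is fixed when each font instance is loaded.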
XeTeX also uses HarfBuzz with TeX-style hyphenation, so it should be doable. I tried my best to get hyphenation working with HarfBuzz, and I wasn’t aware of such limitations. My memory of that code is already fading, so I won’t be of much help, but if the approach I used is not working then it might be worth checking what XeTeX is doing and if it is any better.
@khaledhosny Thanks for the information about XeTeX’s use of harfbuzz. I knew that XeTeX was designed around the basic ideas of harfbuzz and used it by default for all text and all fonts (with a switch to use ‘classic TeX font access’). I was not so sure that this had not changed from the original conception.
@khaledhosny It would be very helpful to know about any form of documentation of the internals of harfbuzz.
For example, how it is intended to deal with word-breaking and how its use interacts with discretionaries etc., or maybe with other line-breaking technologies/algorithms.
> So Javier's problem is solved. Good!
😀 But... 🤔 not quite. If you need both stacking diacritics and proper hyphenation you have a problem. Not frequent, but not impossible. So, in these cases there is still a plan C - use xetex 🙃. An example with babel would be:
\babelfont[*latin]{rm}{FreeSerif} % ie, languages with the Latin script
\babelfont[hindi]{rm}[Renderer=Harfbuzz]{FreeSerif}
But ideally, it should be a single declaration, to set the fonts for all languages at once.
More ambitious: fix lua(b)tex (or at least lua(b)tex+otfload) to work just like XeTeX. Either always, or just in certain cases??
Failing that (or meanwhile) it looks as if fontspec (+babel maybe) needs to hide all the mess (e.g., loading the correct fonts as many times as necessary) (behind a DWIM interface:-) so that a babel user does not need to get their hands dirty.
Unfortunately, switching to xetex cannot be hidden very easily!
Actually, this is what \babelfont does, and it's its main purpose (besides switching the font when a language is selected, of course). But if you need two font loaders, you still need at least two declarations, because babel can't know which one is appropriate for each script, and because fonts with sparse kerning could be fine even with lua/harf, since hyphenation points are only discarded when there is a kern nearby.
Maybe as context a bit about how XeTeX approaches this: (I will assume \pretolerance=-1, otherwise a separate run without hyphenation is tried first.)
XeTeX splits text at hyphenation points and then shapes only the parts between the hyphenation points. This ignores ligatures/kerning across discretionaries. These widths are then used for linebreaking. After linebreaking, the horizontal lists are reshaped, this time taking kerning/ligatures across discretionaries into account.
The effect of this can be exaggerated using the Font Awesome font: This font has e.g. a ligature which replaces the word "binoculars" with a binoculars icon. So the XeTeX document
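The difference between measuring segments separately and measuring them together can be seen even in plain TeX with its default font, assuming A and V form a kerned pair there (they do in Computer Modern). This is only an analogy to the behaviour described above, not XeTeX's actual code; the box and register numbers are arbitrary.

```tex
% Sketch only: width of two letters measured separately vs. together.
% If the font kerns the pair, the two results differ by the kern amount.
\setbox0=\hbox{AV}   % measured together: the A-V kern is applied
\setbox2=\hbox{A}    % measured separately: no kern can be applied
\setbox4=\hbox{V}
\dimen0=\wd2
\advance\dimen0 by \wd4
\message{together: \the\wd0, separately: \the\dimen0}
\bye
```

Line breaking based on the "separately" widths therefore reserves slightly too much (or, with a ligature, far too much) space compared with what is finally typeset.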
\ifx\directlua\undefined\else
\input luaotfload.sty
\fi
\font\FA"[FontAwesome5Free-Solid-900.otf]:script=latn,mode=node,+liga" at 10pt
\pretolerance-1
\FA
\count10=1
\loop
binoculars
\ifnum\count10<200
\advance\count10 by 1
\repeat
\bye
results in

where you can clearly see that it leaves space for the word "binoculars" every time the ligature is substituted.
For comparison, LuaTeX gives:

In most practical cases this is not much of a problem though: Many paragraphs don't need hyphenation in the first place so they are good anyway (if \pretolerance has reasonable values) and in the remaining cases most ligatures have roughly the width of the original characters, so the effect is barely visible.
How does this change in LuaTeX? LuaTeX applies hyphenation to all paragraphs, independent of \pretolerance. (\pretolerance is still implemented, but instead of only hyphenating if a second pass is needed, LuaTeX ignores discretionaries in the first pass.) So either we change the linebreak routines or our discretionary handling has to be applied to all paragraphs.
> So Javier's problem is solved. Good!
> 😀 But... 🤔 not quite. If you need both stacking diacritics and proper hyphenation you have a problem.
Yes, that was the starting problem. But imho, instead of trying to solve this by getting more discretionaries in harf mode, it sounds more promising to try to get rid of the wrong discretionary in node mode with the grapheme analysis suggested by @zauguin in the babel issue.
> @khaledhosny It would be very helpful to know about any form of documentation of the internals of harfbuzz.
> For example, how it is intended to deal with word-breaking and how its use interacts with discretionaries etc., or maybe with other line-breaking technologies/algorithms.
I don’t think HarfBuzz internals are relevant here. TeX's model of discretionary line breaks is rather unique, so neither HarfBuzz nor any other text layout library I’m familiar with has any special support for the way TeX does it.
I’m trying to reproduce the original issue in Plain, but I can’t:
\input luaotfload.sty
\font\test={file:FreeSerif.ttf:mode=harf}
\test
\newlanguage\testl
\language\testl
\patterns{á3é3}
\hsize1pt
xxxáéáéáéáéáéáéxxx
\bye
gives:

Any idea what is missing?
Never mind, I always forget you need to set the script with luaotfload.
FWIW, I tried to reproduce this with the old harf package (commit https://github.com/khaledhosny/harf/commit/3d0fd4b0b6d30265365b1e32c41a91613b073212 so that I can use it without messing with luaotfload), and I can’t reproduce the issue while I can reproduce it with luaotfload’s harf mode:
\iftrue
\directlua{require('harf-plain')}
\else
\input luaotfload.sty
\fi
\font\test="[FreeSerif.ttf]:mode=harf;script=latn;language=dflt"\test
\newlanguage\testl\language\testl
\patterns{á3é3}
\hsize1pt
xxxáéáéáéáéáéáéxxx
\bye
Adding
\input nodetree
\nodetreeregister{line}
With luaotfload and harf mode:
Callback: linebreak_filter
- is_display: false
------------------------------------------
├─LOCAL_PAR
│ ╚═attr:
│ ├─ATTRIBUTE_LIST
│ └─ATTRIBUTE
├─HLIST subtype: indent; width: 20pt;
│ ╚═attr:
│ ├─ATTRIBUTE_LIST
│ └─ATTRIBUTE
├─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
│ ╚═attr:
│ ├─ATTRIBUTE_LIST
│ └─ATTRIBUTE
├─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
│ ╚═attr:
│ ├─ATTRIBUTE_LIST
│ └─ATTRIBUTE
├─DISC penalty: 50;
│ ╠═post:
│ ║ ├─GLYPH subtype: 256; char: "é"; lang: 1; width: 4.44pt; height: 6.76pt; depth: 0.1pt;
│ ║ │ ╚═attr:
│ ║ │ ├─ATTRIBUTE_LIST
│ ║ │ └─ATTRIBUTE
│ ║ ├─KERN kern: -0.1pt;
│ ║ │ ╚═attr:
│ ║ │ ├─ATTRIBUTE_LIST
│ ║ │ └─ATTRIBUTE
│ ║ ├─GLYPH subtype: 256; char: "á"; lang: 1; width: 4.35pt; height: 6.76pt; depth: 0.1pt;
│ ║ │ ╚═attr:
│ ║ │ ├─ATTRIBUTE_LIST
│ ║ │ └─ATTRIBUTE
│ ║ ├─KERN kern: -0.05pt;
│ ║ │ ╚═attr:
│ ║ │ ├─ATTRIBUTE_LIST
│ ║ │ └─ATTRIBUTE
│ ║ ├─GLYPH subtype: 256; char: "é"; lang: 1; width: 4.44pt; height: 6.76pt; depth: 0.1pt;
│ ║ │ ╚═attr:
│ ║ │ ├─ATTRIBUTE_LIST
│ ║ │ └─ATTRIBUTE
│ ║ ├─KERN kern: -0.2pt;
│ ║ │ ╚═attr:
│ ║ │ ├─ATTRIBUTE_LIST
│ ║ │ └─ATTRIBUTE
│ ║ └─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
│ ║ ╚═attr:
│ ║ ├─ATTRIBUTE_LIST
│ ║ └─ATTRIBUTE
│ ╠═pre:
│ ║ ├─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
│ ║ │ ╚═attr:
│ ║ │ ├─ATTRIBUTE_LIST
│ ║ │ └─ATTRIBUTE
│ ║ ├─KERN kern: -0.2pt;
│ ║ │ ╚═attr:
│ ║ │ ├─ATTRIBUTE_LIST
│ ║ │ └─ATTRIBUTE
│ ║ ├─GLYPH subtype: 256; char: "á"; lang: 1; width: 4.35pt; height: 6.76pt; depth: 0.1pt;
│ ║ │ ╚═attr:
│ ║ │ ├─ATTRIBUTE_LIST
│ ║ │ └─ATTRIBUTE
│ ║ └─GLYPH subtype: 256; char: "-"; lang: 1; width: 3.33pt; height: 2.57pt; depth: -1.94pt;
│ ║ ╚═attr:
│ ║ ├─ATTRIBUTE_LIST
│ ║ └─ATTRIBUTE
│ ╠═attr:
│ ║ ├─ATTRIBUTE_LIST
│ ║ └─ATTRIBUTE
│ ╚═replace:
│ ├─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
│ │ ╚═attr:
│ │ ├─ATTRIBUTE_LIST
│ │ └─ATTRIBUTE
│ ├─KERN kern: -0.2pt;
│ │ ╚═attr:
│ │ ├─ATTRIBUTE_LIST
│ │ └─ATTRIBUTE
│ ├─GLYPH subtype: 256; char: "á"; lang: 1; width: 4.35pt; height: 6.76pt; depth: 0.1pt;
│ │ ╚═attr:
│ │ ├─ATTRIBUTE_LIST
│ │ └─ATTRIBUTE
│ ├─KERN kern: -0.05pt;
│ │ ╚═attr:
│ │ ├─ATTRIBUTE_LIST
│ │ └─ATTRIBUTE
│ ├─GLYPH subtype: 256; char: "é"; lang: 1; width: 4.44pt; height: 6.76pt; depth: 0.1pt;
│ │ ╚═attr:
│ │ ├─ATTRIBUTE_LIST
│ │ └─ATTRIBUTE
│ ├─KERN kern: -0.1pt;
│ │ ╚═attr:
│ │ ├─ATTRIBUTE_LIST
│ │ └─ATTRIBUTE
│ ├─GLYPH subtype: 256; char: "á"; lang: 1; width: 4.35pt; height: 6.76pt; depth: 0.1pt;
│ │ ╚═attr:
│ │ ├─ATTRIBUTE_LIST
│ │ └─ATTRIBUTE
│ ├─KERN kern: -0.05pt;
│ │ ╚═attr:
│ │ ├─ATTRIBUTE_LIST
│ │ └─ATTRIBUTE
│ ├─GLYPH subtype: 256; char: "é"; lang: 1; width: 4.44pt; height: 6.76pt; depth: 0.1pt;
│ │ ╚═attr:
│ │ ├─ATTRIBUTE_LIST
│ │ └─ATTRIBUTE
│ ├─KERN kern: -0.2pt;
│ │ ╚═attr:
│ │ ├─ATTRIBUTE_LIST
│ │ └─ATTRIBUTE
│ └─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
│ ╚═attr:
│ ├─ATTRIBUTE_LIST
│ └─ATTRIBUTE
├─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
│ ╚═attr:
│ ├─ATTRIBUTE_LIST
│ └─ATTRIBUTE
├─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
│ ╚═attr:
│ ├─ATTRIBUTE_LIST
│ └─ATTRIBUTE
├─PENALTY subtype: 2; penalty: 10000;
│ ╚═attr:
│ ├─ATTRIBUTE_LIST
│ └─ATTRIBUTE
└─GLUE subtype: parfillskip; stretch: +1fil;
╚═attr:
├─ATTRIBUTE_LIST
└─ATTRIBUTE
-----------------------
With harf only:
Callback: linebreak_filter
- is_display: false
------------------------------------------
├─LOCAL_PAR
├─HLIST subtype: indent; width: 20pt;
├─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
├─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
├─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
├─KERN kern: -0.2pt;
├─GLYPH subtype: 256; char: "á"; lang: 1; width: 4.35pt; height: 6.76pt; depth: 0.1pt;
├─DISC subtype: regular; penalty: 50;
│ ╠═replace:
│ ║ └─KERN kern: -0.05pt;
│ ╚═pre:
│ └─GLYPH subtype: 256; char: "-"; lang: 1; width: 3.33pt; height: 2.57pt; depth: -1.94pt;
├─GLYPH subtype: 256; char: "é"; lang: 1; width: 4.44pt; height: 6.76pt; depth: 0.1pt;
├─DISC subtype: regular; penalty: 50;
│ ╠═replace:
│ ║ └─KERN kern: -0.1pt;
│ ╚═pre:
│ └─GLYPH subtype: 256; char: "-"; lang: 1; width: 3.33pt; height: 2.57pt; depth: -1.94pt;
├─GLYPH subtype: 256; char: "á"; lang: 1; width: 4.35pt; height: 6.76pt; depth: 0.1pt;
├─DISC subtype: regular; penalty: 50;
│ ╠═replace:
│ ║ └─KERN kern: -0.05pt;
│ ╚═pre:
│ └─GLYPH subtype: 256; char: "-"; lang: 1; width: 3.33pt; height: 2.57pt; depth: -1.94pt;
├─GLYPH subtype: 256; char: "é"; lang: 1; width: 4.44pt; height: 6.76pt; depth: 0.1pt;
├─DISC subtype: regular; penalty: 50;
│ ╠═replace:
│ ║ └─KERN kern: -0.2pt;
│ ╚═pre:
│ └─GLYPH subtype: 256; char: "-"; lang: 1; width: 3.33pt; height: 2.57pt; depth: -1.94pt;
├─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
├─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
├─GLYPH subtype: 256; char: "x"; lang: 1; width: 4.82pt; height: 4.5pt;
├─PENALTY subtype: 2; penalty: 10000;
└─GLUE subtype: parfillskip; stretch: +1fil;
-----------------------
@khaledhosny The old harf code only did reshaping around discretionary nodes if the discretionary node interacted with a ligature in the initial shaping pass. In the common case of - style discretionary nodes that just meant that ligatures/kerning involving the hyphen were ignored, for more complex discretionary uses the problems become more significant.
So to trigger the issue there, the first possible linebreak must be inside of a possible ligature.
> @khaledhosny The old harf code only did reshaping around discretionary nodes if the discretionary node interacted with a ligature in the initial shaping pass.
So what string would I use to reproduce this with the old code? (I’m trying to see what my code was doing wrong and to think of a better approach.)
> In the common case of - style discretionary nodes that just meant that ligatures/kerning involving the hyphen were ignored, for more complex discretionary uses the problems become more significant.
My memory is fading, but I feel like this was a deliberate decision as ligatures with the hyphen would be rare, and lack of kerning would be an acceptable compromise (I don’t know any application that would apply any OpenType layout whatsoever to the automatically inserted hyphen). Or am I misunderstanding the limitation?
Also see this: https://github.com/harfbuzz/harfbuzz/pull/3297