allsorts Support for reporting shaping boundaries?

In order to avoid redoing shaping work when doing line breaking, it would be very useful for the client of the shaping library to query for what I'm here calling "shaping boundaries." This is reported in (recent) DirectWrite shaping through the canBreakShapingAfter flag, which is very well explained by Sergey Malkin in https://github.com/harfbuzz/harfbuzz/issues/1463#issuecomment-505592189.

I had originally thought (and blogged) that I could use the HarfBuzz UNSAFE_TO_BREAK flag, but a careful read of that issue reveals that it is overly optimistic.

Having support for this would dramatically increase my interest in using an alternate shaping engine. Looking back over past work, I now realize this is one of the things I got kinda stuck on. I now can see how to break through it using hacks and approximations, but it would be very gratifying to get a solution that is both principled and fast.

Sep 04 '20 06:09 raphlinus

It is certainly our ambition to be both principled and fast, although that can be a daunting proposition where fonts and complex script shaping are concerned! We will put some thought into this issue for next year.

Sep 04 '20 07:09 mikeday

Good to hear. This feature is primarily motivated by the needs of interactive GUI (where you might relayout a paragraph many times to deal with resize), but might possibly be useful for Prince as well, to avoid width measurement errors for line breaking in cases such as Arabic with very long ligatures. I'll keep a watch out for progress, and in the meantime stick with heuristics.

Sep 04 '20 13:09 raphlinus

The simplest shaping boundary to describe would be something like this:

shape : list Char → list Glyph

safe_split(As, Bs) :-
    shape(As ++ Bs) = shape(As) ++ shape(Bs).

A safe split point is where splitting the input string and shaping both halves separately produces the same output as shaping the entire string. This allows cheap line-breaking simply by splitting the output glyph array without reshaping any text, and line-breaking anywhere else only requires reshaping from the last safe split point up to the next safe split point.

Detecting safe split points could be done by a fairly simple algorithm during shaping: start by assuming every possible split point is safe and then disable the ones crossed by a ligature or a pairwise positioning adjustment like kerning or cursive connection or more subtly any that fall within the lookahead or backtrack of a successful contextual lookup. Complex scripts could impose additional constraints, for example the Indic shaper might disable all split points within a syllable.

This approach should be adequate for a client doing text fitting in response to resize or Knuth-style paragraph breaking, however I think a more powerful approach is needed if the client wants to mutate the text without reshaping the entire buffer.

The safe_split predicate described above treats the input as fixed and so it only needs to consider the lookups that were actually applied, but now we also need to consider the lookups that didn't apply but could have if the input was mutated:

safe_split_mutate(As, Bs) :-
    safe_split_mutate_after(As),
    safe_split_mutate_before(Bs).

safe_split_mutate_after(As) :- all [Suffix] safe_split(As, Suffix).

safe_split_mutate_before(Bs) :- all [Prefix] safe_split(Prefix, Bs).

(Please pardon the Prolog syntax, all [Var] Goal is universal quantification and Goal1, Goal2 is conjunction).

safe_split_mutate identifies points where you can safely split the input and shape the halves independently to get the same output even if the input is modified after splitting, i.e. you don't need to reshape across the split point. You can also check things like safe_split_mutate_after(Cs) to see if appending characters to the input might require reshaping, if that's an interesting property to know. It's a subset of safe_split because for example it would never allow a split after an 'f' as that would require reshaping across the split if an 'i' was inserted after it.

Actually implementing this predicate would be substantially more difficult; it would need to consider all of the possible lookups and not just the lookups that were actually applied to the input so far, which would probably require compiling them into a form that would make this process more efficient.

Looking at Sergey Malkin's explanation of how DirectWrite works it seems similar to safe_split in principle, except it's not quite clear to me from this description whether it's entirely accurate (no heuristics) or just accurate enough for the fonts that you would expect to encounter in practice. Also it offers this guarantee:

If x=[A|B|C] and y=[B|D] => [A|B|D] (which in turn is equal to [A]+[B]+[D]). This means than if tail of a string changed, we can shape only small part of a line before changed text and verify that this prefix didn't change.

This property doesn't hold for the safe_split predicate; it does hold for safe_split_mutate, but I think the DirectWrite approach is somewhere between the two:

safe_split_segs(As, Bs) :-
    safe_split(As, Bs),
    all [Suffix] safe_split_segs(Bs, Suffix) => safe_split_segs(As, Bs ++ Suffix),
    all [Prefix] safe_split_segs(Prefix, As) => safe_split_segs(Prefix ++ As, Bs).

If I've got this right it should say that we can split the input into two segments such that appending or prepending another segment will not invalidate the split, or to put it another way if you have segments [A|B|C|D] you can mutate A or D without reshaping across the break at [B|C], thus you only need to backtrack/lookahead at most one segment.

These three predicates form progressively stricter subsets of split points:

safe_split_mutate ⊆ safe_split_segs ⊆ safe_split

I'm guessing that safe_split_segs might well be a reasonable compromise between the three in terms of power vs. complexity, although safe_split is surely the simplest to implement and to test.

This handwave of an idea is a long way from a formal specification, but do you think it's heading in the right direction so far?

Sep 06 '20 12:09 mikeday

I had originally thought (and blogged) that I could use the HarfBuzz UNSAFE_TO_BREAK flag, but a careful read of that issue reveals that it is overly optimistic.

I will implement the proposed UNSAFE_TO_CONCAT bit, which matches what Sergey described.

Sep 07 '20 00:09 behdad

Michael: Your safe_to_split is what HarfBuzz calls !UNSAFE_TO_BREAK. The proposed !UNSAFE_TO_CONCAT is closer to your safe_split_mutate_after/before combined.

Sep 07 '20 00:09 behdad

Here is a quick example to clarify my thoughts: take a font that maps characters to glyphs on a one-for-one basis except for the "ffi" ligature and find the possible split points in the word "effort".

Since no ligatures are present in the word, safe_split will identify every split point as valid:

.e.f.f.o.r.t.

However safe_split_mutate will be more cautious around where the ligature could appear:

.e.ffo.r.t.

While safe_split_segs actually has two possible solutions:

.e.ff.o.r.t.
.e.f.fo.r.t.

From this example we can see that safe_split_mutate guarantees that modifying any segment will have no effect on the shaping of any other segment, while safe_split_segs offers the weaker guarantee that modifying any segment will at most affect adjacent segments.

Sep 07 '20 02:09 mikeday

Michael and Behdad both have the understanding of my original concern right, the ~~safe_split_mutate~~ [ETA: see below] predicate is a universal quantification, which is what we want. Actually for my purposes we have a pretty good idea what the prefixes and suffixes might be, they're substrings of the original paragraph, but it's not clear that buys you all that much here. (And even then, I can think of exceptions, like adding a hyphen character).

It would be an awesome problem to have this in both HarfBuzz and Allsorts, so be forced to choose on other criteria :)

Sep 07 '20 02:09 raphlinus

You cannot play us against each other like that! :laughing:

So you actually want stricter split point detection than what DirectWrite offers? Is that motivated by efficiency or convenience or theoretical purity?

Sep 07 '20 03:09 mikeday

So you actually want stricter split point detection than what DirectWrite offers? Is that motivated by efficiency or convenience or theoretical purity?

No, now that I'm rereading the issue more closely, I think the DirectWrite definition is a very close fit for what we need. Just as a note, we're not planning to support incremental reshaping after text mutation in our API, just rewrap, based on our evaluation that the latter will happen a lot more frequently and in bulk than the former, so it's possible DirectWrite is even more careful than we need to be. But on the other hand, potential reshaping after adding a hyphen is very similar to the guarantee needed for general mutation, so I don't immediately see any useful way to weaken the predicate further, providing an intermediate between safe_split_segs and safe_split.

I was typing the following when I saw your last comment:

Because Michael framed the problem in such a delightful way, I couldn't help but try to solve the implied logic puzzle of coming up with concrete examples where the ⊆ is proper. I think we all understand safe_split_segs ⊆ safe_split as that's addressed in the original issue, but safe_split_mutate ⊆ safe_split_segs is a little trickier. I agree that safe_split_segs captures the numbered points in Sergey's comment, though the "when unchanged between calls, any text on either side doesn't affect shaping on another side" might be interpreted as saying something stronger.

If the font has a "BD" and an "ABD" ligature but no others, then I believe safe_split_segs("A", "BC") holds (which I believe is the same as saying [A|BC] in Sergey's notation), but safe_split_mut("A", "BC") doesn't, because safe_split_mutate_after("A") is violated by ("A", "BD").

The interesting thing is that in the absence of the "BD" ligature, the shaping engine has two different choices, both of which are fine as long as the choice is made consistently. Either it could falsify safe_split_segs("B", "D") and assert safe_split_segs("A", "B"), which is ok because the implication safe_split_segs("B", "D") => safe_split_segs("A", "BD") is false => false; or it could assert safe_split_segs("B, D") and falsify safe_split_segs("A", "B"). I think in both cases safe_split_segs("A", "BC") is still ok.

Now that I'm thinking about it, I can start to see how the latter path is desirable for a shaping engine, as it would naturally fall out of a trie lookup for ligatures. But I'm not going to go too deep down this road.

For anyone who implements this flag, I have three words of advice: property testing. (The middle word might not render, but trust me, it's there)

Sep 07 '20 03:09 raphlinus

As I posted in the chat last week: Screenshot from 2020-09-07 13-38-32

Sep 07 '20 03:09 mikeday

Michael and Behdad both have the understanding of my original concern right, the ~safe_split_mutate~ [ETA: see below] predicate is a universal quantification, which is what we want. Actually for my purposes we have a pretty good idea what the prefixes and suffixes might be, they're substrings of the original paragraph, but it's not clear that buys you all that much here. (And even then, I can think of exceptions, like adding a hyphen character).

Still reading your comments further down. But having the universal predicate (ie. safe_split_mutate) is more useful for line-breaking German etc where you end up adding different segment at break point...

Sep 07 '20 15:09 behdad

But on the other hand, potential reshaping after adding a hyphen is very similar to the guarantee needed for general mutation, so I don't immediately see any useful way to weaken the predicate further, providing an intermediate between safe_split_segs and safe_split.

Yep. My stance as well.

Sep 07 '20 15:09 behdad

Here's my analysis, which is really just an interpretation of what Sergey said. I think safe_split_segs is a useful predicate even in the German case. What happens is that the client of the library has to make a heuristic guess where the splice point will be. This affects performance but not correctness. If it doesn't move far enough back, then the predicate will come back showing no viable boundaries, and it will have to retry reaching farther back in the string. If it moves too far back, then it's fine, but it has also wasted work reshaping more of the string than it needed to.

So it exports some of the complexity to the client. The question is whether this a useful tradeoff. I believe the answer is yes, because the full safe_split_mutate would (a) likely require more pre-analysis of the font, and (b) report (maybe a lot) fewer shaping boundaries. That's in addition to the implementation complexity, which is a serious concern.

I personally am fine with this if the tradeoff indicates it's worthwhile. I can see pretty clearly how to do a good heuristic, and we'll need to do some similar analysis of the text in any case, for example for automatic hyphenation. In the worst case, we can apply machine learning to the task. (This is a bit of a joke, but maybe some really simple Bayesian model trained from real data is not)

It wasn't asked for, but here's my strong advice: start this work by doing the property testing framework. In the case of HarfBuzz, focus in especially on the cases where UNSAFE_TO_BREAK fails the property tests for safe_split_segs and safe_split_mutate. (Also, verify that it meets the test for safe_split as expected, though I wouldn't consider fixing any problems there a priority; the main thing is to learn something). Gather some evidence, or at least intuition, about how much stronger a predicate safe_split_mutate is than safe_split_segs. Obviously run this on some challenging fonts, such as Nastaliq, the various southeast Asian fonts, and Numderline.

I actually find this problem motivating, even though it's complex and arguably only makes a substantial difference in fairly obscure edge cases. I look forward to blogs explaining how the magic works. But it's also not urgent, we'll certainly get some stuff in that works most of the time but doesn't perfectly handle shaping with spaces.

Sep 07 '20 17:09 raphlinus

I've given this a try in https://github.com/harfbuzz/harfbuzz/pull/3297 now.

Nov 16 '21 23:11 behdad