OpenTTD
OpenTTD copied to clipboard
Replace ICU ParagraphLayout with something else?
OpenTTD currently uses the ICU ParagraphLayout API for doing internationalized wordwrapping. This usage is a problematic, since ICU deprecated and removed their layout API for being buggy and unmaintained. The ParagraphLayout API that OpenTTD uses is still present in ICU, but it relies on the removed Layout API. This gap can be be filled using the icu-le-hb library, which implements the layout API on top of the Harfbuzz library.
However, it seems icu-le-hb is more of a proof of concept and not really maintained either. Additionally, because ParagraphLayout is in ICU and icu-le-hb depends on ICU, this gives an undesirable circular dependency, which is problematic for at least Debian, since it complicates updating to new ICU versions,.
See https://bugs.debian.org/894159 for more info on this, in particular https://bugs.debian.org/894159#45 and onwards. The Debian version of this issue is https://bugs.debian.org/897233
It seems it would be beneficial to move away from the ICU ParagraphLayout API, so it can be removed from Debian (OpenTTD is the only user in Debian), but also because it has indeed proven to be buggy in the past (OpenTTD talks about ICU-related crashes, though this was before the icu-le-hb implementation).
In the above linked bug, the following was written:
Another solution is of course to disable ParagraphLayout. László also asked if OpenTTD, being the only user of this API, could migrate to another solution. I've discussed this with OpenTTD upstream yesterday, and they were already aware of the layout API removal and have been casually looking at Harfbuzz and Pango as a replacement, but they do not see an easy solution yet. ParagraphLayout seems to fit their usecase quite neatly: they need internationalized word-wrapping of text (e.g. also supporting right-to-left locales). Harfbuzz does not seem to offer that, and Pango seems heavy-handed (and might not be easy to adapt to OpenTTD's SDL renderer, and might not be portable enough). This is bad to read. I had the hope there's an easy solution and/or the replacement might be already in the making.
Neither me or upstream has much experience in this field, perhaps you have a different suggestion for an alternative? I don't know any other alternative. Only OpenTTD uses the Paragraph Layout API and it makes me wonder what other solution the other projects use? I may think other games like Lincity-NG[3] also need internationalized text placement and/or LibreOffice still need to handle this as well. Do these have an alternative solution?
Since the above was written, the ICU-in-Debian folks have asked about any progress on the OpenTTD side of things, since they really like us to remove this dependency from OpenTTD.
This has been discussed on IRC on a few occasions. Some observations:
- The ICU layout stuff was removed in ICU 58.
- Pango does layouting, but does not handle sorting, collation, etc. So when using Pango for layout, we'd need ICU anyway. Pango is fairly heavy-handed, depending on (I think) glib and gdk rendering as well.
- IIRC Harfbuzz handles rendering of glyphs and lines, but not directionality detection and multi-line layouting.
- icu-le-hb seems to be mostly unmaintained, and might be meant more of an example implementation than a production library. I believe it is mostly an API translation layer from HB to ICU Layout, without (much) code that actually does something.
- ICU seems to be generally buggy too, so if we can stop depending on (parts of) it, that could be a feature. It seems we'll still need it for collating, but it could be that the ParagraphLayout code is the most buggy part (wishful thinking)?
- Vendoring icu-le-hb in OpenTTD does not seem to work, since we depend on ICU ParagraphLayout which depends on ICU Layout (now provided by icu-le-hb). Vendoring both icu-le-hb and the ParagraphLayout code from ICU could solve this.
- Taking the code from icu-le-hb and ICU ParagraphLayout and creating a new library from that might be an interesting approach. The code can then probably be cleaned up from there, since the ICU Layout API is no longer required, and can be removed internally. It might also not be needed to preserve the ParagraphLayout API exactly, if we feel that should be cleaned up (to be closer to the HB API for example), we can also do that.
- It might be interesting to see what other software does (e.g. LibreOffice can probably do this thing as well). Does everyone use Pango? Or do they implement their own word-wrapping on top of HB layouting?
Did anyone make any progress on this subject in the last few months? @LordAro, @TrueBrain?
I would strongly considering the option of "just winging it". Of all the languages OpenTTD is currently translated into, all except three use regular ASCII spaces to separate words and indicate acceptable line break points.
The exceptions are the two Chinese langauges, and Japanese, and for those three you can more or less insert line breaks at any point. A slightly more correct algorithm would avoid inserting breaks before punctuation characters and after opening brackets, but I think those can be looked up in the Unicode database which should be included in ICU.
The full Unicode algorithm is rather long. Implementing it in full should be outside the scope of OpenTTD, but implementing a heavily simplified version might be okay?
IIRC the ASCII-space approach is already implemented when compiling without ICU, so adapting that with a minimal approach for non-ASCII locales might be the easiest way to implement your suggestion. OTOH, in this case we want ICU support, but not ICU-layout support, so that might require more invasive changes. Hm.
Also, I'm not entirely sure if word-breaking is the only (or hardest) part of ICU-layout that needs to be replaced.
I guess more investigation in the current code is needed, but I can't really justify the time needed for that...
OTOH, in this case we want ICU support, but not ICU-layout support, so that might require more invasive changes.
Just had a look at the code, it seems that there's already a split between WITH_ICU_SORT
and WITH_ICU_LAYOUT
. But without ICU layout, the FallbackParagraphLayout
class is used, which apparently supports no RTL layouting either. It does already do space-based line breaking (in NextLine
) and in absence of spaces breaks whenever the line is full (which would suffice for your "you you can more or less insert line breaks at any point" proposal). It already seems to support font-switching as well (e.g. drawing a line consisting of multiple "runs" with different fonts), which seems to be handled externally to the layouter and passed on the constructor through runs
: https://github.com/OpenTTD/OpenTTD/blob/9cf999bca7d6e1b53ffaf62c72a0abb65aa0211c/src/gfx_layout.cpp#L466
If we take this route, I wonder if we still need Harfbuzz? These (ancient) Harfbuzz docs suggest Harfbuzz only does shaping of text (given a run of text, a font, language, etc. decide what glyphs to use and how to position them).
I believe the shaping in the fallback layouter is fairly simple and consists of simply summing the character widths (and I suspect the glyph selection is also a simple direct mapping of Unicode code points to glyphs in the font). I'm not sure if this would be sufficient or would really need something like Harfbuzz (which also poses the question how this interacts with the fonts we use, which I suspect are custom OpenTTD bitmap fonts, or do we use freetype stuff?).
Shaping of text is important for various languages/writing systems, most visible for Arabic, but also some indian and asian scripts. Also, proper handling of mixed RTL/LTR content (e.g. English NewGRF texts embedded in arabic) is somewhat complicated.
Aditionally, even for latin-based scripts, there are some things which need combining diacritical glyphs.
For the Windows Uniscribe implementation I took the approach to handle the Bitmap font with the fallback layouter, as our bitmap font doesn't include anything that needs complex glyph shaping. If you select a language that needs more glyphs, we try to switch to a FreeType font anyway.
I don't think anyone will work on a replacement as long as the current one is in place.
So, for debian I recommend to configure with "--without-icu-layout". That way we may generate some interest/demand in actually supporting complicated languages.
I tried running without ICU layout and compared the results with the language set to Arabic. I was actually expecting the layout and word wrapping to become worse, but it actually seems the entire glyph selection changed as well. E.g., this is the settings screen without ICU layout:
And here's the settings screen with ICU layout:
For easy comparison, here's a closeup. Without:
With:
I don't read Arabic, so it looks to me like the text is completely different (and if it is correct with ICU layout, it would be likely that it is wrong without ICU layout). Perhaps the text is essentially the same, but some characters are intended to kerned or otherwise combined, changing their appearance?
I suspect that we are now no longer using ICU for any layouting, while I believe we could still use ICU for glyph selection and maybe some other stuff too?
Any Arabic speakers reading along here that can tell me how the text in the two versions compares? Is the second one incorrect, or just a bit less elegant? Does it even say the same thing?
Another thing is that on startup, OpenTTD now shows an error message saying there is no support for RTL-languages (which might not be entirely correct, there is at least some support) and suggesting to compile with ICU enabled (which might be counterproductive).
Summarizing: I'm not quite ready to drop ICU layout support for the Debian packages just yet, this needs a bit more investigation (but I am intending to drop it, as suggested by Frosch to perhaps force some progress, or at least make the life of the ICU Debian maintainers a bit easier).
The first image is completely broken (the text is left-to-right instead of right-to-left, and the glyphs has the wrong contextual shapes), the second one is mostly correct but the mixed Arabic and English text seems to be incorrectly re-ordered (the order of the Arabic and English text makes no sense, which likely means bidirectional text is incorrectly set up).
This is surprising, since the ICU layout version should be the correct, which makes me wonder if the source text is correct (can you kink to it) or whether some thing is broken with ICU layout or how it is used.
This is surprising, since the ICU layout version should be the correct, which makes me wonder if the source text is correct (can you kink to it) or whether some thing is broken with ICU layout or how it is used.
Oh, I swapped the images, w00ps. I just updated thet comment to fix the labeling (the images are still in the same order, so your comment still applies).
So indeed, disabling ICU layout completely breaks RTL languages, then. That's a bit more severe than I had hoped, but perhaps fixing glyph selection / shaping and adding RTL detection would not be too hard.
I also compared the Chinese (Traditional) language, and as @nielsmh predicted that only has some small word-wrapping differences, but otherwise looks identical.
Looking at the ICU API, it seems it still offers some useful building blocks that could be used to implement the missing piecies:
- BreakIterator for finding linebreak positions and character separation points (for editing).
- Bidirectional algorithm, which is an implementation of the Unicode bidi algorithm, figuring out what characters are LTR and RTL.
- Shaping for Arabic text. I wonder if other languages also need shaping?
An alternative for shaping could be to use Harfbuzz, which AFAIU supports (only) shaping. It also seems to suggest shaping is needed for more than arabic (also syriac, indic, thai, ...), but perhaps that's what ICU means with "arabic languages"?
I also checked Hebrew, the other RTL language supported by OpenTTD now, which also has the character order reversed (but the glyphs seem to be the same, so it probably requires no shaping).
Arabic languages are quite specifically the languages using the same family of writing systems. There's some differences in orthography and alphabet but I believe they are all supported by a single shaping engine and Unicode block. The ICU u_shapeArabic
function is not really suited for GUI applications, what it does is basically replace "natural" Unicode code points in the text with "compatibility" code points representing pre-shaped Arabic text. As they write in the documentation linked, it's mainly intended for character terminals and similar where a full Unicode text shaping system isn't practical, but Arabic text is still required.
Indian languages like Devanagari (easily identified by their "hanging baseline") are a separate beast, and while they don't require the bi-directional algorithm, they instead require different, complex reordering of glyphs. They depend on tables in the font data for this, and for combining multiple code points into single glyphs. (The latter should be very similar to OpenType automatic ligatures for Latin alphabet text.)
The conclusion must be, if only ICU is available then u_shapeArabic
could be used for Arabic, but if a full Harfbuzz-based shaping engine is available then that should be used instead. The game should probably prevent the user from choosing a UI language without layout support too.
So, it sounds like adding harfbuzz as a dependency for shaping makes sense. We'll need something else for the bidirectionality detection then - The Harfbuzz docs suggest using ICU or fribidi. For linebreaking, the Harfbuzz docs also suggest using the ICU break iterator API.
I'm about to upload a version to Debian that compiles without the ICU layout API. After that, I probably won't find the time to actually implement Harfbuzz in OpenTTD. Any takers for this project?
I haven’t checked what ICU Layout does exactly, but ICU + HarfBuzz should have all the API one needs to replicate its functionality, so I think it isn’t a bad idea for OpenTTD to just copy it, replace the uses of the deprecated API with direct HarfBuzz calls and always use it. ICU shaping is limited and supports only a subset of Arabic and in general is of limited use, HarfBuzz should be used instead.
I had my concerns about Harfbuzz (& Pango) previously, as they pull in a lot of extra dependencies via glib. However, now that ICU is Linux-only (well, not Windows or OSX) anyway, it's probably less of an issue
Should we perhaps flag UI languages that require shaping in a way that prevents them from being used in builds that lack complex text support?
An additional issue, tangential to this, is: What about user-supplied text, what is the primary reading direction for that? Things such as company names, signs, and chat messages. Just assume the UI language directionality, or somehow extend user-supplied strings with input language used? For instance, a German player downloading a savegame from a Saudi-Arabian player, that might cause some labels to render wrong.
I had my concerns about Harfbuzz (& Pango) previously, as they pull in a lot of extra dependencies via glib.
HarfBuzz does not require GLib (Pango does), it has it an optional dependency for Unicode functions, alternative dependency for this is ICU so projects using ICU naturally use it for Unicode functions in HarfBuzz.
https://packages.debian.org/stretch/libharfbuzz0b
Doesn't seem to be what Debian (or Arch) thinks...
What about user-supplied text, what is the primary reading direction for that?
AFAIU, the direction can be derived from the individual characters. Currently, mixing RTL and LTR text in a single string is, AFAIK already used (for things like "OpenTTD" or "readme.txt" inside a translated string, see the screenshots a few comments back) and supported (using ICU, possibly through ICU layout to figure out what parts are RTL and what parts are LTR). So, that should work out, I believe.
@LordAro, I suspect you can choose to compile HB with or without glib integration. Debian uses with glib, but since it is easy to install it, there's only a bigger install footprint for using HB on Debian, but if you need to compile everything yourself, you can just leave it out.
This issue has been automatically marked as stale because it has not had any activity in the last two months. If you believe the issue is still relevant, please test on the latest nightly and report back. It will be closed if no further activity occurs within 7 days. Thank you for your contributions.
Although we all agree this needs a solution, it is not happening for 1.9. Another attempt in 1.10?
Sorry, I hope this is not too much of an issue :(
Aaand bumping again! Turns out rewriting the text rendering engine is quite a lot of work
I don't know if it helps since we don't understand BiDi that well, but for OpenRCT2 we used FriBidi to reorder Arabic. This was implemented in https://github.com/OpenRCT2/OpenRCT2/pull/12837.
oh hi
it seems icu-le-hb is more of a proof of concept and not really maintained either
or it's just done and there's not much more to do with it.
Additionally, because ParagraphLayout is in ICU and icu-le-hb depends on ICU, this gives an undesirable circular dependency, which is problematic for at least Debian
It's painful for everyone. I have https://unicode-org.atlassian.net/browse/ICU-6829 filed with the at-that-time ideas. But I'm no longer funded for ICU. Pull requests welcome if anyone wants to help out.