dosbox-staging Support for UTF-8 locale (2nd generation patch)

From now on, *.lng files are treated as UTF-8 encoded.

Since probably no current Unicode engine handles all the FreeDOS code pages (and handles them well), I had to introduce my own solution. Currently all FreeDOS code pages are supported, though some have small gaps (due to characters missing from the Unicode standard, or due to DOS code page characters I could not identify). Modifying the definitions/rules should be easy; see the files:

https://github.com/dosbox-staging/dosbox-staging/blob/fc/utf8-locale-2/contrib/resources/mapping/MAIN.TXT
https://github.com/dosbox-staging/dosbox-staging/blob/fc/utf8-locale-2/contrib/resources/mapping/ASCII.TXT

To see the benefits of the engine, switch the language to Polish (language=pl in the configuration file), and try the following commands:

keyb pl 667
keyb pl 668
keyb lv 775
keyb uk 437

667 and 668 are mutually incompatible code pages for the Polish language. 775 is a Baltic Rim code page: not designed for Polish at all, but by chance it contains all (or almost all) of the national characters needed. 437 is a US code page and contains almost no national characters. After each of these commands, you'll see that the resulting message is printed using the newly set code page; for code page 437, we also have a reasonable fallback for each glyph.
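The fallback behaviour described above can be illustrated with a minimal C++ sketch. This is not the engine's actual code: `decode_utf8`, `to_code_page`, and both tables are hypothetical names invented here, the tables are cut down to a handful of entries, and the decoder assumes well-formed input. The only code page fact used is that 437 contains ó at 0xA2; the other Polish letters fall back to plain ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, heavily reduced tables: Unicode code point -> guest byte.
using CodePageMap = std::map<char32_t, uint8_t>;

// Code page 437 has a native glyph for U+00F3 (o with acute) at 0xA2.
const CodePageMap cp437_subset = {{0x00F3, 0xA2}};

// ASCII approximations for letters that 437 lacks (illustrative subset).
const CodePageMap ascii_fallback = {{0x017C, 'z'}, {0x0142, 'l'}, {0x0107, 'c'}};

// Minimal UTF-8 decoder; assumes well-formed input and does no validation.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const auto lead = static_cast<unsigned char>(in[i]);
        const std::size_t len = (lead < 0x80) ? 1 : (lead < 0xE0) ? 2 : (lead < 0xF0) ? 3 : 4;
        // Keep only the payload bits of the lead byte
        char32_t cp = (len == 1) ? lead : static_cast<char32_t>(lead & (0xFF >> (len + 1)));
        // Each continuation byte contributes 6 more bits
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Convert UTF-8 text to guest bytes: native glyph first, then the
// fallback table, and '?' when neither table knows the character.
std::string to_code_page(const std::string& utf8,
                         const CodePageMap& page,
                         const CodePageMap& fallback)
{
    std::string out;
    for (const char32_t cp : decode_utf8(utf8)) {
        const auto native = page.find(cp);
        const auto approx = fallback.find(cp);
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp)); // ASCII passes through
        } else if (native != page.end()) {
            out.push_back(static_cast<char>(native->second));
        } else if (approx != fallback.end()) {
            out.push_back(static_cast<char>(approx->second));
        } else {
            out.push_back('?');
        }
    }
    return out;
}
```

With these tables, the UTF-8 text "żółć" comes out as the bytes `z`, 0xA2, `l`, `c`: ó keeps its native 437 glyph, while the remaining letters are approximated.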

Screenshot

Currently the conversion is one-way (from UTF-8 to the guest-side code page), but it should be easy to extend the engine to convert characters the opposite way (a commented-out example already exists in the code). Possible future uses of this code I can think of:

handling national characters in file names, with host OS using UTF-8
there were some discussions about TTF output
better handling of physical mouse names returned by ManyMouse library (it returns names in UTF-8)
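As a rough idea of what the opposite direction could involve, here is a minimal C++ sketch that inverts a forward table and re-encodes guest bytes as UTF-8. It is an assumption-laden illustration, not the commented-out code mentioned above: `invert`, `append_utf8`, and `to_utf8` are hypothetical names, and unmapped bytes are replaced with U+FFFD as an arbitrary choice.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using ForwardMap = std::map<char32_t, uint8_t>; // code point -> guest byte
using ReverseMap = std::map<uint8_t, char32_t>; // guest byte -> code point

// Build the reverse table. emplace() keeps the first entry it sees, so if
// several code points share a byte, the lowest code point wins.
ReverseMap invert(const ForwardMap& forward)
{
    ReverseMap reverse;
    for (const auto& entry : forward) {
        reverse.emplace(entry.second, entry.first);
    }
    return reverse;
}

// Append one code point to 'out' as UTF-8 (1 to 4 bytes).
void append_utf8(std::string& out, const char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert guest-side text back to UTF-8; bytes missing from the table
// become U+FFFD (replacement character).
std::string to_utf8(const std::string& dos_text, const ReverseMap& reverse)
{
    std::string out;
    for (const char c : dos_text) {
        const auto byte = static_cast<uint8_t>(c);
        if (byte < 0x80) {
            out.push_back(c); // ASCII passes through
            continue;
        }
        const auto it = reverse.find(byte);
        append_utf8(out, it != reverse.end() ? it->second : static_cast<char32_t>(0xFFFD));
    }
    return out;
}
```

For file names, a lossy fallback (several characters approximated by the same ASCII letter) could not be inverted this way; only the native, one-to-one part of a code page round-trips.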

Aug 25 '22 19:08 FeralChild64

Was able to hit an assert (that's a good thing!), and just wanted to report it. Running dosbox -lang pl gives:

../../src/misc/unicode.cpp:944: bool construct_mapping(const uint16_t): 
Assertion `!config_mappings.count(code_page)' failed.

Aug 31 '22 18:08 kcgen

@kcgen That was a mistake in the assert; the condition should be exactly the opposite. To construct the UTF-8 mapping for a given code page, we actually expect the appropriate configuration to already exist.

Fixed, thank you for noticing.
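For readers following along, the inverted precondition could look roughly like the sketch below. Only the function signature and the `config_mappings` name come from the assertion message above; the container types, the `mappings` table, and the function body are assumptions made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-ins for the real data structures in unicode.cpp.
using ConfigMapping = std::map<uint8_t, char32_t>; // guest byte -> code point
using Utf8Mapping   = std::map<char32_t, uint8_t>; // code point -> guest byte

static std::map<uint16_t, ConfigMapping> config_mappings; // parsed definitions
static std::map<uint16_t, Utf8Mapping> mappings;          // constructed lookups

bool construct_mapping(const uint16_t code_page)
{
    // Corrected precondition: the configuration for this code page must
    // already exist (the failing assert had the condition negated).
    assert(config_mappings.count(code_page));

    // Illustrative body: build the lookup table from the configuration.
    Utf8Mapping& target = mappings[code_page];
    for (const auto& entry : config_mappings.at(code_page)) {
        target.emplace(entry.second, entry.first);
    }
    return !target.empty();
}
```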

Aug 31 '22 19:08 FeralChild64

Converted to draft - it seems we need to handle normalization issues, most likely in the encode.sh script (result of discussion on Discord, comment by sherm_p).

Sep 01 '22 19:09 FeralChild64

I would like to know more about the impact on software compatibility. For example, can FreeDOS FreeCOM still be run?

Sep 05 '22 07:09 Grounded0

@Grounded0: There should be no impact on software compatibility. FreeCOM should run as before.

Just note that only DOSBox internal tools can profit from the UTF-8 locale; any guest code is on its own. AFAIK, FreeCOM has its locale selected at compile time.

Sep 07 '22 09:09 FeralChild64

That's good enough for me.

Sep 07 '22 09:09 Grounded0

This is an amazing addition, @FeralChild64! The parser is very robust and flexible, and I have no doubt the community can help maintain the suite of code pages, being able to check for updates.

Sep 20 '22 19:09 kcgen

Sanitizer jobs are all passing!

Tentatively approved.

Given the size and complexity, I would appreciate it if @johnnovak and @Wengier could take a review pass, as both have more experience with internationalization details than I do.

Sep 20 '22 19:09 kcgen

@kcgen Thank you. Fortunately, not much is left regarding support for the concrete code pages; I'm more concerned about small mistakes I might have made. Hopefully we will get TTF output in the future, so that such mistakes will be easy to spot by comparing ASCII tables displayed by DOS software.

Sep 20 '22 20:09 FeralChild64

Many thanks to @FeralChild64 for the code page parser! It can certainly be described as very robust and useful for parsing code pages. Just a few small suggestions.

Sep 21 '22 02:09 Wengier

Looks great, @FeralChild64 - all comments addressed from prior review cycles (all reviewers included). It's passing everything I can throw at it with ASAN and UBSAN.

Let's get this merged and put it to use.

First big addition to 0.80!

Oct 02 '22 17:10 kcgen
