
Deprecate `ucstring` in favour of using UTF-8 everywhere

Open ryzom-pipeline opened this issue 6 years ago • 9 comments

Original report by Jan Boon (Bitbucket: [Jan Boon](https://bitbucket.org/Jan Boon), ).


There's no advantage in using ucstring, since the 2-byte format is insufficient for the full Unicode space.

It’s easier to deal with just using UTF-8 everywhere.

ryzom-pipeline avatar May 01 '19 10:05 ryzom-pipeline

Original comment by Cédric Ochs (Bitbucket: [Cédric OCHS](https://bitbucket.org/Cédric OCHS), ).


Sure, it needs 4 bytes to support the full set of Unicode glyphs, but most of them (Asian, Latin, Arabic, etc.) fit in 2 bytes.

And I see advantages to using ucstring instead of UTF-8 ;p when you need to manipulate the text data. For example, take a string that is 3 bytes in UTF-8, such as “Hé”: if you keep only the first 2 bytes, the resulting UTF-8 string is invalid. Another case I see is that the length of the text is wrong: in UTF-8 this string has a length of 3 (bytes), while it is 2 characters as a ucstring. It’s not for nothing that Qt, MFC, etc. all use 2-byte or 4-byte strings internally.

QString stores a string of 16-bit QChars, where each QChar corresponds to one UTF-16 code unit. (Unicode characters with code values above 65535 are stored using surrogate pairs, i.e., two consecutive QChars.)
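The “Hé” truncation case can be reproduced with a few lines of C++. This is a minimal sketch, not code from ryzomcore: `isStructurallyValidUtf8` and `safePrefixLength` are hypothetical helper names, and the validity check is deliberately simplified (it does not reject overlong encodings or encoded surrogates).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: checks only that every lead byte is followed by the
// expected number of continuation bytes (10xxxxxx).
bool isStructurallyValidUtf8(const std::string &s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80) extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return false; // stray continuation byte or invalid lead byte
        if (extra > s.size() - i - 1) return false; // sequence cut short
        for (std::size_t k = 1; k <= extra; ++k)
        {
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: largest prefix length <= maxBytes that does not cut
// through the middle of a multi-byte sequence.
std::size_t safePrefixLength(const std::string &s, std::size_t maxBytes)
{
    if (maxBytes >= s.size()) return s.size();
    std::size_t n = maxBytes;
    // Back up while the first byte after the cut is a continuation byte.
    while (n > 0 && (static_cast<unsigned char>(s[n]) & 0xC0) == 0x80) --n;
    return n;
}
```

With `std::string s = "H\xC3\xA9";` (the UTF-8 bytes of “Hé”), `s.size()` is 3, `s.substr(0, 2)` fails the check, and `safePrefixLength(s, 2)` backs up to 1 so that only “H” is kept.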

ryzom-pipeline avatar May 01 '19 10:05 ryzom-pipeline

Original comment by Jan Boon (Bitbucket: [Jan Boon](https://bitbucket.org/Jan Boon), ).


Yes, UTF-16 still has surrogate pairs, so there’s no advantage vs UTF-8.

The same issue of counting the number of glyphs occurs with UTF-16 when characters have 4 bytes. The length of the text will also be wrong.

Qt and Windows use UTF-16 for historical reasons, since UTF-32 did not yet exist back then. Neither will give the glyph count; they will just give the data length in 2-byte units.

It’s easy to implement a glyph count and text operations for UTF-8 where necessary.
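As a rough illustration of such a count, here is a minimal sketch. The name `countCodePoints` is made up for this example, and it assumes the input is already valid UTF-8. It counts code points, which is not always the same as visible glyphs (combining marks, for instance, are separate code points).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// counting every byte that is not a continuation byte (10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t countCodePoints(const std::string &utf8)
{
    std::size_t count = 0;
    for (const char ch : utf8)
    {
        const unsigned char c = static_cast<unsigned char>(ch);
        if ((c & 0xC0) != 0x80) ++count; // lead byte or ASCII byte
    }
    return count;
}
```

For `"H\xC3\xA9"` (“Hé”) this gives 2 even though the string is 3 bytes long, and a 4-byte sequence such as `"\xF0\x9F\x98\x80"` (U+1F600) counts as 1.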

See http://utf8everywhere.org/.

ryzom-pipeline avatar May 01 '19 19:05 ryzom-pipeline

Original comment by Cédric Ochs (Bitbucket: [Cédric OCHS](https://bitbucket.org/Cédric OCHS), ).


Thanks for the link, it makes sense :slight_smile: You’ll notice that wchar_t is 4 bytes long under UNIX while it’s 2 bytes under Windows ;p Apparently Nevrax chose to use a 2-byte ucchar because Windows was using one (for historical reasons, as mentioned in your link).

ryzom-pipeline avatar May 02 '19 08:05 ryzom-pipeline

Original comment by Jan Boon (Bitbucket: [Jan Boon](https://bitbucket.org/Jan Boon), ).


By the way, in Windows 10 it’s now possible to set your locale character set to UTF-8. Hopefully they might push that through as a default, or more prominent option, at some point.

When that’s the case GetACP() will return CP_UTF8, and the “ANSI” Win32 functions will accept UTF-8 directly (and then conversions to “widechar” are no longer necessary on systems which have it enabled).
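A check along these lines could look as follows. This is only a sketch: `ansiCodePageIsUtf8` is a hypothetical helper name, and returning `false` on non-Windows builds is an arbitrary choice for this example, since there is no ANSI code page to query there.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: reports whether the "ANSI" Win32 functions currently
// interpret narrow strings as UTF-8.
bool ansiCodePageIsUtf8()
{
#ifdef _WIN32
    return GetACP() == CP_UTF8; // CP_UTF8 is code page 65001
#else
    return false; // assumption for this sketch: no ANSI code page outside Windows
#endif
}
```

Code that currently converts to wide strings before calling Win32 could use a check like this to decide whether the conversion can be skipped.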

ryzom-pipeline avatar May 02 '19 21:05 ryzom-pipeline

Original comment by Cédric Ochs (Bitbucket: [Cédric OCHS](https://bitbucket.org/Cédric OCHS), ).


Yes, I noticed that too :slight_smile: That’s good news :)

ryzom-pipeline avatar May 03 '19 07:05 ryzom-pipeline

Original comment by Jan Boon (Bitbucket: [Jan Boon](https://bitbucket.org/Jan Boon), ).


It seems the ucstring conversion functions are UCS-2 and do not actually support UTF-16 (so ucstring is a lossy Unicode representation).
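To make the UCS-2 versus UTF-16 difference concrete, here is a minimal sketch. It is not the ucstring code, and `encodeUtf16` and `fitsInUcs2` are hypothetical names. UTF-16 encodes code points above U+FFFF as a surrogate pair, while UCS-2 has no such mechanism, so those code points cannot be stored at all.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes one code point as UTF-16 code units.
// Code points above U+FFFF become a high/low surrogate pair.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF); // outside the Unicode range otherwise
    if (cp < 0x10000)
    {
        return { static_cast<std::uint16_t>(cp) };
    }
    cp -= 0x10000; // 20 bits remain, split 10/10 across the pair
    return {
        static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
        static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))
    };
}

// Hypothetical helper: true if the code point is representable in UCS-2,
// i.e. it lies in the Basic Multilingual Plane and is not a surrogate.
bool fitsInUcs2(std::uint32_t cp)
{
    return cp < 0x10000 && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

U+1F600, for example, encodes to the two units `0xD83D 0xDE00` in UTF-16, while `fitsInUcs2(0x1F600)` is false: a UCS-2 conversion has to drop or replace it.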

ryzom-pipeline avatar May 09 '19 01:05 ryzom-pipeline

Original comment by Cédric Ochs (Bitbucket: [Cédric OCHS](https://bitbucket.org/Cédric OCHS), ).


Yes :disappointed: For the serialization, it seems to work, because we tested it on PowerPC Mac some years ago :)

ryzom-pipeline avatar May 09 '19 07:05 ryzom-pipeline

Original comment by Jan Boon (Bitbucket: [Jan Boon](https://bitbucket.org/Jan Boon), ).


There's no advantage in using ucstring, since the 2-byte format is insufficient for the full Unicode space.

It’s easier to deal with just using UTF-8 everywhere.

ryzom-pipeline avatar May 14 '19 06:05 ryzom-pipeline

Add

  • CUtfStringView: Reference to a UTF-8 or UTF-32 string, providing an iterator that outputs 32-bit code points. Use when processing text for rendering, so UTF-8 can be used directly.
  • u32String: UTF-32 string for editable text.
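The idea of reading 32-bit code points straight out of UTF-8 can be sketched as below. This is not the CUtfStringView interface: `decodeUtf8` is a hypothetical free function, it builds a whole `std::u32string` instead of iterating lazily, and it substitutes U+FFFD for malformed input as a simplifying assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into 32-bit code points.
// Malformed bytes are replaced with U+FFFD (simplified error handling;
// overlong forms and encoded surrogates are not rejected).
std::u32string decodeUtf8(const std::string &utf8)
{
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t extra;
        if (c < 0x80) { cp = c; extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else { out.push_back(replacement); ++i; continue; }

        // Sequence runs past the end of the string.
        if (extra > utf8.size() - i - 1) { out.push_back(replacement); break; }

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char cc = static_cast<unsigned char>(utf8[i + k]);
            if ((cc & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (cc & 0x3F); // append 6 payload bits
        }
        if (!ok) { out.push_back(replacement); ++i; continue; }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Decoding `"H\xC3\xA9"` yields the two code points U+0048 and U+00E9, which is the form a renderer needs when looking up glyphs.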

kaetemi avatar Oct 24 '20 13:10 kaetemi

This is done, pending changes to network.

kaetemi avatar Feb 20 '23 23:02 kaetemi