Deprecate `ucstring` in favour of using UTF-8 everywhere
Original report by Jan Boon (Bitbucket: Jan Boon).
There's no advantage in using ucstring, since the 2-byte format is insufficient for the full Unicode space.
It’s easier to just use UTF-8 everywhere.
Original comment by Cédric Ochs (Bitbucket: Cédric OCHS).
Sure, it needs 4 bytes to cover the full range of Unicode glyphs, but most scripts (Asian, Latin, Arabic, etc.) fit in 2 bytes.
And I do see advantages to ucstring over UTF-8 ;p when you need to manipulate the text data. For example, take the UTF-8 string “Hé”, which is 3 bytes: if you keep only the first 2 bytes, you cut the é in half and your UTF-8 string becomes invalid. Another case is that the length of the text is wrong: in UTF-8 this string counts as 3 units, while it is 2 characters in a ucstring. It’s not for nothing that Qt, MFC, etc. all use 2-byte or 4-byte internal strings.
From the Qt documentation:

> QString stores a string of 16-bit QChars, where each QChar corresponds to one UTF-16 code unit. (Unicode characters with code values above 65535 are stored using surrogate pairs, i.e., two consecutive QChars.)
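To make the truncation hazard concrete, here is a minimal sketch (the helper name is illustrative, not NeL API): truncating UTF-8 safely just means backing up to a code point boundary before cutting.

```cpp
#include <cstddef>
#include <string>

// Sketch: truncate a UTF-8 string to at most maxBytes without splitting a
// multi-byte sequence. UTF-8 continuation bytes have the form 0b10xxxxxx,
// so code point boundaries are easy to find by scanning backwards.
std::string truncateUtf8(const std::string &s, std::size_t maxBytes)
{
    if (s.size() <= maxBytes) return s;
    std::size_t cut = maxBytes;
    // Step back over continuation bytes (0x80..0xBF) to a lead byte.
    while (cut > 0 && (static_cast<unsigned char>(s[cut]) & 0xC0) == 0x80)
        --cut;
    return s.substr(0, cut);
}
```

With “Hé” (bytes 48 C3 A9) and maxBytes = 2, a naive byte cut leaves a dangling lead byte 0xC3, while this version returns “H”.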
Original comment by Jan Boon (Bitbucket: Jan Boon).
Yes, UTF-16 still has surrogate pairs, so there’s no advantage over UTF-8.
The same glyph-counting issue occurs with UTF-16 whenever a character needs a surrogate pair: the reported length of the text will be wrong there too.
Qt and Windows use UTF-16 for historical reasons; UTF-32 did not yet exist back then. Neither gives the glyph count: they just give the data length in 2-byte units.
It’s easy to implement a glyph count and text operations for UTF-8 where necessary.
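For instance, counting code points in UTF-8 takes only a few lines (a sketch, not an actual NeL helper). Code points still aren’t user-perceived glyphs, but a UTF-16 or UTF-32 length doesn’t give you those either:

```cpp
#include <cstddef>
#include <string>

// Sketch: count code points in a UTF-8 string. Continuation bytes are
// 0b10xxxxxx and never start a code point, so only lead bytes are counted.
std::size_t utf8CodePointCount(const std::string &s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) // skip continuation bytes
            ++count;
    return count;
}
```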
Original comment by Cédric Ochs (Bitbucket: Cédric OCHS).
Thanks for the link, it makes sense :slight_smile: You’ll notice that wchar_t is 4 bytes long under UNIX while it’s 2 bytes under Windows ;p Apparently Nevrax chose a 2-byte ucchar because Windows was using it (for historical reasons, as mentioned in your link).
Original comment by Jan Boon (Bitbucket: Jan Boon).
By the way, on Windows 10 it’s now possible to set your locale character set to UTF-8. Hopefully they will push that through as a default, or at least a more prominent option, at some point.
When that’s the case, GetACP() returns CP_UTF8, and the “ANSI” Win32 functions accept UTF-8 directly (so conversions to “wide char” are no longer necessary on systems that have it enabled).
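A minimal check for this, assuming the program just wants to know whether the wide-char conversion layer can be skipped (CP_UTF8 is code page 65001):

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // GetACP() returns the active ANSI code page of the process.
    if (GetACP() == CP_UTF8) // CP_UTF8 == 65001
    {
        // The -A ("ANSI") Win32 entry points now accept UTF-8 directly,
        // so no UTF-8 <-> wide-char conversion layer is needed.
        std::printf("Active code page is UTF-8\n");
    }
    else
    {
        std::printf("Active code page: %u\n", GetACP());
    }
    return 0;
}
```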
Original comment by Cédric Ochs (Bitbucket: Cédric OCHS).
Yes, I noticed that too :slight_smile: That’s good news :)
Original comment by Jan Boon (Bitbucket: Jan Boon).
It seems the ucstring conversion functions are UCS-2 and don’t actually support UTF-16 surrogate pairs (so ucstring is lossy for Unicode outside the BMP).
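To illustrate the loss (a standalone sketch, not NeL code): any code point above U+FFFF needs a UTF-16 surrogate pair, which a UCS-2 converter cannot produce:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    char32_t cp = 0x1F600; // an emoji, outside the BMP
    // Correct UTF-16: split into a high/low surrogate pair.
    char32_t v = cp - 0x10000;
    std::uint16_t hi = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    std::uint16_t lo = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    std::printf("U+%04X -> UTF-16: %04X %04X\n",
                static_cast<unsigned>(cp), hi, lo); // D83D DE00
    // UCS-2 has no surrogates: the code point is simply truncated
    // to 16 bits, producing an unrelated character.
    std::uint16_t ucs2 = static_cast<std::uint16_t>(cp); // 0xF600
    std::printf("UCS-2 truncation: %04X\n", ucs2);
    return 0;
}
```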
Original comment by Cédric Ochs (Bitbucket: Cédric OCHS).
Yes :disappointed: The serialization does seem to work, though: we tested it on PowerPC Mac some years ago :)
Original comment by Jan Boon (Bitbucket: Jan Boon).
There's no advantage in using ucstring, since the 2-byte format is insufficient for the full Unicode space.
It’s easier to just use UTF-8 everywhere.
Add:
- CUtfStringView: a reference to a UTF-8 or UTF-32 string, providing an iterator that yields 32-bit code points. Use it when processing text for rendering, so UTF-8 can be used directly (see the decoding sketch after this list).
- u32String: a UTF-32 string for editable text.
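As a sketch of the behaviour such an iterator would have (assumed, not the actual NeL implementation), here is a standalone decoder that yields one 32-bit code point per UTF-8 sequence, mapping malformed input to U+FFFD:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sketch (assumed behaviour, not the actual NeL implementation): decode
// UTF-8 into 32-bit code points, which is what a CUtfStringView-style
// iterator would yield one at a time. Malformed sequences become U+FFFD.
std::vector<char32_t> decodeUtf8(const std::string &s)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); )
    {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        if      (c < 0x80)           { len = 1; cp = c; }        // ASCII
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; } // 2-byte lead
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; } // 3-byte lead
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; } // 4-byte lead
        else { out.push_back(0xFFFD); ++i; continue; }           // stray byte
        if (i + len > s.size()) { out.push_back(0xFFFD); break; }
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k)
        {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) { ok = false; break; } // not a continuation
            cp = (cp << 6) | (cc & 0x3F);
        }
        out.push_back(ok ? cp : 0xFFFD);
        i += ok ? len : 1; // on error, resync one byte at a time
    }
    return out;
}
```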
This is done, pending changes to the network code.