vsg::convert_utf does not work properly on platforms where wchar_t is not equivalent to char32_t

Open AnyOldName3 opened this issue 3 months ago • 1 comments

On some platforms, wchar_t is a 32-bit type with the same width as char32_t, intended to hold UCS-4/UTF-32 code points as fixed-width strings. On others, in particular those that adopted Unicode at the start of the 90s, before UTF-8 and UTF-16 had been invented and when the Unicode Consortium thought that sixteen bits would be enough to hold any character from any writing system humans had ever used, wchar_t is a 16-bit type with the same width as char16_t, intended to hold UCS-2 fixed-width strings or UTF-16 variable-width strings.

src/vsg/io/convert_utf.cpp works under the assumption that wchar_t can hold an entire Unicode code point on its own, which isn't guaranteed. This can be easily demonstrated by attempting to convert strings containing emoji between std::string and std::wstring in either direction on Windows, as most emoji occupy code points at or above U+10000 (65536), and Windows is one of the platforms where wchar_t is sixteen bits. When converting wide strings to narrow, each surrogate of a pair is treated as if it were unpaired and converted to three bytes, giving six nonsensical code units per code point, instead of being glued to its partner and converted to a combined four correct code units. When converting narrow strings to wide, the four code units are correctly decoded to the right code point held in a uint32_t, which is then static_cast to wchar_t. That cast truncates the most significant sixteen bits, which works for the first 65536 code points (which is most non-emoji text, hence why it's not been noticed), and then wraps around.
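As a rough illustration of what a surrogate-aware conversion could look like, here is a minimal sketch. It is not the code in src/vsg/io/convert_utf.cpp: the helper names (append_utf16, next_code_point, append_utf8, utf16_to_utf8) are hypothetical, and it uses char16_t/std::u16string in place of a 16-bit wchar_t so that it behaves the same on every platform. Unpaired surrogates are passed through unchanged, which is only one of several possible policies.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; none of these exist in vsg.

// Append one code point as UTF-16, splitting it into a surrogate pair
// when it does not fit in a single 16-bit code unit.
inline void append_utf16(std::u16string& out, char32_t cp)
{
    if (cp < 0x10000)
    {
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        cp -= 0x10000; // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
}

// Read one code point from UTF-16 starting at index i, advancing i past it.
// A high surrogate followed by a low surrogate is combined into one code point.
inline char32_t next_code_point(const std::u16string& in, std::size_t& i)
{
    char32_t unit = in[i++];
    if (unit >= 0xD800 && unit <= 0xDBFF && i < in.size())
    {
        char32_t low = in[i];
        if (low >= 0xDC00 && low <= 0xDFFF)
        {
            ++i;
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        }
    }
    return unit; // BMP code point, or an unpaired surrogate passed through as-is
}

// Append one code point as one to four UTF-8 code units.
inline void append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80)
    {
        out.push_back(static_cast<char>(cp));
    }
    else if (cp < 0x800)
    {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else if (cp < 0x10000)
    {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else
    {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Convert a UTF-16 string to UTF-8 one code point (not one code unit) at a time.
inline std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size();)
    {
        append_utf8(out, next_code_point(in, i));
    }
    return out;
}
```

With these helpers, U+1F600 becomes the pair 0xD83D 0xDE00 in UTF-16, and that pair converts to the four UTF-8 bytes F0 9F 98 80 rather than six. A real fix would presumably pick between this path and the existing one based on sizeof(wchar_t).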

I noticed this because I was poking around, and have seen this bug lots of times in different projects, and not because I'm affected by it, so there's no pressing need to fix this immediately, but it'll end up affecting someone eventually.

AnyOldName3 avatar Mar 21 '24 18:03 AnyOldName3