ACE_TAO
ACE/TAO Wide Strings on Linux
Discussed in https://github.com/DOCGroup/ACE_TAO/discussions/2144
Originally posted by wkbrd October 16, 2023
We are attempting to use wide strings (CORBA::WChar*) on Linux, but the encoding does not behave as expected for characters whose representation does not fit into a single UTF-16 code unit.
On Linux, wchar_t is a 32-bit type. When a wide string comes over the wire, including from other ORBs, the default behavior is that each element of the string holds the next 16-bit code unit of the UTF-16 representation of the string, rather than the next character in the native UTF-32 layout. Is this intended?
For example, 🂡🂮🂭🂫🂪 is being marshaled as L"\xd83c\xdca1\xd83c\xdcae\xd83c\xdcad\xd83c\xdcab\xd83c\xdcaa"
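To check that the values above really are UTF-16 surrogate pairs stored one code unit per wchar_t, the pairs can be recombined by hand. The following is only a diagnostic sketch: `utf16_units_to_utf32` is a hypothetical helper written for this discussion, not part of ACE/TAO, and it assumes the received elements have been copied into a vector of 32-bit integers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an ACE/TAO API): combine UTF-16 code units that
// arrived one per 32-bit element into UTF-32 code points.
std::vector<char32_t> utf16_units_to_utf32(const std::vector<std::uint32_t>& units)
{
  std::vector<char32_t> out;
  for (std::size_t i = 0; i < units.size(); ++i) {
    const std::uint32_t u = units[i];
    const bool is_high = (u >= 0xD800 && u <= 0xDBFF);
    if (is_high && i + 1 < units.size()) {
      const std::uint32_t lo = units[i + 1];
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // Standard surrogate-pair decoding: 10 bits from each half.
        out.push_back(static_cast<char32_t>(
            0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)));
        ++i;  // the low surrogate has been consumed
        continue;
      }
    }
    // BMP code unit (or an unpaired surrogate) is passed through unchanged.
    out.push_back(static_cast<char32_t>(u));
  }
  return out;
}
```

Feeding it 0xd83c, 0xdca1 yields U+1F0A1, the first playing card in the example string, which is consistent with the string having been delivered as UTF-16 code units.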
A search of the source tree found references to -ORBNativeWcharCodeSet UCS-4 -ORBWcharCodesetTranslator WUCS4_UTF16_Factory in TAO/tests/CodeSets/simple/. Using -ORBNativeWcharCodeSet UCS-4 produces a string represented as an array of wchar_t twice the length of the UTF-32 representation, where each pair of elements is filled with the low 16 bits of the UTF-32 character followed by the high 16 bits of the UTF-32 character.
For example, 🂡🂮🂭🂫🂪 is being marshaled as L"¡\001®\001¬\001«\001ª\001"
(gdb) print /x wstrStreamName[0]
$2 = 0xf0a1
(gdb) print /x wstrStreamName[1]
$3 = 0x1
(gdb) print /x wstrStreamName[2]
$4 = 0xf0ae
(gdb) print /x wstrStreamName[3]
$5 = 0x1
(gdb) print /x wstrStreamName[4]
$6 = 0xf0ad
(gdb) print /x wstrStreamName[5]
$7 = 0x1
(gdb) print /x wstrStreamName[6]
$8 = 0xf0ab
(gdb) print /x wstrStreamName[7]
$9 = 0x1
(gdb) print /x wstrStreamName[8]
$10 = 0xf0aa
(gdb) print /x wstrStreamName[9]
$11 = 0x1
(gdb) print /x wstrStreamName[10]
$12 = 0x0
(gdb) print /x wstrStreamName[11]
$13 = 0x0
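The gdb values fit the "low half, then high half" description: joining each pair of elements gives back the expected code point. A minimal sketch of that check follows, assuming the elements have been copied into a vector of 32-bit integers. `join_low_high_halves` is a hypothetical name used for illustration only; it is not an ACE/TAO function and is not meant as a fix.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical diagnostic helper (not an ACE/TAO API): treat the input as
// pairs of (low 16 bits, high 16 bits) and rebuild one UTF-32 value per pair.
std::vector<char32_t> join_low_high_halves(const std::vector<std::uint32_t>& halves)
{
  std::vector<char32_t> out;
  for (std::size_t i = 0; i + 1 < halves.size(); i += 2) {
    const std::uint32_t low = halves[i] & 0xFFFF;
    const std::uint32_t high = halves[i + 1] & 0xFFFF;
    out.push_back(static_cast<char32_t>((high << 16) | low));
  }
  return out;
}
```

With the first two elements from the session above (0xf0a1, 0x1) this gives 0x1f0a1, that is U+1F0A1, and the trailing pair (0x0, 0x0) becomes the terminating null.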
Character reference: https://en.wikipedia.org/wiki/Playing_cards_in_Unicode
Based on the discussion, we attempted to use WUCS4_UTF16.cpp but encountered issues. The associated PR addresses those issues.
Did you build ACE with uses_wchar?
On a very recent FreeBSD with LLVM 16 I get build failures in ACEXML with zzip and ACEXML/common/ZipCharStream.cpp: ACEXML_Char is wchar_t, and the zip/zzip libraries return ordinary (char) values that are not compatible with ACEXML_Char.
(Maybe this is totally unrelated to what you are doing.)
It does seem to be unrelated, please open a new issue/discussion.