sqlite_orm
Overhaul wide-string support
As already mentioned in Slack, the support for wide-character strings needs a rather complete overhaul and/or explicit documentation of its features. As far as I understand, it is broken on macOS/Linux.
- First of all, usage of `sqlite3_*_text16()` vs. `wstring_convert<codecvt_utf8_utf16<wchar_t>>` is kind of intermixed.
- There are only unit tests for a single code path: binding to a statement and extracting from a result set via `codecvt_utf8_utf16<wchar_t>`, but neither conversion from a column value nor calling a function or returning from it. [see test case]
- Returning a string from a function is broken [`statement_binder<>::result()`] (see the sketch after this list):
  - `sqlite3_result_text16()` expects the number of bytes, not characters. [see 3rd parameter]
  - `sqlite3_result_text16()` should be instructed to copy the string using `SQLITE_TRANSIENT`, otherwise the resulting memory goes out of scope. [see 4th parameter]
- Expecting UTF-16 encoded strings is correct on Windows, but not on other operating systems like macOS/Linux:
  - On Windows, everything's working fine: `sizeof(wchar_t) == 2` (16-bit), and the encoding is UTF-16.
  - On macOS/Linux, `sizeof(wchar_t) == 4` (32-bit):
    - Using the `sqlite3_*_16()` functions is outright wrong.
    - Using `codecvt_utf8_utf16<>` is bad:
      - While it's not prohibited to use `wchar_t` for UTF-16, it easily leads to subtly unexpected behaviour: because `wchar_t` is 32-bit, it usually carries UTF-32, not UTF-16.
        - Passed UTF-32 strings are suddenly treated as UTF-16 by sqlite_orm/SQLite.
        - Returned UTF-16 strings are suddenly treated as UTF-32 by the program.
      - In any case, [`codecvt_utf8_utf16<>`](https://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16) expects UTF-16, no matter the size of `wchar_t`: "If Elem is a 32-bit type, one UTF-16 code unit will be stored in each 32-bit character of the output sequence." I emphasize again that this isn't the regular expectation on macOS/Linux.
- While we are at it, I'd like to see a separation of wide-string support from `SQLITE_ORM_OMITS_CODECVT`, if possible: one might want to be able to pass or return wide strings from Windows API functions, even if not being able to serialize the statement.
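For illustration, here is a minimal sketch of what a fixed `result()` call could look like; the helper name `result_wide_text` is made up for the example, and it assumes a platform where `wchar_t` strings are actually UTF-16 (i.e. Windows):

```cpp
#include <sqlite3.h>
#include <string>

// Hypothetical helper illustrating the two fixes for statement_binder<>::result():
// the 3rd parameter is a length in BYTES (not characters), and SQLITE_TRANSIENT as
// the 4th parameter tells SQLite to copy the buffer before it goes out of scope.
inline void result_wide_text(sqlite3_context* context, const std::wstring& value) {
    sqlite3_result_text16(context,
                          value.c_str(),
                          static_cast<int>(value.size() * sizeof(wchar_t)),
                          SQLITE_TRANSIENT);
}
```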
One way of fixing the UTF-16 issue on macOS/Linux quickly is to disable UTF-16 support altogether when not on Windows (see the sketch below). This might not even have any impact, given that UTF-8 is prevalent on those systems.
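A rough sketch of that quick fix; the macro name `SQLITE_ORM_WCHAR_IS_UTF16` is purely hypothetical, the point is just to compile the wide-string specializations only where `wchar_t` is actually UTF-16:

```cpp
// Hypothetical feature-test macro: only Windows guarantees 16-bit, UTF-16 wchar_t.
#if defined(_WIN32)
#define SQLITE_ORM_WCHAR_IS_UTF16 1
#endif

#ifdef SQLITE_ORM_WCHAR_IS_UTF16
// std::wstring binder/extractor specializations would be defined here;
// on macOS/Linux the library would simply not offer wide-string support.
#endif
```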
Yeah, what you are saying is right. The only disadvantage related to your work was that we'd broken the unit tests in dev by my oversight.
No worries, when I touch code, I always find those corner cases, no matter the library or framework :)
@trueqbit you're good!
@trueqbit can we close it?
No! Still needs to be addressed.
Looks like we are facing a universal text encoding issue. There is no universal text encoding in the standard C++ library, so we are unable to fix it now. But when universal encoding appears in the standard C++ library, we can add it. Also, users can add customizations for `wstring` using external libraries within their projects right now.
What if we enable `wchar_t` support only on Windows (where it is UTF-16)? This is incredibly useful on Windows and I wouldn't be happy if it went away. SQLite itself supports UTF-16, and we could use SQLite's UTF-16 APIs directly (see the sketch below).
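A minimal sketch of what binding a `std::wstring` through SQLite's native UTF-16 API could look like on Windows; the helper name is made up for the example:

```cpp
#include <sqlite3.h>
#include <string>

#if defined(_WIN32)
// On Windows, wchar_t is 16-bit and wide strings are UTF-16, so they can be handed
// to SQLite's *_text16 API directly: length in bytes, SQLITE_TRANSIENT to copy.
inline int bind_wide_text(sqlite3_stmt* stmt, int index, const std::wstring& value) {
    return sqlite3_bind_text16(stmt,
                               index,
                               value.c_str(),
                               static_cast<int>(value.size() * sizeof(wchar_t)),
                               SQLITE_TRANSIENT);
}
#endif
```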
UTF-16 on other platforms is a delicate subject and would only be doable with a sane opt-in and only worth it if someone absolutely requests it.
Otherwise, there are universal C++ conversion facilities from UTF-32 [`std::codecvt_utf8<char32_t>`], and there is even a native UTF-32 character type, `char32_t` (see the sketch below). However, I would assign UTF-32 support to a different ticket.
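For reference, a minimal sketch of such a UTF-32 conversion with the standard (though C++17-deprecated) facility:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-32 <-> UTF-8 using std::codecvt_utf8<char32_t> (deprecated since C++17, but
// still available); unlike wchar_t, char32_t has an unambiguous encoding.
inline std::string utf32_to_utf8(const std::u32string& text) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.to_bytes(text);
}

inline std::u32string utf8_to_utf32(const std::string& text) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.from_bytes(text);
}
```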
- `codecvt_utf8` is deprecated in C++17. That is why it is used in `sqlite_orm` under a dedicated compile flag.
- How do you know that Windows uses `wchar_t` for UTF-16? Is it just a practical finding or is there a doc for it?
- Oh yes, you are right, I thought that only the UTF-16 conversion using `codecvt_utf8_utf16` is deprecated.
- Programming on Windows is not undefined behaviour 🤣 UTF-16 has been used since Windows 2000 as a super-set of UCS-2 (see the Wikipedia article), and the whole Win32 API uses UTF-16 based on `wchar_t`, as per Microsoft's documentation (see the article about Unicode).