sqlite_orm
Overhaul wide-string support
As already mentioned in Slack, the support for wide-character strings needs a rather complete overhaul and/or explicit documentation of its features. As far as I understand, it is broken on macOS/Linux.
- First of all, usage of `sqlite3_*_text16()` vs. `wstring_convert<codecvt_utf8_utf16<wchar_t>>` is kind of intermixed.
- There are only unit tests for a single code path: binding to a statement and extracting from a result set via `codecvt_utf8_utf16<wchar_t>`, but neither conversion from a column value nor calling a function or returning from it. [see test case]
- Returning a string from a function is broken [`statement_binder<>::result()`] (see the sketch after this list):
  - `sqlite3_result_text16()` expects the number of bytes, not characters. [see 3rd parameter]
  - `sqlite3_result_text16()` should be instructed to copy the string using `SQLITE_TRANSIENT`, otherwise the resulting memory goes out of scope. [see 4th parameter]
- Expecting UTF-16 encoded strings is correct on Windows, but not on other operating systems like macOS/Linux:
  - On Windows, everything's working fine: `sizeof(wchar_t) == 2` (16-bit), and the encoding is UTF-16.
  - On macOS/Linux, `sizeof(wchar_t) == 4` (32-bit):
    - Using the `sqlite3_*_16()` functions is outright wrong.
    - Using `codecvt_utf8_utf16<>` is bad:
      - While it's not prohibited to use `wchar_t` for UTF-16, it easily leads to subtly unexpected behaviour: because `wchar_t` is 32-bit, it usually carries UTF-32, not UTF-16.
        - Passed UTF-32 strings are suddenly treated as UTF-16 by sqlite_orm/SQLite.
        - Returned UTF-16 strings are suddenly treated as UTF-32 by the program.
      - In any case, [`codecvt_utf8_utf16<>`](https://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16) expects UTF-16, no matter the size of `wchar_t`: "If Elem is a 32-bit type, one UTF-16 code unit will be stored in each 32-bit character of the output sequence." I emphasize again that this isn't the regular expectation on macOS/Linux.
- While we are at it, I'd like to see a separation of wide-string support from `SQLITE_ORM_OMITS_CODECVT`, if possible: one might want to be able to pass or return wide strings from Windows API functions, even if not being able to serialize the statement.
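For illustration, here is a minimal sketch of what a fixed `result()` call could look like; the helper name `result_wide_text` is made up for the example, and it assumes a platform where `wchar_t` strings are actually UTF-16 (i.e. Windows):

```cpp
#include <sqlite3.h>
#include <string>

// Hypothetical helper illustrating the two fixes for statement_binder<>::result():
// the 3rd parameter is a length in BYTES (not characters), and SQLITE_TRANSIENT as
// the 4th parameter tells SQLite to copy the buffer before it goes out of scope.
inline void result_wide_text(sqlite3_context* context, const std::wstring& value) {
    sqlite3_result_text16(context,
                          value.c_str(),
                          static_cast<int>(value.size() * sizeof(wchar_t)),
                          SQLITE_TRANSIENT);
}
```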
One way of fixing the UTF-16 issue on macOS/Linux quickly is to disable UTF-16 support altogether when not on Windows (see the sketch below). This might not even have any impact, given that UTF-8 is prevalent on those systems.
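A rough sketch of that quick fix; the macro name `SQLITE_ORM_WCHAR_IS_UTF16` is purely hypothetical, the point is just to compile the wide-string specializations only where `wchar_t` is actually UTF-16:

```cpp
// Hypothetical feature-test macro: only Windows guarantees 16-bit, UTF-16 wchar_t.
#if defined(_WIN32)
#define SQLITE_ORM_WCHAR_IS_UTF16 1
#endif

#ifdef SQLITE_ORM_WCHAR_IS_UTF16
// std::wstring binder/extractor specializations would be defined here;
// on macOS/Linux the library would simply not offer wide-string support.
#endif
```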
Yeah, what you are saying is right. The only disadvantage related to your work was that we'd broken the unit tests in dev by my oversight.
No worries, when I touch code, I always find those corner cases, no matter the library or framework :)
@trueqbit you're good!
@trueqbit can we close it?
No! Still needs to be addressed.
Looks like we are facing a universal text encoding issue. There is no universal text encoding in the standard C++ library, so we are unable to fix it now. But when universal encoding appears in the standard C++ library, we can add it. Also, users can add customizations for `wstring` using external libraries within their projects right now.
What if we enable `wchar_t` support only on Windows (where it is UTF-16)? This is incredibly useful on Windows and I wouldn't be happy if it went away. SQLite itself supports UTF-16, and we could use SQLite's UTF-16 APIs directly (see the sketch below).
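A minimal sketch of what binding a `std::wstring` through SQLite's native UTF-16 API could look like on Windows; the helper name is made up for the example:

```cpp
#include <sqlite3.h>
#include <string>

#if defined(_WIN32)
// On Windows, wchar_t is 16-bit and wide strings are UTF-16, so they can be handed
// to SQLite's *_text16 API directly: length in bytes, SQLITE_TRANSIENT to copy.
inline int bind_wide_text(sqlite3_stmt* stmt, int index, const std::wstring& value) {
    return sqlite3_bind_text16(stmt,
                               index,
                               value.c_str(),
                               static_cast<int>(value.size() * sizeof(wchar_t)),
                               SQLITE_TRANSIENT);
}
#endif
```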
UTF-16 on other platforms is a delicate subject and would only be doable with a sane opt-in and only worth it if someone absolutely requests it.
Otherwise, there are universal C++ conversion facilities from UTF-32 [`std::codecvt_utf8<char32_t>`], and there is even a native UTF-32 character type, `char32_t` (see the sketch below). However, I would assign UTF-32 support to a different ticket.
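For reference, a minimal sketch of such a UTF-32 conversion with the standard (though C++17-deprecated) facility:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-32 <-> UTF-8 using std::codecvt_utf8<char32_t> (deprecated since C++17, but
// still available); unlike wchar_t, char32_t has an unambiguous encoding.
inline std::string utf32_to_utf8(const std::u32string& text) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.to_bytes(text);
}

inline std::u32string utf8_to_utf32(const std::string& text) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.from_bytes(text);
}
```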
- `codecvt_utf8` is deprecated in C++17. That is why it is used in `sqlite_orm` under a dedicated compile flag.
- How do you know that Windows uses `wchar_t` for UTF-16? Is it just a practical finding or is there a doc for it?
- Oh yes, you are right, I thought that only the UTF-16 conversion using `codecvt_utf8_utf16` is deprecated.
- Programming on Windows is not undefined behaviour 🤣 UTF-16 has been used since Windows 2000 as a super-set of UCS-2 (see the Wikipedia article), and the whole Win32 API uses UTF-16 based on `wchar_t`, as per Microsoft's documentation (see the article about Unicode).