sqlite_orm icon indicating copy to clipboard operation
sqlite_orm copied to clipboard

Overhaul wide-string support

Open trueqbit opened this issue 2 years ago • 9 comments

As already mentioned in Slack, the support for wide-character strings needs a rather complete overhaul and/or an explicit documentation of its features. As far as I understand it is broken on macos/Linux.

  • First of all, usage of sqlite_*_text16() vs. wstring_convert<codecvt_utf8_utf16<wchar_t>> is kind of intermixed.
  • There are only unit tests for a single code path: binding to a statement and extracting from a result set via codecvt_utf8_utf16<wchar_t>, but neither conversion from a column value nor calling a function or returning from it. [see test case]
  • Returning a string from a function is broken [statement_binder<>::result()]
    • sqlite3_result_text16() expects the number of bytes, not characters. [see 3rd parameter]
    • sqlite3_result_text16() should be instructed to copy the string using SQLITE_TRANSIENT), otherwise the resulting memory goes out of scope. [see 4th parameter]
  • Expecting UTF-16 encoded strings is correct on Windows, but not on other operating systems like macos/Linux:
    • On Windows, everything's working fine: sizeof(wchar_t) == 2 (16-bit), and encoding is UTF-16.
    • On macos/linux: sizeof(wchar_t) == 4 (32-bit):
      • Using sqlite3_*_16() functions is outrightly wrong.
      • Using codecvt_utf8_utf16<> is bad:
        • While it's not prohibited to use wchar_t for UTF-16, it easily leads to subtly unexpected behaviour: Because wchar_t is 32-bit, it usually carries UTF-32, not UTF-16.
          • Passed UTF-32 strings are suddenly treated as UTF-16 by sqlite_orm/sqlite.
          • Returned UTF-16 strings are suddenly treated as UTF-32 by the program.
          • In any case, [codecvt_utf8_utf16<>](https://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16) expects UTF-16, no matter the sizeof wchar_t: "If Elem is a 32-bit type, one UTF-16 code unit will be stored in each 32-bit character of the output sequence.". I emphasize again that this isn't the regular expectation on macos/Linux.
    • While we are at it, I'd like to see a separation of wide-string support from SQLITE_ORM_OMITS_CODECVT, if possible: One might want to be able to pass or return wide-strings from Windows API functions, even if not being able to serialize the statement.

One way of fixing the UTF-16 issue on macos/Linux quickly is by disabling UTF-16 unicode when not on Windows altogether. This might not even have any impact, given that UTF-8 is prevalent on those systems.

trueqbit avatar Mar 08 '22 17:03 trueqbit

Yeah what you are saying is right. The only disadvantage related to your work was that we've broken UT in dev by my oversight.

fnc12 avatar Mar 08 '22 17:03 fnc12

No worries, when I touch code, I always find those corner cases, no matter the library or framework :)

trueqbit avatar Mar 08 '22 18:03 trueqbit

@trueqbit you're good!

fnc12 avatar Mar 08 '22 18:03 fnc12

@trueqbit can we close it?

fnc12 avatar Mar 27 '22 04:03 fnc12

No! Still needs to be addressed.

trueqbit avatar Mar 27 '22 08:03 trueqbit

looks like we are facing universal text encoding issue. There is no universal text encoding in standard C++ library so we are unable to fix it now. But when universal encoding appears in standard C++ lib then we can add it. Also users can add customizations for wstring using external libraries within their projects right now

fnc12 avatar Mar 27 '22 11:03 fnc12

What if we enable wchar_t support only on Windows (which is UTF-16)? This is incredibly useful on Windows and I wouldn't be happy if it went away. SQLite itself supports UTF-16, and we could use SQLite UTF-16 APIs directly. UTF-16 on other platforms is a delicate subject and would only be doable with a sane opt-in and only worth it if someone absolutely requests it.

Otherwise there are universal C++ conversion functions from UTF-32 [std::convert_utf8<char32_t>] and there is even a native UTF-32 character type char32_t. However, I would assign UTF-32 support to a different ticket.

trueqbit avatar Mar 27 '22 12:03 trueqbit

  1. convert_utf8 is deprecated in C++17. That is why it is used in sqlite_orm under dedicated compile flag
  2. how do you know that Windows uses wchar_t for UTF-16? Is it just a practical finding or there is a doc with it?

fnc12 avatar Mar 27 '22 12:03 fnc12

  1. Oh yes you are right, I thought that only UTF-16 conversion using codecvt_utf8_utf16 is deprecated.
  2. Programming on Windows is not undefined behaviour 🤣 UTF-16 is used since Windows 2000 as a super-set of UCS-2 (see wikipedia article), the whole Win32 API is using UTF-16 as per Microsoft's documentation (see article about Unicode), and based on wchar_t.

trueqbit avatar Mar 27 '22 14:03 trueqbit