Support C++20 and std::u8string

Open jwillikers opened this issue 4 years ago • 8 comments

When compiling for C++20, the following error occurs:

../tests/test_parse_unicode.cpp:51:23: error: no matching conversion for functional-style cast from 'const char8_t [53]' to 'std::string' (aka 'basic_string<char, char_traits<char>, allocator<char> >')
                      std::string(u8"Ýôú'ℓℓ λáƭè ₥è áƒƭèř ƭλïƨ - #"));

It looks like the introduction of std::u8string is causing problems for conversions between char8_t and std::string types.

I'm not sure of the best way to handle this. My first thought is to create a type alias that can be configured as std::string for C++11, C++14, and C++17, or as std::u8string for C++20 and newer. That brings up an important question: should the toml11 API support only std::u8string for C++20 and beyond?
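For illustration, the type alias idea above could look roughly like the following. This is only a sketch: the `sketch` namespace, `char_type`, `string`, and `from_bytes` are hypothetical names, not part of toml11, and the switch relies on the standard feature-test macros `__cpp_char8_t` and `__cpp_lib_char8_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {  // hypothetical namespace, not toml11's real API

#if defined(__cpp_char8_t) && defined(__cpp_lib_char8_t)
using char_type = char8_t;  // C++20 and newer
#else
using char_type = char;     // C++11, C++14, C++17
#endif
using string = std::basic_string<char_type>;

static_assert(sizeof(char_type) == 1, "UTF-8 code units must be one byte");

// Build a sketch::string from raw UTF-8 bytes, whichever alias is active.
inline string from_bytes(const char* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    string out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(static_cast<char_type>(static_cast<unsigned char>(data[i])));
    }
    return out;
}

}  // namespace sketch
```

With an alias like this, `sketch::string(u8"...")` compiles in every standard mode, but any user code that passes a plain std::string would need a conversion such as `from_bytes` once the alias switches to std::u8string.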

jwillikers avatar Feb 19 '20 15:02 jwillikers

Yes, I know that problem... The other day, I did the same thing as you did and encountered the same error. I'm also not sure what the best way to deal with it is. Anyway, thank you for reporting this. The priority has increased.

There are several options. One is, as you suggested, to add a type alias that switches the implementation of toml::string from std::string to std::u8string. This way, users do not need to care about the character type used, but combining it with existing (non-u8string) code in C++20 mode could become a bit harder. Another is to add a template parameter to toml::value to give users a choice. Users could then choose which one to use in their own code, but the templatized code would become messy. The most ad hoc solution is to convert char8_t literals to std::string byte by byte in the test code, but that does not solve the fundamental problem.
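As a rough sketch of the third, ad hoc option, a byte-by-byte conversion of a u8 literal into std::string could look like this. The helper name `to_std_string` is made up for illustration and is not something toml11 provides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: accepts both `const char[N]` (u8"" before C++20)
// and `const char8_t[N]` (u8"" in C++20) and copies the bytes unchanged.
template <typename Char, std::size_t N>
std::string to_std_string(const Char (&literal)[N]) {
    static_assert(sizeof(Char) == 1, "expected a UTF-8 code unit type");
    assert(literal[N - 1] == Char(0));  // string literals are null-terminated
    std::string out;
    out.reserve(N - 1);
    for (std::size_t i = 0; i + 1 < N; ++i) {  // skip the trailing '\0'
        out.push_back(static_cast<char>(literal[i]));
    }
    return out;
}
```

A test could then write `to_std_string(u8"...")` where it previously wrote `std::string(u8"...")`, and the same line would compile in C++11 through C++20.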

Basically, I want to provide users the flexibility and controllability. So I prefer the second option in the previous paragraph, template. But currently, I've not done anything about this because the priority was low. Also, since I recognized the problem only a few days ago, I'm still not so confident about the solution. There could be another, better idea, not sure...

ToruNiina avatar Feb 19 '20 17:02 ToruNiina

I dug up some information on this and it looks like nobody is happy about the breaking conversions for std::u8string and char8_t. It looks like several built-in types are missing proper specializations for u8 types in C++20.

  1. {fmt} issue on char8_t support - https://github.com/fmtlib/fmt/issues/1405
  2. StackOverflow answer about C++ u8 conversions: https://stackoverflow.com/a/59055485/9835303
  3. Proposal to not use std::u8string or char8_t: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1747r0.html
  4. PR to fix std::u8string usage in PyBind: https://github.com/pybind/pybind11/pull/2026

jwillikers avatar Feb 19 '20 21:02 jwillikers

Would there be a way to also support reading to wstring instead of string, and serializing from wstring as UTF-8?

levicki avatar Jul 10 '20 13:07 levicki

Sorry for the late response. But we don't have a plan to serialize into or deserialize from wstring. Actually, wchar_t is an implementation-defined character type, and its internal representation is not guaranteed to be Unicode (it could be a local character encoding). Even if the environment uses Unicode, the encoding of wchar_t would not be UTF-8 but UTF-16 (e.g., Windows) or UTF-32 (e.g., Linux). Since the TOML standard says TOML data should be encoded in UTF-8, we can focus on char (the traditional way of handling byte arrays) and char8_t.

You can use a compiler builtin or an OS API for conversion between an array of wchar_t and a UTF-8 byte buffer. <codecvt> could be another option, but note that codecvt_utf8 has been deprecated since C++17.
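For readers who want to avoid both OS-specific APIs and the deprecated <codecvt> facets, a hand-written conversion is also possible. The sketch below is not part of toml11; `append_utf8` and `wide_to_utf8` are hypothetical names, and the code assumes wchar_t holds UTF-16 (when it is 2 bytes wide) or UTF-32 (otherwise), which, as noted above, the standard does not guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one Unicode code point to `out` as UTF-8.
inline void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Assumption: wchar_t is UTF-16 if 2 bytes wide, UTF-32 otherwise.
inline std::string wide_to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = static_cast<char32_t>(in[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const char32_t lo = static_cast<char32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // Replace unpaired surrogates and out-of-range values with U+FFFD.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The resulting std::string can then be handed to any UTF-8 based parser. On Windows, the OS API route would typically go through WideCharToMultiByte with CP_UTF8 instead.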

ToruNiina avatar Sep 20 '20 10:09 ToruNiina

No problem.

I have found a much better C++ parser for TOML in the meantime, which supports all conversions from/to STL containers, and whose author is more open-minded when it comes to feature requests that would make their library useful for more people.

levicki avatar Sep 20 '20 10:09 levicki

Nice. Most of the libraries are provided as is and toml11 is no exception. I hope you could solve your problem. The implementation of new features might take some time, and I don't always have time. But pull requests for new features are always welcome.

ToruNiina avatar Sep 20 '20 13:09 ToruNiina

Coming back to the original problem, I have added a workaround, and now both ""_toml and u8""_toml literals work in C++20 mode in the current release. CI now contains test cases in C++20 mode using several well-known compilers. It seems that all the features work in C++20.

And thank you very much, jwillikers, for surveying the situation. Currently u8string is still not supported, but I will later implement conversion from std::u8string via get and find, and conversion to toml::value. That means a normal std::string will be used as the internal string representation and we would not be able to get a raw reference to a u8string, but I think it is a good compromise in the current situation. Adding many ifdefs makes the code complicated.

ToruNiina avatar Sep 20 '20 15:09 ToruNiina

Most of the libraries are provided as is and toml11 is no exception.

I understand that very well, the only reason I ever asked about std::wstring support is because it is part of C++ STL, and it is kind of unavoidable to use std::wstring and the underlying wchar_t if you want to do any C++ coding on Windows.

I also understand that wchar_t is not the same size on Linux / macOS, and that char there usually means UTF-8, so if you wrote your library with those operating systems in mind, it is clear why you would refuse to support wchar_t and std::wstring.

I hope you could solve your problem.

Yes, I have solved it by switching to toml++.

The implementation of new features might take some time, and I don't always have time. But pull requests for new features are always welcome.

I understand that as well. However, people sometimes need to get their own work done too. That's usually why they look for a library someone else wrote in the first place -- to avoid having to implement stuff in a domain they aren't familiar with under time constraints of their own project or work assignment.

Sorry for going slightly off-topic, and I apologize if I came across as disrespectful in my previous response.

levicki avatar Sep 20 '20 16:09 levicki