
String (and path) encoding, terminal in/out

Open jmarrec opened this issue 2 years ago • 1 comments

Channel

C++Weekly

Topics

In the Unix world, all is well: your terminal uses UTF-8 encoding and std::filesystem::path uses char.

On Windows things go sideways real quick: if you're reading command-line parameters with non-ASCII chars you probably need to use wmain, then narrow to std::string. The <codecvt> header was deprecated in C++17, and googling/Stack Overflow is not helpful because the vast majority of what you find still relies on it. On top of that, fs::path uses wchar_t as its native character type on Windows.
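To make the "narrow to std::string" step concrete, here is a minimal sketch of a UTF-16 to UTF-8 conversion that does not touch <codecvt>. The names `append_utf8` and `utf16_to_utf8` are hypothetical helpers written for this illustration, not part of any library. It assumes the wide arguments from wmain have already been copied into a char16_t buffer (wchar_t is 16 bits wide on Windows), and it refuses input that contains an unpaired surrogate rather than guessing.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: append one Unicode code point to `out` as UTF-8.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: strict UTF-16 -> UTF-8.
// Returns std::nullopt if the input contains an unpaired surrogate.
inline std::optional<std::string> utf16_to_utf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            if (i + 1 >= in.size()) return std::nullopt;
            const char32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) return std::nullopt;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            ++i;                                       // consumed the low half
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {     // stray low surrogate
            return std::nullopt;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

A caller would check the returned optional before using the string, for example falling back to an error message when an argument is not valid UTF-16. On Windows, the platform's own WideCharToMultiByte with CP_UTF8 is the more likely production choice; the hand-written version is only meant to show what the conversion involves.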

Length

Probably long form, 10-20min.

Note: should this be picked up as a strong candidate, I'm willing to contribute to the episode preparation. I can provide an example use case, some test code, potential solutions, etc.

jmarrec · Jul 06 '23 18:07

On Windows with NTFS, file and directory names are just a series of 16-bit integers. There is no requirement for them to be valid Unicode/UTF-16, so it's not always possible to correctly round-trip them through UTF-8. This is why std::filesystem::path uses wchar_t on Windows as its native representation.

However, file and directory names that can't be converted to UTF-8 are rare, and could be considered a bug if ever generated. Microsoft seems to be taking this approach, since they now encourage developers to use UTF-8 with the -A APIs instead of the -W APIs. Internally, the -A APIs just convert automatically (and with more optimized code than you could write yourself), and in practice they don't error on invalid Unicode; they just convert it in a lossy fashion.
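As a rough illustration of what "lossy" means here, the sketch below replaces every unpaired surrogate with U+FFFD before encoding. Both function names are hypothetical, and this is an assumption about the general shape of a lossy conversion, not a description of how the Windows APIs are implemented. The point it shows is that two different 16-bit names can end up as the same UTF-8 string, so the original name cannot be recovered.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: lossy UTF-16 -> UTF-8 that substitutes U+FFFD
// for every unpaired surrogate instead of failing.
inline std::string utf16_to_utf8_lossy(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Well-formed surrogate pair: combine into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (high || low) {
            cp = 0xFFFD;  // unpaired surrogate: information is lost here
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: true when two distinct 16-bit names become
// indistinguishable after the lossy conversion.
inline bool names_collide(std::u16string_view a, std::u16string_view b) {
    return a != b && utf16_to_utf8_lossy(a) == utf16_to_utf8_lossy(b);
}
```

For instance, a name ending in a lone 0xD800 and a name ending in a lone 0xDC00 are different sequences of 16-bit integers, yet both come out with the same trailing replacement character.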

LB-- · Jul 14 '23 02:07