STL `<cuchar>`: `mbrtoc8()` and `c8rtomb()` are not yet implemented

WG21-N4892 [cuchar.syn] specifies:

size_t mbrtoc8(char8_t* pc8, const char* s, size_t n, mbstate_t* ps);
size_t c8rtomb(char* s, char8_t c8, mbstate_t* ps);

These functions are not yet implemented in the UCRT, so they aren't mentioned in our `<cuchar>`:

https://github.com/microsoft/STL/blob/d6f9987d7ec694b8599fe21e87e3653c36566223/stl/inc/cuchar#L21-L28

@amyw-msft filed internal task OS-25681113 "Implement mbrtoc8, c8rtomb" to implement them in the UCRT. Until then, changes to `<cuchar>` are blocked.

Sep 16 '21 23:09 StephanTLavavej

WG14 adopted N2653 (char8_t: A type for UTF-8 characters and strings (Revision 1)) for C23 during their January/February 2022 meeting as indicated by the minutes in N2991. The wording changes are present in N2912.

All that to say, if implementation in the UCRT was awaiting WG14 approval, that is no longer a blocking concern.

Aug 03 '22 05:08 tahonermann

@StephanTLavavej - It has now been 5 years since standardization, and there still does not appear to be any conforming way to convert UTF-encoded text to a locale-specific multibyte encoding using only the C++ standard library. (I know it's possible by calling MultiByteToWideChar/WideCharToMultiByte directly, as the standard library does.)

wcrtomb/mbrtowc are simply broken for anything outside the Unicode BMP because they just die with EILSEQ if the character doesn't fit in UCS-2 (see the relevant source code here).
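
To illustrate why a single 16-bit `wchar_t` cannot carry such a character, here is a minimal sketch of the UTF-16 surrogate split. The helper names `needs_surrogate_pair` and `to_surrogates` are hypothetical and are not taken from the UCRT sources; the arithmetic is the standard UTF-16 encoding rule.

```cpp
#include <utility>

// Hypothetical helper: true when the code point lies outside the BMP
// and therefore needs two UTF-16 code units.
inline bool needs_surrogate_pair(char32_t cp) {
    return cp > 0xFFFF;
}

// Hypothetical helper: splits a supplementary code point
// (U+10000..U+10FFFF) into its high and low surrogates.
// The caller is assumed to have checked needs_surrogate_pair() first.
inline std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
    const char32_t v = cp - 0x10000;                    // 20 significant bits
    return {static_cast<char16_t>(0xD800 + (v >> 10)),  // top 10 bits
            static_cast<char16_t>(0xDC00 + (v & 0x3FF))}; // low 10 bits
}
```

For example, U+1F600 splits into the pair 0xD83D, 0xDE00, so an interface that accepts exactly one 16-bit unit per call has no way to receive the whole character.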

c16rtomb/mbrtoc16/c32rtomb/mbrtoc32 are unusable because Microsoft has a non-conforming implementation that causes them to always use UTF-8, instead of the current locale (see this question and this article on MS documentation).

If c8rtomb/mbrtoc8 were implemented, presumably they would actually use the current locale instead of forcing locale-independent UTF-8, which would make them a redundant no-op.

Any updates on this?

Jan 19 '25 04:01 owacoder

> wcrtomb/mbrtowc are simply broken for anything outside the Unicode BMP because they just die with EILSEQ if the character doesn't fit in UCS-2

Just curious, can you name an encoding, supported by UCRT and MSVC STL, that can represent characters outside the BMP?

I know that GB18030 can, but it's not supported by UCRT.

Jan 19 '25 14:01 cpplearner

UTF-8 supports characters outside the BMP. The issue is that there is no single interface that allows conversion from full Unicode (including outside the BMP) to both UTF-8 and legacy codepages based on the current locale setting as the standard dictates. c32rtomb/mbrtoc32 would work excellently here if they were standards-conforming and used the current locale.

Suppose that you have a UTF-32 codepoint that you want to convert to the current multibyte locale (which could be UTF-8, or may not be). It's true that the codepage you're converting to may not support that codepoint, but which interface should be used to do the conversion? That's the main question.
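
For reference, the call pattern the standard describes for this case looks roughly like the sketch below. The helper name `code_point_to_multibyte` is hypothetical; only `std::c32rtomb`, `std::mbstate_t`, and `MB_LEN_MAX` come from the standard library. Whether the output follows the current locale depends on the implementation, which is exactly the point under discussion.

```cpp
#include <climits>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

// Hypothetical helper: converts one UTF-32 code point to the multibyte
// encoding chosen by the implementation and appends it to `out`.
// Returns false when std::c32rtomb reports an encoding error.
bool code_point_to_multibyte(char32_t cp, std::string& out) {
    char buf[MB_LEN_MAX];
    std::mbstate_t state{};  // fresh conversion state for a single code point
    const std::size_t n = std::c32rtomb(buf, cp, &state);
    if (n == static_cast<std::size_t>(-1)) {
        return false;  // not representable in the target encoding
    }
    out.append(buf, n);
    return true;
}
```

A caller would normally run `std::setlocale(LC_ALL, "")` first so that the environment's locale is in effect. With the behavior described earlier in this thread, the UCRT would produce UTF-8 from this call regardless of that setting.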

Jan 19 '25 14:01 owacoder

No updates, sorry. They keep finding more work for us (UCRT/STL maintainers) to do without adding more maintainers.

Jan 19 '25 17:01 StephanTLavavej

Thanks for the quick response! Do you know if the above analysis regarding no standard way to convert full Unicode to multibyte is correct? And do we expect that MSVC will conform the c16rtomb/mbrtoc16/c32rtomb/mbrtoc32 implementations to the standard in the future too? Right now the documentation just says they're fixed to UTF-8 "for compatibility reasons."

Jan 19 '25 21:01 owacoder

> Do you know if the above analysis regarding no standard way to convert full Unicode to multibyte is correct?

I'm not deeply familiar with that part of the Standard. (I've only been here 18 years, I don't know everything yet :joy_cat:)

> And do we expect that MSVC will conform the c16rtomb/mbrtoc16/c32rtomb/mbrtoc32 implementations to the standard in the future too? Right now it's just said they're fixed to UTF-8 "for compatibility reasons."

It's possible, but because it's the UCRT and not the STL, I wouldn't expect it to change soon (where soon is "within a decade").

Jan 19 '25 21:01 StephanTLavavej

@StephanTLavavej, is there a way for someone from outside of Microsoft to offer contributions for the UCRT? These functions are not particularly difficult to implement. I could at least provide a reference implementation.
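
As a rough illustration of the scale involved (a sketch under stated assumptions, not the reference implementation offered above and not UCRT code), a `c8rtomb`-like function can accumulate UTF-8 code units into a code point and hand the result to `std::c32rtomb`. The names `Utf8Accumulator` and `sketch_c8rtomb` are hypothetical, `unsigned char` stands in for `char8_t` so the sketch does not require C++20, and the pending state lives in a separate struct rather than inside `mbstate_t`. Overlong forms and surrogate code points are not rejected here.

```cpp
#include <climits>
#include <cstddef>
#include <cuchar>
#include <cwchar>

// Hypothetical pending state; a real implementation would keep this in mbstate_t.
struct Utf8Accumulator {
    char32_t code_point = 0;
    int remaining = 0;  // continuation code units still expected
};

// Hypothetical c8rtomb-like helper. Returns the number of bytes written to
// `out`, 0 while a multi-unit sequence is still incomplete, or (size_t)-1
// on ill-formed input or when std::c32rtomb fails.
std::size_t sketch_c8rtomb(char* out, unsigned char c8,
                           Utf8Accumulator& acc, std::mbstate_t& ps) {
    const std::size_t error = static_cast<std::size_t>(-1);
    if (acc.remaining == 0) {
        // Expecting a single-unit character or a lead unit.
        if (c8 < 0x80) {
            acc.code_point = c8;
        } else if ((c8 & 0xE0) == 0xC0) {
            acc.code_point = c8 & 0x1F; acc.remaining = 1;
        } else if ((c8 & 0xF0) == 0xE0) {
            acc.code_point = c8 & 0x0F; acc.remaining = 2;
        } else if ((c8 & 0xF8) == 0xF0) {
            acc.code_point = c8 & 0x07; acc.remaining = 3;
        } else {
            return error;  // stray continuation unit or invalid lead unit
        }
    } else {
        // Expecting a continuation unit of the form 10xxxxxx.
        if ((c8 & 0xC0) != 0x80) {
            acc = Utf8Accumulator{};
            return error;
        }
        acc.code_point = (acc.code_point << 6) | (c8 & 0x3F);
        --acc.remaining;
    }
    if (acc.remaining > 0) {
        return 0;  // nothing to emit until the sequence is complete
    }
    const char32_t cp = acc.code_point;
    acc = Utf8Accumulator{};
    return std::c32rtomb(out, cp, &ps);  // delegate the locale-side encoding
}
```

The buffer passed as `out` is assumed to hold at least `MB_LEN_MAX` bytes. Delegating to `std::c32rtomb` means the output encoding is whatever that function targets on the platform.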

Jan 20 '25 22:01 tahonermann

No, they aren't set up to accept external contributions.

Jan 20 '25 22:01 StephanTLavavej
