Feature request: better support for UTF-8 localized number formatting

Open jwtowner opened this issue 4 years ago • 6 comments

Hi! Here's the problem. Given that char8_t is kind of broken currently and still lacks a good standardized transcoding library, we're treating char-based strings as if they were UTF-8. We want to format localized numbers with these strings. The problem is that std::numpunct<char>::decimal_point() and thousands_sep() only return a single char, so they cannot represent UTF-8 characters beyond the ASCII subset. Some locales use non-ASCII characters for these; an example is de-CH, which uses U+2019 for the digit separator.

What we want to do is somehow transcode the values from std::numpunct<wchar_t> to UTF-8 and have libfmt use these instead. Using a custom formatter specialization and a wrapper type isn't really an option for us, since we want this to be less intrusive and not something users of the library need to be concerned about. Fortunately, we do have a facade class for formatting localized strings, and this class owns the std::locale object, so it's possible for us to do some pre-processing or post-processing on the input and output arguments respectively.

However, pre-processing doesn't work that well for vformat, since there's no way to convert format_args to wformat_args (and there probably shouldn't be, since that would increase code bloat by creating a link dependency between the char and wchar_t formatters due to the type erasure that is involved). So we're kind of stuck with post-processing.
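As a rough illustration of the transcoding step described above, here is a minimal sketch that encodes one code point as UTF-8. EncodeUtf8 is a hypothetical helper, not part of libfmt or the standard library, and it assumes the value obtained from std::numpunct<wchar_t> is a complete Unicode scalar value (true where wchar_t is 32 bits wide, but not guaranteed where wchar_t holds UTF-16 code units).

```cpp
#include <string>

// Hypothetical helper: encode a single Unicode scalar value as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string EncodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // ASCII: one code unit, which is all std::numpunct<char> can express.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out; // lone surrogate, not a valid scalar value
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For the de-CH separator U+2019 this yields three code units, which is why a single char return value cannot carry it.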

Ideally, if libfmt (and perhaps eventually the standardized version) had an option to use the std::num_put<char> facet instead of std::numpunct<char>, we could solve the problem that way by providing a custom std::num_put<char> facet that outputs the correct UTF-8 sequence. libfmt would then need to recognize a different localized number format specifier, perhaps uppercase N instead of L, to indicate that it should use std::num_put instead of std::numpunct.

Another option would be to eventually support char8_t, char16_t and char32_t string formatting, and automatically transcode the values from std::numpunct<wchar_t> to the target character encoding.

It also looks like it would be possible to specialize the internal detail::int_writer or detail::arg_formatter template classes for each of the integral and floating point types to override the default formatting behavior, but this isn't a solution that would be portable to other implementations of std::format. So not really a valid solution for us.

Our current workaround, which should also work with the standardized std::format, is to detect when the decimal point or thousands separator is a non-ASCII character and override the std::locale object with a custom std::numpunct<char> facet. This facet uses the ASCII control characters \x01 and \x02 for the decimal point and digit separator, since these aren't found in strings in any of our use cases. We then do a post-processing pass on the formatted string to replace \x01 or \x02 with the correct UTF-8 octet sequence. It's definitely a hack, but it works.

It looks something like this:

class Localizer
{
public:
    std::string VFormat(std::string_view fmtstr, fmt::format_args args) const
    {
        std::string result = fmt::vformat(locale_, fmtstr, args);
        if (postProcess_)
            DoPostProcess(result);
       return result;
    }

    template <typename... Args>
    std::string Format(std::string_view fmtstr, const Args&... args) const
    {
        return VFormat(fmtstr, fmt::make_format_args(args...));
    }

    // implementation also provides VFormatTo and FormatTo member functions similar to above

private:
    void DoPostProcess(std::string& result) const
    {
        // substitute \x01 and \x02 in result with values from the replacements_ array
    }

    std::locale locale_;
    bool postProcess_;
    std::array<std::string, 2> replacements_;
    // localized string tables, etc. etc.
};
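The placeholder facet and the post-processing pass could be sketched roughly as follows. This is not the poster's actual implementation: PlaceholderNumpunct, ReplacePlaceholders and FormatLocalized are hypothetical names, the three-digit grouping is an assumption, and std::ostringstream stands in for fmt::vformat so the sketch depends only on the standard library.

```cpp
#include <array>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: emits control characters instead of real punctuation.
class PlaceholderNumpunct : public std::numpunct<char>
{
protected:
    char do_decimal_point() const override { return '\x01'; }
    char do_thousands_sep() const override { return '\x02'; }
    std::string do_grouping() const override { return "\3"; } // assumed: groups of 3
};

// Post-processing pass: replace \x01 with replacements[0] (decimal point)
// and \x02 with replacements[1] (digit separator); copy everything else.
std::string ReplacePlaceholders(const std::string& input,
                                const std::array<std::string, 2>& replacements)
{
    std::string out;
    out.reserve(input.size());
    for (char c : input) {
        if (c == '\x01')
            out += replacements[0];
        else if (c == '\x02')
            out += replacements[1];
        else
            out += c;
    }
    return out;
}

// Stand-in for the formatting call: format with the placeholder facet,
// then substitute the UTF-8 sequences.
std::string FormatLocalized(double value,
                            const std::array<std::string, 2>& replacements)
{
    std::ostringstream os;
    // The locale takes ownership of the facet and deletes it when done.
    os.imbue(std::locale(std::locale::classic(), new PlaceholderNumpunct));
    os.setf(std::ios::fixed, std::ios::floatfield);
    os.precision(2);
    os << value;
    return ReplacePlaceholders(os.str(), replacements);
}
```

The approach only holds as long as \x01 and \x02 never occur in the format string or in any formatted argument, which is the assumption stated above.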

What are your thoughts? Any better ideas? Is there any good way out of this quagmire?

jwtowner, Sep 05 '20 02:09

On second thought, a different format specifier to indicate preference of std::num_put instead of std::numpunct is probably a bad idea. It would still be nice if there was a standard, out-of-band way to tell it to use std::num_put instead, though.

jwtowner, Sep 05 '20 20:09

I can certainly relate to your pain. Proper localization without proper UTF support is still a mirage for the most part. Sooner or later you will get bitten by reality and by implicit assumptions such as a single character equating to a single code unit. I noticed this with the de_CH locale just recently during my attempt to serve our Swiss customers better. Experiences like these are my main motivation for refusing to do any meaningful string handling with char-based strings. So there you go: either do string conversions between char-based and wchar_t-based strings wherever they might be needed (leaving litter all around) or simply stick with wchar_t-based strings and live with the size and performance impact.

DanielaE, Sep 06 '20 06:09

Just using wchar_t for the notion of thousands separator is still wrong. The type to represent a UTF character is string, as it can span arbitrarily many code points.

foonathan, Sep 06 '20 07:09

> Just using wchar_t for the notion of thousands separator is still wrong. The type to represent a UTF character is string, as it can span arbitrarily many code points.

Yeah exactly, the real problem is the Standard library facets that return char_type for certain fields rather than string_type. Namely std::numpunct and std::moneypunct. That part of the library hasn't aged nearly as well. Perhaps what is needed is for someone to write a proposal to modernize those while maintaining backwards compatibility. I think it should be possible. Getting it approved, well that's a different story.

jwtowner, Sep 07 '20 09:09

@DanielaE

> either do string conversions between char-based and wchar_t-based strings wherever possibly needed (leaving litter all around) or simply stick with wchar_t-based strings and live with the size and performance impact.

Definitely the case today, but would be nice to get this fixed for the future.

jwtowner, Sep 07 '20 11:09

> a different format specifier to indicate preference of std::num_put instead of std::numpunct is probably a bad idea.

It is.

As Jonathan correctly pointed out, wchar_t doesn't solve the problem (and should generally be avoided for other reasons).

It might be possible to replace numpunct with num_put for locale-specific formatting in {fmt}, although it had better use something less trashy than ostreambuf_iterator. A PR is welcome.

vitaut, Sep 20 '20 16:09

{fmt} now supports the UTF-8 format_facet locale facet, which, among other things, makes it possible to use multi-code-unit digit separators. For example:

#include <fmt/format.h>
#include <locale>

int main() {
  std::locale::global(std::locale({}, new fmt::format_facet<std::locale>("’")));
  fmt::print("{:L}\n", 1000);
}

prints:

1’000

Here ’ is U+2019 (\xe2\x80\x99 in UTF-8).

vitaut, Sep 03 '22 18:09