fmt
Feature request: better support for UTF-8 localized number formatting
Hi! Here's the problem. Given that char8_t is kind of broken currently and still lacks a good standardized transcoding library, we're treating char-based strings as if they were UTF-8. We want to format localized numbers with these strings. The problem is that std::numpunct<char>::decimal_point() and thousands_sep() each return a single char, so they can't represent UTF-8 characters beyond the ASCII subset. Some locales use non-ASCII characters for these; one example is de-CH, which uses U+2019 for the digit separator. What we want to do is somehow transcode the values from std::numpunct<wchar_t> to UTF-8 and have libfmt use these instead.
Using a custom formatter specialization and a wrapper type isn't really an option for us, since we want this to be less intrusive and not something users of the library need to be concerned about. Fortunately, we do have a facade class for formatting localized strings, and this class owns the std::locale object, so it's possible for us to do some pre-processing or post-processing on the input and output arguments respectively. However, pre-processing doesn't work that well for vformat, since there's no way to convert format_args to wformat_args (and there probably shouldn't be, since that would increase code bloat by creating a link dependency between the char and wchar_t formatters due to the type erasure involved). So we're kind of stuck with post-processing.
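To make the transcoding step concrete, here is a minimal sketch of reading the separators from std::numpunct<wchar_t> and encoding them as UTF-8. EncodeUtf8, Utf8Punct and GetUtf8Punct are hypothetical names made up for this illustration, and the sketch assumes each wchar_t value is a complete code point (true for separators in the Basic Multilingual Plane, not for surrogate pairs on platforms with a 16-bit wchar_t).

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: encode one Unicode scalar value as UTF-8.
std::string EncodeUtf8(char32_t cp) {
    // Surrogates and values above U+10FFFF are not valid scalar values.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical holder for the two separators as UTF-8 strings.
struct Utf8Punct {
    std::string decimal_point;
    std::string thousands_sep;
};

// Assumption: each wchar_t returned by the facet is a whole code point.
Utf8Punct GetUtf8Punct(const std::locale& loc) {
    const auto& np = std::use_facet<std::numpunct<wchar_t>>(loc);
    return {EncodeUtf8(static_cast<char32_t>(np.decimal_point())),
            EncodeUtf8(static_cast<char32_t>(np.thousands_sep()))};
}
```

With these helpers, EncodeUtf8(0x2019) yields the three octets \xe2\x80\x99, and GetUtf8Punct on the classic "C" locale yields "." and ",".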
Ideally, if libfmt (and perhaps eventually the standardized version) had an option to use the std::num_put<char> facet instead of std::numpunct<char>, we could solve the problem that way by providing a custom std::num_put<char> facet that outputs the correct UTF-8 sequence. libfmt would then need to recognize a different localized number format specifier, perhaps uppercase N instead of L, to indicate that it should use std::num_put instead of std::numpunct.
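For illustration, this is roughly what such a custom std::num_put<char> facet could look like when driven by iostreams, which already route integer output through std::num_put. It is only a sketch of the idea: Utf8NumPut and FormatWithFacet are hypothetical names, only the long overload is handled, groups are fixed at three digits, and width, fill and base flags are ignored. It says nothing about how libfmt itself would call the facet.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical facet: writes a multi-byte (UTF-8) digit separator,
// which std::numpunct<char> cannot express.
class Utf8NumPut : public std::num_put<char> {
public:
    explicit Utf8NumPut(std::string sep) : sep_(std::move(sep)) {
        assert(!sep_.empty());
    }

protected:
    using std::num_put<char>::do_put;  // keep the other overloads visible

    // Simplification: ignores width, fill, showpos and non-decimal bases.
    iter_type do_put(iter_type out, std::ios_base&, char_type,
                     long v) const override {
        const std::string digits = std::to_string(v);
        const std::size_t first = (v < 0) ? 1 : 0;  // index of first digit
        std::string text;
        for (std::size_t i = 0; i < digits.size(); ++i) {
            // Insert the separator before every group of three digits,
            // but never directly after the sign or at the very start.
            if (i > first && (digits.size() - i) % 3 == 0) text += sep_;
            text += digits[i];
        }
        return std::copy(text.begin(), text.end(), out);
    }

private:
    std::string sep_;
};

// Hypothetical helper showing the facet in use with an ostringstream.
std::string FormatWithFacet(long v, const std::string& sep) {
    std::ostringstream oss;
    // The locale takes ownership of the facet.
    oss.imbue(std::locale(std::locale::classic(), new Utf8NumPut(sep)));
    oss << v;
    return oss.str();
}
```

Calling FormatWithFacet(1234567, "\xe2\x80\x99") produces 1’234’567 as UTF-8 octets.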
Another option would be to eventually support char8_t, char16_t and char32_t string formatting, and automatically transcode the values from std::numpunct<wchar_t> to the target character encoding.
It also looks like it would be possible to specialize the internal detail::int_writer or detail::arg_formatter template classes for each of the integral and floating point types to override the default formatting behavior, but this isn't a solution that would be portable to other implementations of std::format. So not really a valid solution for us.
Our current workaround, which should also work with the standardized std::format, is to detect when the decimal point or thousands separator is a non-ASCII character and override the std::locale object with a custom std::numpunct<char> facet. This facet uses the ASCII control characters \x01 and \x02 for the decimal point and digit separator, since these aren't found in strings in any of our use cases. We then do a post-processing pass on the formatted string to replace \x01 or \x02 with the correct UTF-8 octet sequence. It's definitely a hack, but it works.
It looks something like this:
#include <array>
#include <locale>
#include <string>
#include <string_view>

#include <fmt/format.h>

class Localizer
{
public:
    std::string VFormat(std::string_view fmtstr, fmt::format_args args) const
    {
        // Format using the locale that carries the placeholder numpunct facet.
        std::string result = fmt::vformat(locale_, fmtstr, args);
        if (postProcess_)
            DoPostProcess(result);
        return result;
    }

    template <typename... Args>
    std::string Format(std::string_view fmtstr, const Args&... args) const
    {
        return VFormat(fmtstr, fmt::make_format_args(args...));
    }

    // implementation also provides VFormatTo and FormatTo member functions similar to above

private:
    void DoPostProcess(std::string& result) const
    {
        // substitute \x01 and \x02 in result with values from the replacements_ array
    }

    std::locale locale_;                       // overridden with the placeholder facet when needed
    bool postProcess_;                         // true if a separator is non-ASCII
    std::array<std::string, 2> replacements_;  // UTF-8 decimal point and digit separator
    // localized string tables, etc. etc.
};
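The two pieces left out above, the placeholder facet and the substitution pass, might be sketched as follows. To keep the example free of dependencies it formats through std::ostringstream rather than fmt::vformat, and the names PlaceholderNumpunct, ReplacePlaceholders and FormatLocalized are hypothetical, as is the fixed grouping of three digits.

```cpp
#include <array>
#include <cassert>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical facet: reports control characters instead of real separators.
class PlaceholderNumpunct : public std::numpunct<char> {
protected:
    char do_decimal_point() const override { return '\x01'; }
    char do_thousands_sep() const override { return '\x02'; }
    // Assumption: groups of three digits, as in most Western locales.
    std::string do_grouping() const override { return "\3"; }
};

// Replace \x01 with replacements[0] (decimal point) and
// \x02 with replacements[1] (digit separator).
void ReplacePlaceholders(std::string& result,
                         const std::array<std::string, 2>& replacements) {
    assert(!replacements[0].empty() && !replacements[1].empty());
    std::string out;
    out.reserve(result.size());
    for (char c : result) {
        if (c == '\x01')
            out += replacements[0];
        else if (c == '\x02')
            out += replacements[1];
        else
            out += c;
    }
    result = std::move(out);
}

// Hypothetical end-to-end helper: format with placeholders, then substitute.
std::string FormatLocalized(double v,
                            const std::array<std::string, 2>& replacements) {
    std::ostringstream oss;
    oss.imbue(std::locale(std::locale::classic(), new PlaceholderNumpunct));
    oss << std::fixed << std::setprecision(2) << v;
    std::string result = oss.str();
    ReplacePlaceholders(result, replacements);
    return result;
}
```

The substitution is a single pass that builds a new string, so a replacement that is longer than one byte never shifts characters that are still to be scanned. The approach relies on \x01 and \x02 never appearing in the formatted text for any other reason.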
What are your thoughts? Any better ideas? Is there any good way out of this quagmire?
On second thought, a different format specifier to indicate preference of std::num_put instead of std::numpunct is probably a bad idea. It would still be nice if there were a standard, out-of-band way to tell it to use std::num_put instead, though.
I can certainly relate to your pain. Proper localization without proper UTF support is still a mirage for the most part. Sooner or later you will get bitten by reality and by implicit assumptions such as a single character equating to a single code unit. I noticed this with the de_CH locale just recently during my attempt to serve our Swiss customers better. Experiences like these are my main motivation to refuse any meaningful string handling using char-based strings. So there you go: either do string conversions between char-based and wchar_t-based strings wherever possibly needed (leaving litter all around) or simply stick with wchar_t-based strings and live with the size and performance impact.
Just using wchar_t for the notion of a thousands separator is still wrong. The type to represent a UTF character is a string, as it can span arbitrarily many code points.
Yeah exactly, the real problem is the standard library facets that return char_type for certain fields rather than string_type, namely std::numpunct and std::moneypunct. That part of the library hasn't aged nearly as well. Perhaps what is needed is for someone to write a proposal to modernize those while maintaining backwards compatibility. I think it should be possible. Getting it approved, well, that's a different story.
@DanielaE
> either do string conversions between char-based and wchar_t-based strings wherever possibly needed (leaving litter all around) or simply stick with wchar_t-based strings and live with the size and performance impact.
Definitely the case today, but it would be nice to get this fixed for the future.
> a different format specifier to indicate preference of std::num_put instead of std::numpunct is probably a bad idea.
It is.
As Jonathan correctly pointed out, wchar_t doesn't solve the problem (and should generally be avoided for other reasons).
It might be possible to replace numpunct with num_put for locale-specific formatting in {fmt}, although it had better use something less trashy than ostreambuf_iterator. A PR is welcome.
{fmt} now supports the UTF-8 format_facet locale facet which, among other things, makes using multi-code-unit digit separators possible. For example:
#include <fmt/format.h>
#include <locale>

int main() {
    std::locale::global(std::locale({}, new fmt::format_facet<std::locale>("’")));
    fmt::print("{:L}\n", 1000);
}
prints:
1’000
Here ’ is U+2019 (\xe2\x80\x99 in UTF-8).