Bug: Globalize number formatter is incorrect for numeric digits in supplemental plane
Hi there
Globalize (v1.7.0) formats numbers incorrectly with cldr-data (v36.0.0) when the CLDR numeric digits come from the Unicode supplementary planes (U+010000 to U+10FFFF), which UTF-16 encodes as surrogate pairs.
Short example, discussed below: 44.56 formatted in the ccp locale
- Should be: "𑄺𑄺.𑄻𑄼" = ["1113a", "1113a", "2e", "1113b", "1113c"] (hex codepoints)
- But returned by Globalize: "��.��" = [ 'd804', 'd804', '2e', 'dd38', 'd804' ]
Based on the formatted value returned by Globalize, I initially suspected that each character is represented internally as a surrogate pair (two 16-bit values) and that only the first value of each pair is returned. There's a worked example below. I now have some doubts about this theory, though: of the 4 numeric digits involved, 3 of the values returned by Globalize are the first half of a surrogate pair, but one isn't.
Example (no code)
For the "ccp" locale, the digits 0-9 are "𑄶𑄷𑄸𑄹𑄺𑄻𑄼𑄽𑄾𑄿", which have Unicode hex codepoints of ["11136", "11137", "11138", "11139", "1113a", "1113b", "1113c", "1113d", "1113e", "1113f"].
So the number 44.56 formatted in ccp should be "𑄺𑄺.𑄻𑄼" = ["1113a", "1113a", "2e", "1113b", "1113c"]
What is actually returned from Globalize is "��.��" = [ 'd804', 'd804', '2e', 'dd38', 'd804' ]
Using the Surrogate Pair Calculator for the individual characters in "𑄺𑄺.𑄻𑄼" = ["1113a", "1113a", "2e", "1113b", "1113c"]
- 1113a = D804 + DD3A
- 1113a = D804 + DD3A
- 2e = 2e (no pair needed)
- 1113b = D804 + DD3B (but Globalize actually returns dd38)
- 1113c = D804 + DD3C
So maybe Globalize is returning the first hex value from each surrogate pair? But for 1113b, dd38 is returned, not D804.
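The surrogate pair arithmetic above can be checked without an online calculator. The sketch below applies the standard UTF-16 encoding rule; `toSurrogatePair` is a hypothetical helper name used only for illustration and is not part of Globalize. One thing it shows is that dd38 is the low surrogate of 11138 (the ccp digit 2), so it is not the first or second half of any digit that appears in 44.56.

```javascript
// Split a supplementary-plane code point (U+10000..U+10FFFF) into its
// UTF-16 high and low surrogates, returned as lowercase hex strings.
// Hypothetical helper for illustration only.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xd800 + (offset >> 10);   // top 10 bits of the offset
  var low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits of the offset
  return [high.toString(16), low.toString(16)];
}

console.log(toSurrogatePair(0x1113a)); // [ 'd804', 'dd3a' ]
console.log(toSurrogatePair(0x1113b)); // [ 'd804', 'dd3b' ]
console.log(toSurrogatePair(0x1113c)); // [ 'd804', 'dd3c' ]
console.log(toSurrogatePair(0x11138)); // [ 'd804', 'dd38' ]
```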
Example (code)
// Output hex values for the code points of a JavaScript string
var asUnicodePoints = function(value) {
  // Array.from iterates by code point, so a valid surrogate pair is one element
  return Array.from(value).map(function(codePoint) {
    return codePoint.codePointAt(0).toString(16);
  });
};

// For us locale, works fine
var result = Globalize('us').numberFormatter()(44.56);
console.log(result);
// => 44.56
console.log(asUnicodePoints(result));
// => [ '34', '34', '2e', '35', '36' ]

// For ccp locale, wrongly returns first hex value from each surrogate pair?
var result = Globalize('ccp').numberFormatter()(44.56);
console.log(result);
// => ��.��
console.log(asUnicodePoints(result));
// => [ 'd804', 'd804', '2e', 'dd38', 'd804' ]

// For ccp locale, the true hex values for formatted 44.56 should be..
console.log(asUnicodePoints("𑄺𑄺.𑄻𑄼"));
// => [ '1113a', '1113a', '2e', '1113b', '1113c' ]
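One possible explanation that fits all five returned values: the 10 ccp digits occupy 20 UTF-16 code units, and if the digit string is indexed by code unit rather than by code point, then index 4 is d804, index 5 is dd38 and index 6 is d804. The sketch below reproduces the output under that assumption. `localizeByCodeUnit` and `asCodeUnits` are hypothetical helpers written for this illustration; I have not confirmed that Globalize does the lookup this way.

```javascript
// ccp digits 0-9 (U+11136..U+1113F), built from code points so the example
// does not depend on how the literal characters survive copy and paste.
var ccpDigits = "";
for (var i = 0; i < 10; i++) {
  ccpDigits += String.fromCodePoint(0x11136 + i);
}

// Hypothetical helper (not Globalize's code): replace each ASCII digit by
// indexing the digit string directly, which indexes by UTF-16 code unit.
function localizeByCodeUnit(formatted, digits) {
  return formatted.replace(/[0-9]/g, function(digit) {
    return digits[+digit];
  });
}

// Hex value of every UTF-16 code unit in a string.
function asCodeUnits(value) {
  return value.split("").map(function(unit) {
    return unit.charCodeAt(0).toString(16);
  });
}

console.log(ccpDigits.length);
// => 20
console.log(asCodeUnits(localizeByCodeUnit("44.56", ccpDigits)));
// => [ 'd804', 'd804', '2e', 'dd38', 'd804' ]
```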
Thanks for filing the issue and for your detailed debugging. I'm open to accepting a fix. Thanks!
@rxaviers I'll see what I can do. Any guidance on roughly where in the code I should be looking?
Awesome. Numbering system digits are set at https://github.com/globalizejs/globalize/blob/master/src/number/numbering-system-digits-map.js, stored as formatter properties at https://github.com/globalizejs/globalize/blob/master/src/number/format-properties.js#L63, then used here: https://github.com/globalizejs/globalize/blob/master/src/number/format.js#L96. Their respective unit tests can be found at https://github.com/globalizejs/globalize/blob/master/test/unit/number/format-properties.js and https://github.com/globalizejs/globalize/blob/master/test/unit/number/format.js.
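As a rough idea of what a fix might look like, here is a minimal sketch of a code-point-aware digit lookup. It assumes the numbering system digits arrive as a single string and that each ASCII digit in the formatted output is replaced by a lookup into it; `digitsToMap` and `localizeDigits` are hypothetical names and not the functions in the files linked above. It also assumes `Array.from` and `String.fromCodePoint` are available (ES2015), which may not hold for every environment Globalize supports.

```javascript
// Hypothetical helper: turn a digit string into an array with one entry per
// digit. Array.from iterates by code point, so surrogate pairs stay together.
function digitsToMap(digits) {
  var map = Array.from(digits);
  if (map.length !== 10) {
    throw new Error("Expected 10 digits, got " + map.length);
  }
  return map;
}

// Hypothetical helper: replace each ASCII digit using the per-digit array.
function localizeDigits(formatted, digitsMap) {
  return formatted.replace(/[0-9]/g, function(digit) {
    return digitsMap[+digit];
  });
}

// ccp digits 0-9 (U+11136..U+1113F), built from code points.
var ccpDigits = "";
for (var i = 0; i < 10; i++) {
  ccpDigits += String.fromCodePoint(0x11136 + i);
}

var localized = localizeDigits("44.56", digitsToMap(ccpDigits));
console.log(Array.from(localized).map(function(ch) {
  return ch.codePointAt(0).toString(16);
}));
// => [ '1113a', '1113a', '2e', '1113b', '1113c' ]
```

Digit strings made only of BMP characters produce the same array as before, so the lookup should behave identically for the other numbering systems.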
OK, this issue isn't going to be my highest priority, though I will hopefully get round to it at some point. I believe the issue only affects 4 locales, all related to the base ccp locale: ccp, ccp-u-nu-native, ccp-IN and ccp-IN-u-nu-native.