luxon icon indicating copy to clipboard operation
luxon copied to clipboard

Incorrect months for Japanese

Open andrewplummer opened this issue 6 years ago • 26 comments

Doing this:

import { Info } from 'luxon';
Info.months('short', { locale: 'ja-JP' });

I would expect a result of ["1月", "2月", "3月", "4月", "5月", "6月", "7月", "8月", "9月", "10月", "11月", "12月"] , however the result is ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"].

new Intl.DateTimeFormat('ja-JP', { month: 'short'}).formatToParts() produces two array entries... Luxon appears to be leaving off the literal part.

Environment is Chrome for OSX 76.0.3809.87

andrewplummer avatar Aug 06 '19 08:08 andrewplummer

Same seems to be true for long and other formats as well.

andrewplummer avatar Aug 06 '19 08:08 andrewplummer

Noticing a similar issue with times... compare:

DateTime.fromJSDate(new Date()).setLocale('ja-JP').toLocaleString(DateTime.TIME_SIMPLE)

with

new Intl.DateTimeFormat('ja-JP', {hour:'numeric', minutes:'2-digit'}).format(new Date())

Interestingly Intl is leaving the minutes off but in any case I would expect them to match up.

andrewplummer avatar Aug 06 '19 08:08 andrewplummer

Yes, I think this is a bug. It looks like what's happening is that Luxon is still telling Intl to use the Latin numbering system, whereas it should be letting Intl decide that on its own. I'll have to trace through it.

icambron avatar Aug 15 '19 16:08 icambron

Awesome! Thanks for the quick response!

andrewplummer avatar Aug 16 '19 00:08 andrewplummer

This didn't turn out to be what I thought it would be at all.

First, just to simplify things a bit: there's no issue with toLocaleString; it's just that TIME_SIMPLE renders the same way in Japanese because they really just do hh:mm without the symbol for hour in there. The reason your cross-reference with the Intl API isn't working is the same as why it left the minutes off: you wanted minute, not minutes, so it was it was saying "6 hours" instead of "6:32".

But back to the month lister. Here is basically how it works:

const formatted = new Intl.DateTimeFormat('ja-JP', {month: "short"}).formatToParts();
//=> [ { type: 'month', value: '9' }, { type: 'literal', value: '月' } ]

const  matching = results.find(m => m.type === "months"); //=> "9"

The reason for parsing out the month value is we don't know whether the locale is willing to give us the month by itself or if it will insist on adding other stuff around it that we don't want. (This may have improved in browser implementations since I wrote all this. Certainly at the time it was a mess, and doubly so for the Intl polyfill.) To make matters worse, we also support listing the "format" versions, where we have to explicitly pass { month: "short", day: "numeric" } to Intl to see if it uses different values for months for some grammatical reason (e.g. for Russian); that's actually why we have monthsFormat.

Unfortunately, for Japanese, we do want the stuff Intl thinks is extra. I wish the implementers of this had made "4月" the value and not a literal, since it's actually just what the short month is called.

To fix this, I'd need to:

  1. Trust that when I pass in { month: "short" } that I can use the whole answer across every result.
  2. Have some very strange asymmetry with the format versions of the listers

I don't know if I can do 1, and 2 sounds really gross. This will need some more thought and research.

icambron avatar Sep 22 '19 10:09 icambron

Ok first confirmed that for the time example I needed minute not minutes, so that's working fine.

For the months, can you give an example of Intl "adding other stuff around it that we don't want"? I'm not quite sure what you mean. Also I'm not sure exactly what the "format versions" are.

andrewplummer avatar Sep 22 '19 10:09 andrewplummer

The Intl API doesn't guarantee that it will only return the thing you ask for. Here's an example from English:

Intl.DateTimeFormat("en-us", { era: "long"  }).format() //=> '9 22, 2019 Anno Domini'

I don't know if it ever happens for months specifically.

See Info.monthsFormat

icambron avatar Sep 22 '19 10:09 icambron

I see... that's definitely rough. It seems to be a bug on the DateTimeFormat side though. However, my hope would be that Luxon would pass those through as is (using all the parts even if they're incorrect) and allow application logic to handle the issues, as that would be preferable to masking data that can never be retrieved.

andrewplummer avatar Sep 22 '19 12:09 andrewplummer

I think the bug on the DateTimeFormat side is not including the symbol in the value portion. For example here is how the Intl polyfill handles it:

intlPoly.DateTimeFormat("ja-JP", { month: "short" }).formatToParts() //=> [ { type: 'month', value: '9月' } ]

...which means Luxon gets it right too.

The same code used for listers is also used for getting the tokens for formatters, so in your suggested scheme, if just took the results back on the assumption that it always provides the exact value, I'd end up using "9 22, 2019 Anno Domini" every time someone used the GG format token. So I can't be that flip about just taking strings as they come.

I don't have a great grasp on all the details of the Intl spec, but it's also not clear to me at all that this kind of behavior is actually a bug in Intl. For example, the spec does say the environment must support these combos:

  • weekday, year, month, day, hour, minute, second
  • weekday, year, month, day
  • year, month, day
  • year, month
  • month, day
  • hour, minute, second
  • hour, minute

...but not necessarily others, where it could choose to use a superset of what the caller asked for. I don't see where it says that it must directly support "month" (or "weekday" or etc) by itself. If there is such a rule (could be; again, I don't know the spec super well) and it's implemented accurately on the major browsers, I can put a condition on the units that work as expected and extract the strings via format() instead of formatToParts()

icambron avatar Sep 22 '19 21:09 icambron

I think the bug on the DateTimeFormat side is not including the symbol in the value portion.

Well I'm not familiar with the spec or how it's supposed to act (I'm sure it's huge), however formatToParts does return a "literal" token that seems to be meant to be taken as semantically distinct, so on its face the fact that it isn't included in the "value" doesn't really seem like a bug to me (in Japanese it's something like "dollars" that conveys the unit but not the value itself).

Luxon is it's own library so it doesn't necessarily have to match, but I would say Intl polyfill is definitely not correct in it's function as a polyfill if it's not matching native behavior... that is unless it actually is a bug.

so in your suggested scheme... I'd end up using "9 22, 2019 Anno Domini"...I can't be that flip about just taking strings as they come.

I realize that Luxon is a mature library so this will likely sound glib, but not only do I think this would be preferable but for me this is the expected behavior. I think the issue, one that happens with many other libraries, is that Luxon is trying to wrap Intl but also "fix" it at the same time. Fixing might be desirable but the problem is that there's no mechanism to opt out of this or intercept the behavior in the middle (unless I missed it?).

For me it would be much more valuable to have a "Principle of least astonishment" approach where I know exactly how things behave because Intl behaves the same. From there, polyfilling or patching broken behavior could be desirable, but only if there is an explicit spec that can be followed, and only if it is an explicit second step.

This is only an opinion of course, but if you don't take the tokens as Intl gives them to you then what would be a good fix here? The only other course I can see would be "patching the patch" which seems to be a decent indication that the approach might be incorrect.

Also I wonder if there isn't a better example than "era"... aside from being (arguably) not very useful, after checking both Chrome and FF on both OSX and Windows, they're all providing the same thing, which suggests to me maybe there's a reason for this (or there actually is a spec they're following here).

One approach that might also be clear is having default formatting tokens that follow Intl, but "patched" tokens that fix potential issues, for example GG as the Intl reported era but GGX or something to indicate something patched that doesn't follow native behavior. That would satisfy being able to "opt in" ... it doesn't have to be something long and verbose.

andrewplummer avatar Sep 22 '19 21:09 andrewplummer

But it's not semantically distinct, which is actually the bug you've filed here! The bug here is "I expect the short name of March in Japanese to be '3月'". If understand correctly, this is like calling March "Third Month", which makes the whole thing the name. The Intl implementation disagrees with you, saying that the name of the month is "Third" and that "Month" is a sort of extra label or a bit of punctuation, and thus Luxon would be right in returning "Third". literal is used for stuff like the : in 5:30. I suspect you are right about what the result should be and the Intl implementation is wrong, but I do think that's the right framing. So in my view, it's you who are recommending "patching" Intl.

It's not about following Intl or not following Intl, though, or about patching. Intl is not meant and has never been meant for things like this. It's a tool for formatting date strings, where you tell it what kinds of things you want in a date string and it writes a user-facing string that says them. If they wanted it to be an API for figuring out the names of months in different languages, they'd have made an API to ask for that. Again, as far as I know, Intl has no responsibility for responding to { era: "long"} with just the era string, which is why they don't bother; it's enough that they make a date string that includes it. Instead, what's happening here is that Luxon is abusing Intl to extract specific strings from it. So picking apart its not-really-for-this response is definitely Luxon's job.

Luxon is used by thousands of people with a huge range of use cases. This kind of thing happens a lot around here, and even more in Moment, where someone is right about something in the context of their particular use case, but wrong about the degree to which the things they don't care about matter to other people. I do not think that formatting GG as "9 22, 2019 Anno Domini" would strike the people who need GG as consistent with the principle of least astonishment when it shows up in their webpage.

If you can prove that for, say, months and weekdays, that all major implementations return the right thing for all locales, such that we can directly use the output of format() as the string, then I will change those particular ones. But I'm not willing to break things to fix Japanese month listings.

icambron avatar Sep 23 '19 00:09 icambron

I see... well the example you gave with : is interesting, and I think I can see your viewpoint here. However, I think the fundamental problem is that even though the spec is somewhat vague, the fact that a "literal" is an "extra label or a bit of punctuation" (that's as good of a way to put it as anything I've seen in the spec) doesn't mean that it isn't vital to the representation of the token and can be thrown away. Actually, : is itself a good example. If you ask for a format with both minutes and hours in English, it will return this literal, and this is vital for the representation of the time even though it isn't part of the actual value itself. Similarly, is vital for the representation of the month in Japanese, but it is a label and not part of the value, so in that case I still think it makes sense that it's semantically distinct.

I can see how this could be tricky, as if you were trying to format a date in Japanese, having this label or not would depend on the intended format, so you wouldn't necessarily want it all the time (they use formats like 2012-01-01 as well). But at the very least it can be said that it is vital for months as taken on their own, so for the case of something like Info.months or Info.weeks, it's hard to see how not keeping the literals wouldn't open the door to bugs (and indeed it seems to have).

Intl is not meant and has never been meant for things like this.... If they wanted it to be an API for figuring out the names of months in different languages, they'd have made an API to ask for that.

That's definitely true, and a bit surprising. However it seems to serve the purpose fine so far.

Instead, what's happening here is that Luxon is abusing Intl to extra specific strings from it. So picking apart it's not-really-for-this response is definitely Luxon's job.

Ok I see what you mean now. That's fair... to be clear, we're only talking about Info, right? Info doesn't somehow feed back into DateTime I take it? (Looks like it does below with the GG token... but what about otherwise?)

Luxon is used by thousands of people with a huge range of use cases. This kind of thing happens a lot around here, and even more in Moment, where someone is right about something in the context of their particular use case, but wrong about the degree to which the things they don't care about matter to other people.

Agreed, however in this case, at the moment, Info.months is critically failing at its fundamental use case. And it is a lot more of a use case than eras as well.

I do not think that formatting GG as "9 22, 2019 Anno Domini" would strike the people who need GG as consistent with the principle of least astonishment when it shows up in their webpage.

I see, so if I understand it right, the GG token is being fed by Info.eras which patches behavior because if you were to add { eras: 'long' } to the options object it would result in outputting the year and the month even if that's not what the format string asked for, is that correct? But no matter how you slice it this seems like inconsistent behavior with Intl specific to era, no? Is there a better example we can work with? Every other example I've seen returns just the information requested.

If you can prove that for, say, months and weekdays, that all major implementations return the right thing for all locales

They do. All CJK languages function like this so minimally I would suspect that the following locales are broken: ja-JP, ko-KR, zh-CN, zh-TW, zh-HK. Beyond that I'm not sure, CJK often has special cases so the issue might be localized to just these.

andrewplummer avatar Sep 23 '19 08:09 andrewplummer

zh-HK in Chrome actually has it as part of the month if you provide only month in the input:

new Intl.DateTimeFormat("zh-HK", { month: "long" }).formatToParts()
//=> [{type: "month", value: "12月"}]

new Intl.DateTimeFormat("zh", { month: "long", day: "numeric" }).formatToParts()
// => [{type: "month", value: "12"},
//=>   {type: "literal", value: "月"},
//=>   {type: "day", value: "10"},
//=>   {type: "literal", value: "日" }]

(That seems a little too subtle to me, but I see why they do it that way.)

That makes the lister in Luxon work like you expect:

Info.months('long', { locale: 'zh-HK' });
//=> ["1月", "2月", "3月", "4月", "5月", "6月", "7月", "8月", "9月", "10月", "11月", "12月"]

In other words, Luxon treating things that Intl claims are the actual month as the month and things it claims are not the month as not the month. Intl just has a bug for Japanese where it considers something that is not a literal as a literal.

At any rate, I did the analysis I needed to move forward: I ran through all the language codes in 639-1 checking for the month format, and in Chrome, only Japanese has a literal. In Safari and FF, Japanese is the same way and Dzongkha also has a prepended literal, but that appears to be a bug, because the literal is repeated in the month itself. So by hacking around the bug specifically for months, we could fix Japanese and break Dzongkha, which is probably the right tradeoff. I'd take a PR taking the whole value of month.

icambron avatar Dec 10 '19 07:12 icambron

ok well that seems like a decent workaround... although I'm still not sure I understand what's wrong with taking the literal as well... when it comes to formatToParts what's an example of a locale for which that would break something?

andrewplummer avatar Dec 10 '19 15:12 andrewplummer

What I'm suggesting is that for months, we are in the clear to do just that.

icambron avatar Dec 10 '19 16:12 icambron

Ah ok I see now... I might be a little while getting around to a PR but I'll try!

andrewplummer avatar Dec 10 '19 17:12 andrewplummer

Any update on this? Thanks.

ShahrukhAhmed89 avatar May 27 '24 07:05 ShahrukhAhmed89

Mostly a note to self, but maybe others will find this useful: I've read this thread and looked into the patch proposed by icambron and I've determined that the patch would occur in or near the formatToParts method in Locale.js, which calls on the Intl.dateTimeFormatter.formatToParts browser-defined method and then modifies the results depending on the time part being formatted. Currently we have special logic for part.type === "timeZoneName", and we'd need to add another special case for month to include the literal or pass both parts or whatever.

Before doing any type of patch, it would be wise to confirm that icambron's analysis from years ago still holds true for all language codes. If it seems reasonable, maybe add that check as a unit test.

dobon avatar Jun 11 '25 05:06 dobon

~~With the latest version of Luxon I can no longer reproduce this bug in Chrome 137, Firefox 136 or Node 22. If you encounter this problem again, feel free to comment here to we can reopen this ticket.~~

diesieben07 avatar Jun 12 '25 17:06 diesieben07

I'm on Chrome Version 137.0.7151.104 and Brave Version something-almost-current and can reproduce this bug. I'm using Windows 10.

> luxon.Info.months('short', { locale: 'ja-JP' })
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']
>luxon.Info.months('short', { locale: 'zh-HK' })
['1月', '2月', '3月', '4月', '5月', '6月', '7月', '8月', '9月', '10月', '11月', '12月']

Or am I missing something?

dobon avatar Jun 12 '25 18:06 dobon

Confirmed on MacOS as well:

luxon.Info.months('short', { locale: 'ja-JP' })
> (12) ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']

Using:

<script src="https://cdn.jsdelivr.net/npm/[email protected]/build/global/luxon.min.js"></script>

andrewplummer avatar Jun 12 '25 18:06 andrewplummer

Apparently I failed at copy-pasting properly multiple times, sorry. I will look into this later.

diesieben07 avatar Jun 12 '25 18:06 diesieben07

I ran a script to validate that the checks made by icambron are still valid and they are:

Language ja has extra parts for long [{"type":"month","value":"6"},{"type":"literal","value":"月"}] 6月
Language ja has extra parts for short [{"type":"month","value":"6"},{"type":"literal","value":"月"}] 6月
Language ja has extra parts for narrow [{"type":"month","value":"6"},{"type":"literal","value":"月"}] 6月
Language dz has extra parts for long [{"type":"literal","value":"སྤྱི་"},{"type":"month","value":"སྤྱི་ཟླ་དྲུག་པ"}] སྤྱི་སྤྱི་ཟླ་དྲུག་པ
Language dz has extra parts for short [{"type":"literal","value":"སྤྱི་"},{"type":"month","value":"ཟླ་༦"}] སྤྱི་ཟླ་༦
Language dz has extra parts for narrow [{"type":"literal","value":"སྤྱི་"},{"type":"month","value":"༦"}] སྤྱི་༦

By adding a special case for Info.months (not monthsFormat) we could fix "ja", but break "dz". I don't think it would be a good idea to try and do something even more involved, like only taking the literal when it follows the month to work around the dz break.

As for Info.monthsFormat, I don't think it can be fixed, because there is no good way to know which parts to take and which not to take.

I will include the fix for Info.months in the next version.

My script:

const languages = [...]; // obtained from https://unicode.org/cldr/charts/47/supplemental/locale_coverage.html
const monthOptions = ["long", "short", "narrow"];
const date = new Date();

for (const language of languages) {
  for (const monthOption of monthOptions) {
    const intl = new Intl.DateTimeFormat(language, { month: monthOption });
    const parts = intl.formatToParts(date);
    if (parts.some(part => part.type !== "month")) {
      console.log(`Language ${language} has extra parts for ${monthOption}`, JSON.stringify(parts), intl.format(date));
    }
  }
}

diesieben07 avatar Jun 12 '25 22:06 diesieben07

Thanks for looking into this and writing the patch, @diesieben07

I read your commit and noticed the commented out if statement in the extract method and am curious why that code won't work for both months and monthsFormat, instead of choosing between the two mappers in the months method. You mention that 'there is no good way to know which parts to take and which not to take', but it seems to me that the commented-out if statement is narrow enough to ensure that it only ever applies to japanese language and when extracting the months field, so we can always know there is a "月" literal after the month. To be perfectly clear about what I'm suggesting code-wise, without making a PR myself just to ask a question: 2 hypothetical edits to 1a51faa58f0d96da6ec70851f9b240ec37cd25a7

  1. Revert/revise the changes to Info.months method so that it always uses (dt) => this.extract(dt, intl, "month") as the mapper, like it did before;
  2. Uncomment the if statement in the extract method.

leveraging the narrowness of the if statement's conditions, the extract method could be simplified to something like:

extract(dt, intlOpts, field) {
  const df = this.dtFormatter(dt, intlOpts),
    results = df.formatToParts(),
    matching = results.findIndex((m) => m.type.toLowerCase() === field);
  if (matching === -1) return null;
  const value = results[matching].value;
  // #549 - Special Case: Japanese language formatToParts outputs 月 as "literal" instead of "month"
  if (field === "month" && this.intl.startsWith("ja-JP")) {
    return value + results[matching + 1].value;
  }
  return value;
}

What I like about this "one mapper + handle special case in extract" variation is that it is subjectively very clear to me what the special case is and where/how it is being handled (as opposed to the choose-a-mapper logic where it is less clear, imo), and it handles monthsFormat at the same time as months. Doing so passes all unit tests, including the new ones, but you obviously commented it out and took a different approach on purpose, so I'm curious if there's a subtle gotcha that I haven't bumped into yet. Cheers.

dobon avatar Jun 13 '25 08:06 dobon

Hi @dobon,

I was not sure which way to take either and ultimately decided on not special casing a locale. I didn't pay close enough attention and accidentally left in the commented code. I really don't like either solution. What I don't like about checking for the literal is that I fear it might have side-effects in case this is ever fixed in browsers. Calling format() with { months: "xxx" } is something that I feel is less likely to break than trying to detect what formatToParts produces.

diesieben07 avatar Jun 16 '25 20:06 diesieben07

I agree that it is a hacky workaround for a bug in not-our-code no matter how you slice it, but for what it is worth my preference is still for the explicit special casing a locale approach, along with the unit tests you added that will act as a canary for any changes to browser implementation in the future. I fear that the pick-a-mapper approach risks trading the idiosyncrasies of formatToParts for the idiosyncrasies of format, which to be fair seems to work just fine and solve the problems we are facing right now, but hides the details of the specific workaround we are addressing (ie: that it is a narrow bug only with japanese months), and it introduces an unnecessary inconsistency where the behavior is fixed for Info.months but not for Info.monthsFormat.

In my opinion, the unit tests acting as a canary for future fixes to not-our-code is what makes the explicit special casing a locale approach the better choice. In principle, I'd like to see a special case for ja and another for dz, along with canary unit tests that will signal if/when those special cases are no longer needed. In doing so, if/when the browser behavior changes we will be alerted to the change pretty quickly and know to remove the explicit workaround code, as opposed to the pick-a-mapper approach where if/when the browser implementation fix their bug we won't notice because our code deliberately avoids that part of the browser implementation, and so we'll keep the workaround logic after it is no longer needed.

Ideally we wouldn't have this kind of workaround at all, but if we must then I think it is preferable to make the workaround as narrow as possible and to have a canary unit test to alert us if/when the workaround can be removed. Thanks for your thoughtful reply, I appreciate it.

dobon avatar Jun 17 '25 08:06 dobon