mail-api Returning non-UTF-8 charset from getDefaultMIMECharset() provides negative value these days

getDefaultMIMECharset() tries to use a charset from a system property and falls back on the system encoding.

It would make more sense to use UTF-8 unconditionally. MUAs have been able to consume UTF-8 for a long time now, and e.g. Gmail and Apple Mail always send UTF-8, so an MUA unable to ingest UTF-8 is unable to deal with email from the most popular MUAs.

These days, making an effort to use a legacy encoding for outgoing email is more likely to cause compatibility problems than fix them, so using a legacy encoding (i.e. anything but UTF-8) for outgoing email doesn't make sense.

(I came here due to a report that at least some old version of JavaMail emits pre-java.nio Java encoding names (e.g. ISO8859_1 instead of iso-8859-1), which is incompatible with clients that implement Encoding Standard-compliant label resolution, such as Thunderbird. It's unclear to me if it's still possible for JavaMail to emit pre-java.nio names.)

Oct 18 '18 12:10 hsivonen

JavaMail should have been emitting ISO standard charset names from the beginning, at least for most charsets. JavaMail includes a mapping table to map from the old JDK charset names to the standard names. It's possible that some charsets weren't in that mapping table, but fortunately the JDK has moved to using ISO standard names so that shouldn't be an issue with recent versions of the JDK and JavaMail. If you find an instance where the wrong charset name is being used, please provide details.
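
As a rough illustration of the name mapping discussed here, the JDK's own alias table (not JavaMail's mapping table, which is not shown in this thread) already resolves several pre-java.nio names to their canonical java.nio names. The class and method names below are hypothetical and exist only for this sketch.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: uses only the JDK's Charset alias lookup,
// not JavaMail's own mapping table.
class CharsetNameSketch {

    // Returns the canonical java.nio name for a name the JDK recognizes.
    // Throws UnsupportedCharsetException if the JDK does not know the name.
    static String canonicalName(String javaName) {
        return Charset.forName(javaName).name();
    }

    public static void main(String[] args) {
        // Historical java.io/java.lang style names.
        String[] legacyNames = {"ISO8859_1", "UTF8", "ASCII"};
        for (String legacy : legacyNames) {
            System.out.println(legacy + " -> " + canonicalName(legacy));
        }
    }
}
```

A name that goes through a lookup like this before being written into a header would reach the recipient in its canonical form; a name that bypasses such a lookup would not.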

There are still some locales that use charsets other than utf-8. If the system is running in such a locale, it should use that charset. Many systems are running in utf-8 locales, so JavaMail will use utf-8. If your system isn't using utf-8 but you want to force it for JavaMail, you can set the mail.mime.charset System property to "utf-8". And if your system isn't running in a utf-8 locale, maybe you should ask why that is and consider fixing that.
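
A minimal sketch of the lookup order described in this thread (the mail.mime.charset System property first, then the system encoding), together with the override suggested above. The helper below is hypothetical and is not JavaMail's code; it uses Charset.defaultCharset() as a stand-in for "the system encoding", which is an assumption.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not JavaMail's implementation.
class DefaultMimeCharsetSketch {

    // Prefer an explicitly configured charset name; otherwise fall back
    // to the name of the charset the platform reports.
    static String resolve(String propertyValue, Charset systemCharset) {
        if (propertyValue != null && !propertyValue.isEmpty()) {
            return propertyValue;
        }
        return systemCharset.name();
    }

    public static void main(String[] args) {
        // Forcing UTF-8 as suggested in the comment above. The same effect
        // can be had on the command line with -Dmail.mime.charset=utf-8.
        System.setProperty("mail.mime.charset", "utf-8");

        String chosen = resolve(
                System.getProperty("mail.mime.charset"),
                Charset.defaultCharset()); // assumption: stand-in for the system encoding
        System.out.println(chosen);
    }
}
```

With the property unset, the result depends entirely on the second argument, which is the behavior this issue is questioning.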

Oct 18 '18 21:10 bshannon

> JavaMail should have been emitting ISO standard charset names from the beginning, at least for most charsets.

Apparently not for ISO-8859-1 according to the Thunderbird bug.

> There are still some locales that use charsets other than utf-8. If the system is running in such a locale, it should use that charset.

This issue is about disagreement on what "should" happen if the system locale is associated with a non-UTF-8 encoding. I'm saying that these days, it's a better guess that the recipient can receive UTF-8 than that the recipient can receive what the sender's system considers as its local encoding.

Oct 19 '18 10:10 hsivonen

It is possible that old versions of JavaMail did not apply the mapping correctly, but no reproducible bug has been reported. And of course it's possible that some application is using JavaMail incorrectly and thus causing this bug. For example, some people include JavaMail classes in their application without also including all the configuration files, including this mapping table. An application that makes this mistake might exhibit this bug. If you can provide more information about the application that exhibits the problem in the Thunderbird bug report, we might be able to determine the source of the problem. Otherwise, I'm not sure what you expect me to do about that.

As for what the default behavior should be, that's a more difficult question. My understanding is that many non-European locales still use local charsets, and changing to utf-8 might impact them negatively. This is likely more of a concern for the use of JavaMail in automated email processing than it is for email messages read by humans. In any event, this is not something that could be considered until JavaMail 1.7.

Oct 19 '18 22:10 bshannon

> Otherwise, I'm not sure what you expect me to do about that.

I expect it to be taken as a data point of the hazards of non-use of UTF-8. Specifically, had UTF-8 been used, an Encoding Standard-compliant receiver could have dealt with either the pre-java.nio name or the java.nio name.
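
To make the point concrete, here is a simplified sketch of Encoding Standard-style label resolution. The table is a small, partial subset of the standard's labels, the whitespace and case handling are simplified, and the class and method names are hypothetical; this is not Thunderbird's code.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical, partial model of Encoding Standard label lookup.
class LabelResolutionSketch {

    // Partial subset of labels; the real standard defines many more.
    // Note that the ISO-8859-1 labels map to windows-1252 in the standard.
    private static final Map<String, String> LABELS = Map.ofEntries(
            Map.entry("utf-8", "UTF-8"),
            Map.entry("utf8", "UTF-8"),
            Map.entry("unicode-1-1-utf-8", "UTF-8"),
            Map.entry("iso-8859-1", "windows-1252"),
            Map.entry("iso8859-1", "windows-1252"),
            Map.entry("iso_8859-1", "windows-1252"),
            Map.entry("latin1", "windows-1252"),
            Map.entry("windows-1252", "windows-1252"));

    // Simplification: trim() and toLowerCase(Locale.ROOT) approximate the
    // standard's "strip ASCII whitespace, ASCII-lowercase" steps.
    static Optional<String> resolve(String label) {
        String key = label.trim().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(LABELS.get(key));
    }

    public static void main(String[] args) {
        System.out.println("UTF8      -> " + resolve("UTF8"));      // pre-java.nio name
        System.out.println("UTF-8     -> " + resolve("UTF-8"));     // java.nio name
        System.out.println("ISO8859_1 -> " + resolve("ISO8859_1")); // pre-java.nio name
    }
}
```

In this model both Java spellings of UTF-8 resolve, while the underscore spelling ISO8859_1 has no matching label and fails.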

> My understanding is that many non-European locales still use local charsets, and changing to utf-8 might impact them negatively.

Whether legacy encodings are still used by someone else for sending email is not relevant. The relevant question is whether there are recipients that could successfully receive email in the sender's JRE-reported non-UTF-8 system encoding but could not successfully receive UTF-8 email.

All macOS system locales use UTF-8. On the Red Hat side of the Linux world, all locales have used UTF-8 by default since 2002. On the Debian side of the Linux world, all locales have used UTF-8 by default since 2007 (earlier for Ubuntu). The Windows 8-bit legacy encoding is not a useful measure of what is needed, because Windows has migrated to UTF-16 APIs instead of migrating its 8-bit locale-specific defaults.

Gmail and Apple Mail (without a third-party hack; see below) send only UTF-8.

Therefore, recipients that cannot receive UTF-8 email but could receive email in a non-UTF-8 JRE-reported system encoding would not be able to receive email from Gmail, Apple Mail, JavaMail running on Mac or JavaMail running on Linux distros with the default settings.

In the case of Traditional Chinese, especially for Hong Kong supplementary characters, the legacy situation is less interoperable than UTF-8, so Thunderbird removed the ability to send email as Big5. (Thunderbird also removed the ability to manually change the encoding for outgoing email from the UTF-8 default to a Cyrillic, Central European, Thai, Hebrew or Arabic legacy encoding. The ability to manually set ISO-8859-7, windows-1252, EUC-KR, ISO-2022-JP, gbk or gb18030 for outgoing email is still there, but probably wouldn't need to be, except maybe for ISO-2022-JP. The UI also says ISO-8859-1, but it has the same effect as choosing windows-1252.)

If an extension created for Apple Mail, and Gmail forum complaints (can't find a reference right now) from some years ago, are any indication, the last place where email recipients had trouble with UTF-8 was Japan. Even there, the preferred non-UTF-8 email encoding was ISO-2022-JP, which isn't a system encoding anywhere, so even on the assumption that Japan still needs non-UTF-8, getDefaultMIMECharset() doesn't pick the legacy encoding most likely to work. (Thunderbird changed the default outgoing encoding for the Japanese locale from ISO-2022-JP to UTF-8 at version 52.)

> This is likely more of a concern for the use of JavaMail in automated email processing than it is for email messages read by humans.

So the concern is JavaMail on a Windows host sending email to some automated process (as opposed to a human-operated email client) that also runs on Windows, in the same locale, and can't receive UTF-8?

Oct 20 '18 14:10 hsivonen

>> Otherwise, I'm not sure what you expect me to do about that.
>
> I expect it to be taken as a data point of the hazards of non-use of UTF-8. Specifically, had UTF-8 been used, an Encoding Standard-compliant receiver could have dealt with either the pre-java.nio name or the java.nio name.

If utf-8 had been forced, there would be no use of pre-java.nio names.

>> My understanding is that many non-European locales still use local charsets, and changing to utf-8 might impact them negatively.
>
> Whether legacy encodings are still used by someone else for sending email is not relevant. The relevant question is whether there are recipients that could successfully receive email in the sender's JRE-reported non-UTF-8 system encoding but could not successfully receive UTF-8 email.

Exactly.

>> This is likely more of a concern for the use of JavaMail in automated email processing than it is for email messages read by humans.
>
> So the concern is JavaMail on a Windows host sending email to some automated process (as opposed to a human-operated email client) that also runs on Windows, in the same locale, and can't receive UTF-8?

This has nothing to do with Windows. But yes, some legacy mail processing applications may assume that the message is in a certain encoding without even checking the specified charset.

Also, when attaching a file to a message, the file will most often be in the encoding used by the local system. If that encoding is not specified when the file is attached, JavaMail will choose the local system default encoding and add an explicit charset parameter to the message part. Forcing the default to be utf-8 would result in an incorrect description for the local file data.
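
The mislabeling concern can be shown with the JDK alone. The sketch below assumes file data stored as ISO-8859-1 and a receiver that decodes according to the declared charset; the class and method names are hypothetical, and no JavaMail API is involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: what a receiver sees when bytes are decoded
// with the charset named in the part's charset parameter.
class MislabeledAttachmentSketch {

    // Decode raw bytes using the charset the message claims they are in.
    static String decodeAs(byte[] data, Charset declared) {
        return new String(data, declared);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café", written with an escape
        // Assumption: the local file was saved in ISO-8859-1.
        byte[] fileBytes = original.getBytes(StandardCharsets.ISO_8859_1);

        // Correct label: the text round-trips.
        System.out.println(decodeAs(fileBytes, StandardCharsets.ISO_8859_1).equals(original));

        // Wrong label: the lone 0xE9 byte is not valid UTF-8 and is
        // replaced with U+FFFD, so the text no longer matches.
        System.out.println(decodeAs(fileBytes, StandardCharsets.UTF_8).equals(original));
    }
}
```

This only covers bytes attached as-is; text supplied as a Java String can be encoded to whatever charset is declared, which is the separate case raised in the next paragraph.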

Changing the default charset to utf-8 for text supplied as a Java Unicode String is worth considering, but it will have an impact on compatibility with previous releases of JavaMail. I'll look at this for JavaMail 1.7.

Oct 22 '18 19:10 bshannon
