mail-mime-parser

Add configuration to set the default charset for content without a specified charset

Open johnss opened this issue 5 years ago • 18 comments

It seems that quoted-printable (QP) decoding doesn't handle sequences with multiple equal signs; it only handles a single one.

For example, =E2=80=93 should decode to – (an en dash), but it shows up as â followed by two invisible control characters.

johnss avatar Feb 21 '20 13:02 johnss

Hi @johnss --

Is this a mime-encoded quoted printable part, or part of a message body? What encoding is used for the part? Preferably a full example would help me test it/confirm the issue...

All the best

zbateson avatar Feb 21 '20 19:02 zbateson

It's part of the message body, read via the getHtmlContent() method. The part uses quoted-printable as its content transfer encoding, and the HTML is encoded as UTF-8.

johnss avatar Feb 22 '20 12:02 johnss

I created it with Chrome on Android and saved it as MHTML. It was originally a saved page from x.com, but I modified it to reproduce this issue. Chrome shows –, while getHtmlContent() returns the garbled sequence starting with â. Rename the attachment to a .mhtml or .mht extension to view it in Chrome: x.com.txt

johnss avatar Feb 22 '20 12:02 johnss

Here is the bin2hex result:

bin2hex('–'); // e28093
bin2hex("â\u{80}\u{93}"); // c3a2c280c293

johnss avatar Feb 22 '20 13:02 johnss
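
For readers following along, the two hex strings above can be reproduced without the library. This is a minimal sketch, assuming the mbstring extension is available and assuming the undeclared charset is treated as ISO-8859-1 (the thread suggests this further down, but it is not confirmed here). The variable names are illustrative only.

```php
<?php
// Decoding the quoted-printable sequence itself works fine: all three
// "=XX" escapes become the three UTF-8 bytes of an en dash.
$raw = quoted_printable_decode('=E2=80=93');
echo bin2hex($raw), "\n"; // e28093

// Assumption: with no charset declared, the bytes are read as ISO-8859-1
// and converted to UTF-8, so each byte becomes its own character.
$misread = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
echo bin2hex($misread), "\n"; // c3a2c280c293
```

Under that assumption the quoted-printable step is not at fault; the three bytes are decoded correctly and only the charset interpretation afterwards differs.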

Hi @johnss,

The html part of the message in your example doesn't correctly define a charset. You can manually override that if you want by calling setCharsetOverride, for example:

$message->getHtmlPart()->setCharsetOverride('utf-8');
echo $message->getHtmlContent();

All the best.

zbateson avatar Feb 23 '20 23:02 zbateson

Why isn't this mentioned in the docs? Please add it to the documentation.

johnss avatar Feb 24 '20 10:02 johnss

setCharsetOverride is only mentioned in the API docs generated by phpDocumentor, and few people visit those pages, so many developers aren't aware that the method exists. Please mention it on pages with higher traffic.

johnss avatar Feb 24 '20 11:02 johnss

What encoding is used when setCharsetOverride is not set? UTF-8 is the de facto standard used by nearly all websites, so why not default to UTF-8?

johnss avatar Feb 25 '20 14:02 johnss

Hi @johnss,

It's not a bad suggestion -- my understanding is UTF-8 is fully backwards-compatible with ISO-8859-1. In researching this a bit, I couldn't find a reason not to default to UTF-8, but also it surprised me that Thunderbird defaults to ISO-8859-1 given they're fully compatible.

I think the ideal would be to have the default configurable rather than setting an override for a single email... and have the default configured charset UTF-8.

I'd be interested to hear from others more knowledgeable on this -- any reason why we shouldn't default to UTF-8?

zbateson avatar Feb 27 '20 19:02 zbateson

Looking more closely at this, UTF-8 and ISO-8859-1 are only the same for 0-127 (ASCII). This causes problems if an email contains non-ASCII characters and expects the default to be considered ISO-8859-1 instead of UTF-8. Setting the default to UTF-8 causes tests/_data/emails/m0009 to fail, but not tests/_data/emails/m0008 -- m0009 is ISO-8859-1 encoded without specifying a charset, m0008 is UTF-8 encoded. You can also note the differences in the files as they're the same text, the UTF-8 variant uses multiple bytes to encode codepoints above 127, whereas the ISO-8859-1 variant doesn't.

Instead, an option to change the default could be made available, if you're interested in submitting a pull request.

zbateson avatar Apr 10 '20 21:04 zbateson
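
The byte-level difference described in that comment can be seen with a short, library-independent sketch. It assumes the mbstring extension is available; the sample word "Grüße" and the variable names are illustrative and are not taken from the m0008/m0009 test files.

```php
<?php
// "Grüße" written with escapes so the example does not depend on the
// encoding of this source file. PHP encodes \u{...} as UTF-8.
$utf8   = "Gr\u{FC}\u{DF}e";
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";   // 4772c3bcc39f65 (two bytes each for ü and ß)
echo bin2hex($latin1), "\n"; // 4772fcdf65 (one byte each)

// The ASCII letters G, r, e are byte-identical in both encodings.
// The ISO-8859-1 bytes are not valid UTF-8, so a UTF-8 default
// cannot decode them correctly.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// The reverse mistake: reading UTF-8 bytes as ISO-8859-1 never fails,
// it silently produces mojibake instead.
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($mojibake), "\n"; // 4772c383c2bcc383c29f65
```

Whichever default is chosen, text in the other encoding without a charset declaration is misread, which is why a configurable default is being discussed rather than a fixed one.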

My reading of the RFC (see https://github.com/zbateson/mail-mime-parser/issues/133#issue-675312518) is that if you use non-ASCII characters, you must declare a charset in the Content-Type header. Right?

ThomasLandauer avatar Aug 11 '20 13:08 ThomasLandauer

Yeah, although there's no harm in expanding that to either ISO-8859-1 or UTF-8, as they're identical for the first 128 code points (0-127, i.e. ASCII).

zbateson avatar Aug 11 '20 15:08 zbateson

First: I'm not sure ->setCharsetOverride() is actually doing what you had in mind. My understanding from https://github.com/zbateson/mail-mime-parser/issues/133#issuecomment-670775985 was that it would only set a default charset (i.e. it only makes a difference if there is no charset declaration in the mail). However, it actually overrides whatever is defined in the mail.

Second:

"any reason why we shouldn't default to UTF-8?"

To sum it up, the situation is: The RFC demands that you declare a charset if you use non-ASCII characters. And your question is: If somebody does not stick to this (i.e. no charset declaration), which one should you use as the default?

I wanted to provide some data for this from the mails I'm currently analyzing. (They're mostly German, so probably every single one does contain some non-ASCII characters.) Well, but since ->setCharsetOverride() isn't doing what I thought it would do (see above), there are no results ;-)
If you include a function that really just sets the default charset, I could try again.

ThomasLandauer avatar Aug 20 '20 17:08 ThomasLandauer

The point is though, that you can check if a charset isn't set, and use setCharsetOverride if it isn't, thereby setting your own default charset using that.

zbateson avatar Aug 21 '20 22:08 zbateson

I don't know why there are no results on your specific case and emails without further details of what you're doing.

zbateson avatar Aug 21 '20 22:08 zbateson

"and use setCharsetOverride if it isn't, thereby setting your own default charset using that."

Well, if I override the existing charset, it's not "default" anymore! Default means: If there is no value, use this one.

Do you want me to run this check at all? If yes, please give me the code part I'm missing: Check if there is a charset declared - for the entire message or just for the text/plain part.

ThomasLandauer avatar Aug 22 '20 10:08 ThomasLandauer

"Well, if I override the existing charset, it's not "default" anymore! Default means: If there is no value, use this one."

We're running in circles a bit here :stuck_out_tongue: .

I said "you can check if a charset isn't set, and use setCharsetOverride if it isn't"

You can call $part->getHeaderParameter('content-type', 'charset'); and check if the return value is null.

zbateson avatar Aug 24 '20 17:08 zbateson
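
Put together, the check suggested in that comment might look like the sketch below. The function applyDefaultCharset() is a hypothetical helper and is not part of mail-mime-parser; it only relies on the two methods named in this thread, getHeaderParameter() and setCharsetOverride(), and on the statement above that a missing charset parameter yields null.

```php
<?php
// Hypothetical helper (NOT part of mail-mime-parser): applies a fallback
// charset only when the part does not declare one itself.
// Returns true if the fallback was applied, false otherwise.
function applyDefaultCharset($part, string $default = 'UTF-8'): bool
{
    if ($part === null) {
        return false; // nothing to do, e.g. no such part was found
    }
    // Per the comment above, null means no charset parameter is present
    // in the part's Content-Type header.
    if ($part->getHeaderParameter('content-type', 'charset') !== null) {
        return false; // a declared charset is left untouched
    }
    $part->setCharsetOverride($default);
    return true;
}
```

With the example from earlier in the thread, this would be called as applyDefaultCharset($message->getHtmlPart(), 'UTF-8'); before echo $message->getHtmlContent();. Because the override is only set when no charset is declared, it behaves like a default rather than an override.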

I can now report from the emails I'm analyzing: Before, 0.36% had text in UTF-8, but without a charset declaration (and were therefore displayed wrong).

With the code from https://github.com/zbateson/mail-mime-parser/issues/136#issuecomment-680022755 I now have 0.02% that have text in ISO-8859-1 (or similar) without a charset declaration (and are therefore displayed wrong).

So implementing what this issue asks for (a configuration to let the user set the default charset to e.g. UTF-8) is a good idea IMO, since it reduces the problem cases by more than a factor of 10.

Just for the record: It looks like most (German-speaking) companies that do not declare a charset send text in UTF-8 (rather than ASCII, as the RFC says).

ThomasLandauer avatar Aug 25 '20 13:08 ThomasLandauer