php-imap
php-imap copied to clipboard
Decoding the subject with iconv_mime_decode()
Describe the bug
Currently Webklex/php-imap by default uses 'utf-8' decoder and \imap_utf8() function to decode subject:https://github.com/Webklex/php-imap/blob/abf080eb745e21df7a3070ca356248dda021d4ad/src/Header.php#L424
This approach does not always decode subjects properly.
For example Subject: =?ISO-2022-JP?B?GyRCIXlCaBsoQjEzMhskQjlmISEhViUsITwlRyVzGyhCJhskQiUoJS8lOSVGJWolIiFXQGxMZ0U5JE4kPyRhJE4jURsoQiYbJEIjQSU1JW0lcyEhIVo3bjQpJSglLyU5JUYlaiUiISYlbyE8JS8hWxsoQg==?= will be decoded into garbled string:
$B!yBh(B132$B9f!!!V%,!<%G%s(B&$B%(%/%9%F%j%"!W@lLgE9$N$?$a$N#Q(B&$B#A%5%m%s!!!Z7n4)%(%/%9%F%j%"!&%o!<%/![(B
So it does not make much sense to use by default the approach which is not 100% reliable.
Some time ago we've switched to decoding the subject simply with iconv_mime_decode():
$value = iconv_mime_decode($value, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, "UTF-8");
It works always and properly - https://github.com/freescout-helpdesk/freescout/issues/2965#issuecomment-1534134241.
So maybe it worth making iconv_mime_decode() the default and the only approach for decoding the subject like this: https://github.com/freescout-helpdesk/freescout/blob/master/overrides/webklex/php-imap/src/Header.php#L208
Function:
https://github.com/freescout-helpdesk/freescout/blob/dist/app/Misc/Mail.php#L909 (decodeSubject())
Related issue: https://github.com/Webklex/laravel-imap/issues/295
Hi @freescout-helpdesk , many thanks for your report and suggestion. I improved the decoder and added a new test case for the mentioned subject above.
Best regards & happy coding,
@Webklex Attachment::decodeName() has the same problem
Hi @nuernbergerA , thanks for your report. I just pushed a new test including your mentioned scenario but can't reproduce the described issue. Are you using the latest release? If not, please update and try again. There's a chance that the problem might have already been fixed in a newer release.
If the issue persists, please head over to https://github.com/Webklex/php-imap/issues/new?assignees=&labels=&projects=&template=bug_report.md and provide as much information as possible. Every little bit, even if it might seem unrelated could help to solve the mystery :)
Best regards & happy coding,
@Webklex i've opened a PR with a failing test for my issue ;)
@Webklex i've opened a PR with a failing test for my issue ;)
We've checked. This our function https://github.com/freescout-helpdesk/freescout/blob/dist/app/Misc/Mail.php#L883C8-L883C8 also successfully works for decoding headers and attachment names including your example:
=?iso-8859-1?Q?2021=5FM=E4ngelliste=5F0819306.xlsx?= >> 2021_Mängelliste_0819306.xlsx
@nuernbergerA thanks for your input and your pull request. Please update to v5.5 and give it another try :)
@freescout-helpdesk how are you decoding attachment names / filenames? If you are relying on Attachment::decodeName(), please note that this if statement was reversed:
https://github.com/Webklex/php-imap/blob/5.5.0/src/Attachment.php#L296-L300
The fallback was used as default, which works often but can cause unexpected behavior (https://github.com/Webklex/php-imap/pull/421) in some cases :/
Best regards & happy coding,
@freescout-helpdesk how are you decoding attachment names / filenames?
We are using the same our MailHelper::decodeSubject() function to decode attachment names - it works fine.
@Webklex
$personal = '=?gb2312?B?z+O42yCDfN14u80=?='; imap_mime_header_decode() The result obtained by this method:��� �|�x��
public function testUTF8Content()
{
$personal = '=?gb2312?B?z+O42yCDfN14u80=?=';
$personal1 = imap_mime_header_decode($personal);
echo '<pre>';
print_r(['personal1'=>$personal1]);
$header = new Header('');
$personalParts = $header->mime_header_decode($personal);
$personal2 = '';
foreach ($personalParts as $p) {
print_r(['pText'=>$p->text]);
$personal2 .= $header->convertEncoding($p->text, $header->getEncoding($p));
}
print_r(['personal2'=>$personal2]);
}
public function testUTF8Content()
{
$personal = '=?gb2312?B?z+O42yCDfN14u80=?=';
$personal2 = imap_utf8($personal);
print_r(['personal2'=>$personal2]);
$header = new Header('');
$personalParts = $header->mime_header_decode($personal);
$personal3 = '';
foreach ($personalParts as $p) {
print_r(['pText'=>$p->text]);
//$personal3 .= $header->convertEncoding($p->text, $header->getEncoding($p));
$personal3 .= iconv($header->getEncoding($p), 'UTF-8//IGNORE', $p->text);
}
print_r(['personal3'=>$personal3]);
}
I have a related problem. I received an email with an attachment, whose filename is "iso-8859-1''Elk%FCld%F6tt%20felsz%F3l%EDt%E1s%20ELK-F-24011370.pdf"
The filename gets decoded as Elk�ld�tt felsz�l�t�s ELK-F-24011370.pdf
After url decoding the string it should be converted to UTF-8:
if (str_contains($name, "''")) {
$parts = explode("''", $name);
if (EncodingAliases::has($parts[0])) {
$name = implode("''", array_slice($parts, 1));
}
}
[...]
// check if $name is url encoded
if (preg_match('/%[0-9A-F]{2}/i', $name)) {
$name = urldecode($name);
}
It should be something like:
if (str_contains($name, "''")) {
$parts = explode("''", $name);
if (EncodingAliases::has($parts[0])) {
$charset = $parts[0];
$name = implode("''", array_slice($parts, 1));
}
}
[...]
// check if $name is url encoded
if (preg_match('/%[0-9A-F]{2}/i', $name)) {
$name = urldecode($name);
if (isset($charset)) {
$name = mb_convert_encoding($name, 'UTF-8', $charset);
}
}
I am not sure of the implications, so I haven't done a pull request yet.
Actually it seems that the first part of the code is not equivalent to the RFC 2231 Specification, where a string can be represented like charset'language'encoded-text, in my case iso-8859-1''Elk%FCld%F6tt%20felsz%F3l%EDt%E1s%20ELK-F-24011370.pdf which translates in:
- charset: iso-8859-1
- language: empty
- encoded-text: Elk%FCld%F6tt%20felsz%F3l%EDt%E1s%20ELK-F-24011370.pdf
So I think, this:
if (str_contains($name, "''")) {
$parts = explode("''", $name);
if (EncodingAliases::has($parts[0])) {
$name = implode("''", array_slice($parts, 1));
}
}
may be replaced with:
$parts = explode("'", $name, 3);
if (EncodingAliases::has($parts[0])) {
list($charset, $language, $name) = $parts;
}
My pull request did not pass the following test:
AttachmentLongFilenameTest.php
self::assertEquals("f7b5181985862431bfc443d26e3af2371e20a0afd676eeb9b9595a26d42e0b73", hash("sha256", $attachment->filename));
The assertion in the test assumes that the output filename of iso-8859-15''%30%31%5F%41%A4%E0%E4%3F%3F%3F%3F%40%5A%2D%30%31%32%33%34%35%36%37%38%39%2D%71%77%65%72%74%79%75%69%6F%70%61%73%64%66%67%68%6A%6B%6C%7A%78%63%76%62%6E%6D%6F%70%71%72%73%74%75%76%7A%2D%30%31%32%33%34%35%36%37%38%39%2D%71%77%65%72%74%79%75%69%6F%70%61%73%64%66%67%68%6A%6B%6C%7A%78%63%76%62%6E%6D%6F%70%71%72%73%74%75%76%7A%2D%30%31%32%33%34%35%36%37%38%39%2D%71%77%65%72%74%79%75%69%6F%70%61%73%64%66%67%68%6A%6B%6C%7A%78%63%76%62%6E%6D%6F%70%71%72%73%74%75%76%7A%2E%74%78%74 is in its original encoding and not in utf-8.
Maybe I am missing something, but should this be the intended behaviour (leaving the original encoding as is), how can I get the encoding of the filename in the Attachment class, so I can convert it myself if needed?