php-imap

Attachments (file)names are not correctly decoded

Open marien-probesys opened this issue 2 years ago • 5 comments

Describe the bug

In some cases, attachment (file)names are not correctly decoded and contain invalid characters. This happens for names encoded like this: ISO-8859-1''caf%E9.txt. Note that this is not the encoded-word syntax (btw, I cannot find the name of this encoding, do you know it?). The ISO-8859-1 charset prefix is simply ignored.

Used config

    'options' => [
        'decoder' => [
            'message' => 'iconv',
            'attachment' => 'iconv',
        ],
    ],

Code to Reproduce

$clientManager = new \Webklex\PHPIMAP\ClientManager();

$clientManager->setConfig([
    'options' => [
        'decoder' => [
            'message' => 'iconv',
            'attachment' => 'iconv',
        ],
    ],
]);

$email = file_get_contents(__DIR__ . '/email.txt');

$message = \Webklex\PHPIMAP\Message::fromString($email);

foreach ($message->getAttachments() as $attachment) {
    $name = $attachment->getName();
    echo "Attachment: {$name}\n";
}

You can find an example of a problematic email here: email.txt (generated with Gnome Evolution).

Expected behavior

The attachment name should be café.txt, but it is caf�.txt.

Desktop / Server (please complete the following information):

  • OS: Docker image php:8.1-fpm (Debian I guess?)
  • PHP: 8.1
  • Version: 5.5.0
  • Provider: Gnome Evolution

Additional context

I was able to spot the issue.

In Attachment::decodeName, you test that $name contains the string '' and get the "real" name from it, but you drop the encoding. In my example, ISO-8859-1''caf%E9.txt becomes caf%E9.txt.

A few lines later, you urldecode() the name. Unfortunately, in my case, %E9 is the ISO-8859-1 byte for the character é, whereas in UTF-8 it would be %C3%A9. This means that we still need to convert the string from ISO-8859-1 to UTF-8 with EncodingAliases::convert($name, $encoding) ($encoding being $parts[0], extracted earlier).
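To make the proposed fix concrete, here is a minimal standalone sketch of the decoding steps described above. It is not the library's code: decodeExtendedName() is a hypothetical helper, and it uses plain iconv() where the actual fix would presumably go through EncodingAliases::convert(). It also uses rawurldecode() rather than urldecode(), on the assumption that a literal "+" in a filename should not become a space.

```php
<?php
// Hypothetical helper for illustration only; not part of php-imap.
function decodeExtendedName(string $raw): string
{
    // Expected shape: charset'language'percent-encoded-value
    $parts = explode("'", $raw, 3);
    if (count($parts) !== 3 || $parts[0] === '') {
        // Not in the extended form: return the input untouched.
        return $raw;
    }

    [$charset, , $value] = $parts;

    // Turn %XX sequences back into raw bytes (still in $charset).
    $bytes = rawurldecode($value);

    // Convert those bytes to UTF-8, keeping the charset that is
    // currently dropped. Fall back to the raw bytes on failure.
    $converted = @iconv($charset, 'UTF-8', $bytes);

    return $converted === false ? $bytes : $converted;
}

echo decodeExtendedName("ISO-8859-1''caf%E9.txt"), "\n";
```

The key point is the order of operations: the percent-decoding yields bytes in the declared charset, so the charset conversion has to happen after it and has to use the prefix extracted from the raw value.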

marien-probesys, Nov 30 '23 16:11

I had the same problem but with another config.

$clientManager->setConfig([
    'options' => [
        'decoder' => [
            'message' => 'utf-8',
            'attachment' => 'utf-8',
        ],
    ],
]);

My solution is to convert the name of the attachment like this:

echo mb_convert_encoding($attachment->getName(), 'UTF-8', 'ISO-8859-1');

bjaverhagen, Dec 13 '23 16:12

I did something similar too. The problem with this solution is that we don't know the encoding of the initial string, so if it's not ISO-8859-1, we end up with the same issue (the unsupported characters being replaced by question marks, which may look nicer). This has to be done at the PHP-IMAP level to work properly. Or can we access the raw name (e.g. ISO-8859-1''caf%E9.txt) to extract the encoding ourselves?
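As an illustration of that limitation, here is a slightly more defensive variant of the workaround, as a sketch only. nameToUtf8() is a hypothetical helper, and it assumes that getName() returns the raw, unconverted bytes. It avoids double-converting names that are already valid UTF-8, but the source charset is still a guess.

```php
<?php
// Hypothetical user-side workaround; the charset is assumed, not detected.
function nameToUtf8(string $name, string $assumed = 'ISO-8859-1'): string
{
    // Names that are already valid UTF-8 are returned unchanged,
    // which avoids double-encoding them.
    if (mb_check_encoding($name, 'UTF-8')) {
        return $name;
    }

    // Otherwise, interpret the bytes using the assumed charset.
    return mb_convert_encoding($name, 'UTF-8', $assumed);
}

echo nameToUtf8("caf\xE9.txt"), "\n";     // ISO-8859-1 bytes
echo nameToUtf8("caf\xC3\xA9.txt"), "\n"; // already UTF-8
```

If the original bytes were in some other single-byte charset, this would still produce the wrong characters, which is why the charset prefix from the raw name is needed.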

Side note: the issue does indeed also happen with the UTF-8 decoder. I've gone back to this decoder: the issues that I had with it were fixed after installing the PHP ldap extension. It would be worth a separate issue on GitHub, but I don't have much time these days. Don't hesitate to get back to me on this subject after the holidays :)

marien-probesys, Dec 14 '23 08:12