
$message->getTextBody() retrieves whole source, not just the plain text message

Open TonyMarston opened this issue 1 year ago • 12 comments

I have a message containing plain text, no html and no attachments, but when I use ->getTextBody() it returns the entire source code and not just the message text. The source code is as follows:

Return-Path:
Delivered-To: [email protected]
Received: from ion.dnsprotect.com by ion.dnsprotect.com with LMTP id oPy8IzIke2Rr4gIAzEkvSQ (envelope-from ) for ; Sat, 03 Jun 2023 07:29:54 -0400
Return-path:
Envelope-to: [email protected]
Delivery-date: Sat, 03 Jun 2023 07:29:54 -0400
Received: from [::1] (port=48740 helo=ion.dnsprotect.com) by ion.dnsprotect.com with esmtpa (Exim 4.96) (envelope-from ) id 1q5PSF-000nPQ-1F for [email protected]; Sat, 03 Jun 2023 07:29:54 -0400
MIME-Version: 1.0
Date: Sat, 03 Jun 2023 07:29:54 -0400
From: radicore
To: [email protected]
Subject: Test Message
User-Agent: Roundcube Webmail/1.6.0
Message-ID:
X-Sender: [email protected]
Content-Type: text/plain; charset=US-ASCII; format=flowed
Content-Transfer-Encoding: 7bit
X-From-Rewrite: unmodified, already matched

This is just a test, so ignore it (if you can!)

Tony Marston

I expect it to return just the plain text message, not the entire email.

TonyMarston avatar Jun 03 '23 14:06 TonyMarston

Hi @TonyMarston,

Thanks a lot for reporting this issue. I really appreciate it! However, in order to help you out, it would be great if you could provide an anonymized version of the problematic message. Without that, it's quite tough for me to debug the issue accurately.

If you're using an older version of the library, I recommend updating to the latest version and giving it another shot. There's a chance that the problem might have already been fixed in the newer release.

Once again, thanks for taking the time and effort to make this library better! If you have any more questions or need further assistance, feel free to let me know.

Best regards and happy coding!

Webklex avatar Jun 23 '23 14:06 Webklex

Here is the email in question

[email protected]

TonyMarston avatar Jun 23 '23 17:06 TonyMarston

Hi @TonyMarston, many thanks for the quick follow-up. Unfortunately I'm unable to replicate the behavior (see the referenced commit above).

Best regards and happy coding,

Webklex avatar Jun 23 '23 19:06 Webklex

I am afraid that your unit test is not following the same path through the code as when I run it. I have stepped through the same message with my debugger several times and it is failing to extract the text message from the raw body in exactly the same place. This is the path through the code that I have observed:

  • query.php: $query->getMessageByMsgn()
  • query.php: $query->getMessage()
  • message.php: $message->__construct()
  • message.php: $message->parseBody()
  • message.php: $message->parseRawBody()
  • structure.php: $structure->parse()
  • structure.php: $structure->find_parts()

It is in the find_parts() method that the code is failing to separate the text message from the raw body. Does your unit test follow the same path through the code?

TonyMarston avatar Jun 25 '23 15:06 TonyMarston

Hi @TonyMarston, the test is pretty similar:

  • Message::fromFile
    • Message::fromString
      • Message::parseRawHeader
        • Header::__construct
      • Message::parseRawBody
        • Structure::__construct
          • Structure::parse
            • Structure::findContentType
            • Structure::find_parts
          • Message::fetchStructure
  • Message::getTextBody

Which version are you currently using?

Best regards & happy coding,

Webklex avatar Jun 25 '23 18:06 Webklex

I am using 5.3. I see you have just released version 5.4. I shall install that and try again.

TonyMarston avatar Jun 25 '23 19:06 TonyMarston

I have just tried 5.4 with the same result. When it gets to Part::find_parts the contents of $this->header are not null, so it sets $body = $this->raw, which then becomes $this->content. It is Part::find_parts which is not extracting the test message out of the raw body.

TonyMarston avatar Jun 25 '23 19:06 TonyMarston

I updated the sample to make sure I didn't screw up the initial one, and added a live mailbox test. You could try enabling the debug mode inside your config. It's unlikely, but perhaps this brings some insight. Besides this, I'm out of ideas.

Out of curiosity:

  • Which OS are you using?
  • Can you share some actual code?
  • Which config are you using?
  • Can you replicate the behavior if you are using a different mail hoster?
  • Which hoster are you using or if you are your own hoster, which software are you using?
  • Which php version are you using?
  • Which php modules are enabled?

Best regards,

Webklex avatar Jun 25 '23 20:06 Webklex

I am using Windows 10 on my local PC; I am not running on a remote host. My PHP version is 8.2.7.

I have stepped through with my debugger again and I see that the problem lies in the Structure class. The constructor calls Structure::parse, which in turn calls Structure::find_parts, but this only returns a single part containing the raw body, as it cannot separate the body text from the raw body. This is because the raw body only contains a single Content-Type, which is "text/plain; charset=US-ASCII; format=flowed" (notice that there is no 'multipart'). As there is no boundary, the code cannot use one to extract the message text from the raw body, so it uses the whole of the raw body, which includes the header.
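
For a message that is not multipart there is no boundary to look for. Under RFC 5322 the header block ends at the first empty line, and everything after it is the body. Below is a minimal sketch of that split, assuming the raw source holds headers and body together as in the sample above. splitRawMessage is a hypothetical helper for illustration, not part of php-imap:

```php
<?php

/**
 * Hypothetical helper (not part of php-imap): split a raw message into
 * its header block and its body at the first empty line.
 *
 * @return array{0: string, 1: string} [headers, body]
 */
function splitRawMessage(string $raw): array
{
    // The header block ends at the first empty line; accept CRLF or bare LF.
    $parts = preg_split('/\r?\n\r?\n/', $raw, 2);

    if ($parts === false || count($parts) < 2) {
        // No empty line found: treat the whole input as headers, empty body.
        return [$raw, ''];
    }

    return [$parts[0], $parts[1]];
}

$raw = "Subject: Test Message\r\n"
     . "Content-Type: text/plain; charset=US-ASCII; format=flowed\r\n"
     . "\r\n"
     . "This is just a test, so ignore it (if you can!)\r\n";

[$headers, $body] = splitRawMessage($raw);
echo trim($body), PHP_EOL;
```

Whether a library should apply such a split itself depends on whether it fetched the header and the text separately, so this only illustrates the rule, not where it belongs in the code.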

The script I use to call your library is attached: scan_email_inbox(batch).zip

TonyMarston avatar Jun 26 '23 09:06 TonyMarston

I see, thanks for the code:

  • Have you changed https://github.com/Webklex/php-imap/blob/master/src/config/imap.php#L147 from IMAP::ST_UID to IMAP::ST_MSGN? If you haven't and you do, does the issue persist?
  • If you switch from Query::getMessageByMsgn() to Query::getMessageByUid() does this change anything?
  • If you enable the debug mode, do you see something like this:
>> TAG13 FETCH 1 (RFC822.HEADER)
<< * 1 FETCH (RFC822.HEADER {1047}
...
<< TAG13 OK Fetch completed (0.001 + 0.000 secs).
>> TAG14 FETCH 1 (RFC822.TEXT)
<< * 1 FETCH (FLAGS (\Seen) RFC822.TEXT {65}
...

..or something else? What gets returned after the FETCH (FLAGS or (RFC822.TEXT)? The tags are certainly different as well as the uids / msgns but there should be the message content somewhere in there. How and where it gets returned is the interesting part :)

If you try the following:

$folder->query()->all()->chunked(function($messages, $page) {
    foreach ($messages as $message) {
        /** @var Message $message */
        var_dump([
                 'uid' => $message->uid,
                 'subject' => $message->subject,
                 'text' => $message->getTextBody()
             ]);
    }
}, 10, 1);

..does this change anything?

On a side note: you can use (string)$message->subject instead of $message->subject->get(), or just treat any message attribute as a string / array. Both are supported :)
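
As a rough illustration of how an object can support both styles in PHP (a string cast via __toString and array access via ArrayAccess), here is a sketch using a hypothetical DemoAttribute class. It is a stand-in for illustration only and does not reproduce the library's actual attribute class:

```php
<?php

/** Hypothetical stand-in for a message attribute; not the library's class. */
final class DemoAttribute implements ArrayAccess
{
    /** @param array<int, string> $values */
    public function __construct(private array $values) {}

    // Allows (string)$attribute and echo $attribute.
    public function __toString(): string
    {
        return implode(', ', $this->values);
    }

    // The four ArrayAccess methods allow $attribute[0], isset($attribute[0]), etc.
    public function offsetExists(mixed $offset): bool
    {
        return isset($this->values[$offset]);
    }

    public function offsetGet(mixed $offset): mixed
    {
        return $this->values[$offset] ?? null;
    }

    public function offsetSet(mixed $offset, mixed $value): void
    {
        if ($offset === null) {
            $this->values[] = $value;   // $attribute[] = 'x'
        } else {
            $this->values[$offset] = $value;
        }
    }

    public function offsetUnset(mixed $offset): void
    {
        unset($this->values[$offset]);
    }
}

$subject = new DemoAttribute(['Test Message']);
echo (string)$subject, PHP_EOL; // string style
echo $subject[0], PHP_EOL;      // array style
```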

Unfortunately I can't run tests on Windows, but I have tested it with PHP 8.2.7 as well.

Best regards & happy coding,

Webklex avatar Jun 26 '23 19:06 Webklex

I have tried changing IMAP::ST_UID to IMAP::ST_MSGN but it makes no difference. I have tried switching from Query::getMessageByMsgn() to Query::getMessageByUid() but it makes no difference. I have enabled debug mode but I cannot see any output. I have tried inserting the code you suggested, but getTextBody() still returns the entire raw body and not just the body text.

I can only repeat what I said in an earlier post: I have stepped through with my debugger again and I see that the problem lies in the Structure class. The constructor calls Structure::parse, which in turn calls Structure::find_parts, but this only returns a single part containing the raw body, as it cannot separate the body text from the raw body. This is because the raw body only contains a single Content-Type, which is "text/plain; charset=US-ASCII; format=flowed" (notice that there is no 'multipart'). As there is no boundary, the code cannot use one to extract the message text from the raw body, so it uses the whole of the raw body, which includes the header.

In this particular email the code is incapable of separating out the text body from the raw body as it cannot identify a usable boundary.

TonyMarston avatar Jun 27 '23 11:06 TonyMarston

I have searched through your code and cannot find anywhere that it extracts text which starts with 'Content-Type: text/plain' and which, because it is not 'multipart', does not have a boundary. I have fixed this myself by amending the contents of the findParts() method inside file structure.php (see attached zip file): Structure.zip
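
One way to approach the case described here is to inspect the Content-Type first and only look for a boundary when the type is multipart. Otherwise there is nothing to split on and the text after the header block is the whole body. The following is a rough sketch of that check only. It is not the change in the attached Structure.zip (which is not shown here) and not the library's own code, and findBoundary is a hypothetical name:

```php
<?php

/**
 * Hypothetical helper: return the MIME boundary for a multipart
 * Content-Type value, or null when the type is not multipart or
 * no boundary parameter is present.
 */
function findBoundary(string $contentType): ?string
{
    // Only multipart/* types carry a boundary parameter.
    if (stripos(ltrim($contentType), 'multipart/') !== 0) {
        return null;
    }

    // The boundary value may be quoted or bare.
    if (preg_match('/boundary\s*=\s*(?:"([^"]+)"|([^\s;]+))/i', $contentType, $m)) {
        return $m[1] !== '' ? $m[1] : $m[2];
    }

    return null;
}

// Single-part message, as in this issue: no boundary to use.
var_dump(findBoundary('text/plain; charset=US-ASCII; format=flowed'));

// Multipart message: the boundary is available for splitting the parts.
var_dump(findBoundary('multipart/alternative; boundary="=_abc123"'));
```

When the helper returns null, a caller would treat the message as a single part rather than searching for part delimiters.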

TonyMarston avatar Jul 03 '23 16:07 TonyMarston