php-imap
$message->getTextBody() retrieves whole source, not just the plain text message
I have a message containing plain text, no html and no attachments, but when I use ->getTextBody() it returns the entire source code and not just the message text. The source code is as follows:
This is just a test, so ignore it (if you can!)
Tony Marston
Hi @TonyMarston,
Thanks a lot for reporting this issue. I really appreciate it! However, in order to help you out, it would be great if you could provide an anonymized version of the problematic message. Without that, it's quite tough for me to debug the issue accurately.
If you're using an older version of the library, I recommend updating to the latest version and giving it another shot. There's a chance that the problem might have already been fixed in the newer release.
Once again, thanks for taking the time and effort to make this library better! If you have any more questions or need further assistance, feel free to let me know.
Best regards and happy coding!
Hi @TonyMarston, many thanks for the quick follow-up. Unfortunately I'm unable to replicate the behavior (see the referenced commit above).
Best regards and happy coding,
I am afraid that your unit test is not following the same path through the code as when I run it. I have stepped through the same message with my debugger several times and it is failing to extract the text message from the raw body in exactly the same place. This is the path through the code that I have observed:
- query.php: $query->getMessageByMsgn()
- query.php: $query->getMessage()
- message.php: $message->__construct()
- message.php: $message->parseBody()
- message.php: $message->parseRawBody()
- structure.php: $structure->parse()
- structure.php: $structure->find_parts()
It is in the find_parts() method that the code is failing to separate the text message from the raw body. Does your unit test follow the same path through the code?
Hi @TonyMarston, the test follows a pretty similar path:
- Message::fromFile
- Message::fromString
- Message::parseRawHeader
- Header::__construct
- Message::parseRawBody
- Structure::__construct
- Structure::parse
- Structure::findContentType
- Structure::find_parts
- Message::fetchStructure
- Message::getTextBody
Which version are you currently using?
Best regards & happy coding,
I am using 5.3. I see you have just released version 5.4. I shall install that and try again.
I have just tried 5.4 with the same result. When it gets to Part::find_parts the contents of $this->header are not null, so it sets $body = $this->raw, which then becomes $this->content. It is Part::find_parts which is not extracting the test message out of the raw body.
I updated the sample, to make sure I didn't screw up the initial one, and added a live mailbox test. You could try enabling the debug mode inside your config; it is unlikely, but perhaps it brings some insight. Besides this I'm out of ideas.
Out of curiosity:
- Which OS are you using?
- Can you share some actual code?
- Which config are you using?
- Can you replicate the behavior if you are using a different mail hoster?
- Which hoster are you using or if you are your own hoster, which software are you using?
- Which php version are you using?
- Which php modules are enabled?
Best regards,
I am using Windows 10 on my local PC, I am not running on a remote host. My PHP version is 8.2.7
I have stepped through with my debugger again and I see that the problem lies in the Structure class. The constructor calls Structure::parse, which in turn calls Structure::find_parts, but this only returns a single part which contains the raw body, as it cannot separate the body text from the raw body. This is because the raw body only contains a single Content-Type, which is "text/plain; charset=US-ASCII; format=flowed" (notice that there is no 'multipart'), and as there is no boundary the code cannot use one to extract the message text from the raw body, so it uses the whole of the raw body, which includes the header.
The script I use to call your library is attached. scan_email_inbox(batch).zip
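The check described above, whether a Content-Type declares a multipart boundary at all, can be sketched in a few lines. This is only an illustration under stated assumptions: `findBoundary()` is a hypothetical helper, not a function of php-imap, and the second header value is a made-up sample.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of php-imap): returns the MIME boundary
 * declared in a Content-Type header value, or null when there is none.
 */
function findBoundary(string $contentType): ?string
{
    // Only multipart/* types carry a boundary parameter.
    if (stripos(trim($contentType), 'multipart/') !== 0) {
        return null;
    }
    // The boundary value may be quoted or unquoted.
    if (preg_match('/boundary\s*=\s*(?:"([^"]+)"|([^;\s]+))/i', $contentType, $m)) {
        return $m[1] !== '' ? $m[1] : $m[2];
    }
    return null;
}

// The Content-Type quoted in this thread has no boundary.
var_dump(findBoundary('text/plain; charset=US-ASCII; format=flowed')); // NULL
// Made-up multipart sample for comparison.
var_dump(findBoundary('multipart/alternative; boundary="abc123"'));   // string(6) "abc123"
```

For a non-multipart message the result is null, so any parser that relies solely on a boundary has nothing to split on.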
I see, thanks for the code:
- Have you changed https://github.com/Webklex/php-imap/blob/master/src/config/imap.php#L147 from IMAP::ST_UID to IMAP::ST_MSGN? If you haven't and you do, does the issue persist?
- If you switch from Query::getMessageByMsgn() to Query::getMessageByUid(), does this change anything?
- If you enable the debug mode, do you see something like this:
>> TAG13 FETCH 1 (RFC822.HEADER)
<< * 1 FETCH (RFC822.HEADER {1047}
...
<< TAG13 OK Fetch completed (0.001 + 0.000 secs).
>> TAG14 FETCH 1 (RFC822.TEXT)
<< * 1 FETCH (FLAGS (\Seen) RFC822.TEXT {65}
...
..or something else? What gets returned after the FETCH (FLAGS or (RFC822.TEXT)?
The tags are certainly different as well as the uids / msgns but there should be the message content somewhere in there. How and where it gets returned is the interesting part :)
If you try the following:
$folder->query()->all()->chunked(function($messages, $page) {
    foreach ($messages as $message) {
        /** @var Message $message */
        var_dump([
            'uid' => $message->uid,
            'subject' => $message->subject,
            'text' => $message->getTextBody()
        ]);
    }
}, 10, 1);
..does this change anything?
On a side note: you can use (string)$message->subject instead of $message->subject->get(), or just treat any message attribute as a string / array. Both are supported :)
Unfortunately I can't run tests on Windows, but I have tested it with PHP 8.2.7 as well.
Best regards & happy coding,
I have tried changing IMAP::ST_UID to IMAP::ST_MSGN but it makes no difference. I have tried switching from Query::getMessageByMsgn() to Query::getMessageByUid() but it makes no difference. I have enabled debug mode but I cannot see any output. I have tried inserting the code you suggested, but getTextBody() still returns the entire raw body and not just the body text.
I can only repeat what I said in an earlier post: I have stepped through with my debugger again and I see that the problem lies in the Structure class. The constructor calls Structure::parse, which in turn calls Structure::find_parts, but this only returns a single part which contains the raw body, as it cannot separate the body text from the raw body. This is because the raw body only contains a single Content-Type, which is "text/plain; charset=US-ASCII; format=flowed" (notice that there is no 'multipart'), and as there is no boundary the code cannot use one to extract the message text from the raw body, so it uses the whole of the raw body, which includes the header.
In this particular email the code is incapable of separating out the text body from the raw body as it cannot identify a usable boundary.
I have searched through your code and cannot find anywhere that extracts text which starts with 'Content-Type: text/plain' and which, because it is not 'multipart', has no boundary. I have fixed this myself by amending the contents of the find_parts() method inside file structure.php (see attached zip file): Structure.zip
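For readers without the attachment, one generic way to handle a non-multipart part is to split the raw text at the first empty line, which is how RFC 5322 separates the header section from the body. The sketch below is not the content of Structure.zip and not php-imap code: `splitHeaderAndBody()` is a hypothetical helper, and the raw string is a made-up sample built from the Content-Type and body text quoted in this thread.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of php-imap): splits a raw single-part
 * message into its header block and body at the first empty line.
 *
 * @return array{0: string, 1: string} [headers, body]
 */
function splitHeaderAndBody(string $raw): array
{
    // Accept both CRLF and bare LF line endings.
    $parts = preg_split('/\r?\n\r?\n/', $raw, 2);
    if ($parts === false || count($parts) < 2) {
        // Assumption: with no blank line, treat everything as body.
        return ['', $raw];
    }
    return [$parts[0], $parts[1]];
}

// Made-up sample, not the actual message from this issue.
$raw = "Content-Type: text/plain; charset=US-ASCII; format=flowed\r\n"
     . "\r\n"
     . "This is just a test, so ignore it (if you can!)\r\n";

[$headers, $body] = splitHeaderAndBody($raw);
echo trim($body), PHP_EOL;
```

With this split, only the text after the blank line would be used as the plain-text body, and the header lines would stay out of it.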