robotframework-imaplibrary

How to extract PDF from a multipart email

Open winko opened this issue 11 years ago • 5 comments

Hello! Is it possible to read the part of a multipart email that has type application/pdf and write it to a file?

I have already tried with

    \    ${content-type}=    Get Multipart Content Type
    \    ${payload}=    Get Multipart Payload
    \    Run Keyword If    '${content-type}' == 'application/pdf'    Create File    blablabla_NotDecoded.pdf    ${payload}

but the generated file could not be read by Adobe Reader.

And also

    \    ${content-type}=    Get Multipart Content Type
    \    ${payload}=    Get Multipart Payload    decode=True
    \    Run Keyword If    '${content-type}' == 'application/pdf'    Create File    blablabla_Decoded.pdf    ${payload}

didn't work since RF said: "UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 11: ordinal not in range(128)"

Regards

winko avatar Feb 26 '14 14:02 winko

@winko, I added the multipart code for HTML - have never tested with other content types.

Just curious, what is the charset of the mime part containing the pdf? In my testing, emails have charset "UTF-8", so it's surprising that RF used 'ascii'.

If you know some Python, the quickest way to debug this is to write a short script that uses the imaplib and email modules to retrieve and decode the email. Here's a snippet to get you started, but you'll have to hack it until it works.

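    import imaplib
    import email

    # Fill in your own server, credentials and search criteria.
    server = 'imap.gmail.com'
    user = ''
    pw = ''
    crit = ['FROM', '', 'TO', '', 'UNSEEN']
    pdf_index = 1  # position of the PDF part within the multipart message

    imap = imaplib.IMAP4_SSL(server, 993)
    imap.login(user, pw)
    imap.select(readonly=True)

    # Fetch the raw RFC 822 source of the newest matching message.
    typ, msgnums = imap.search(None, *crit)
    data = imap.fetch(msgnums[0].split()[-1], '(RFC822)')[1][0][1]
    imap.close()
    imap.logout()

    # Parse the raw bytes directly (Python 3); on Python 2 use email.message_from_string(data).
    msg = email.message_from_bytes(data)
    pdf_part = msg.get_payload()[pdf_index]
    pdf = pdf_part.get_payload(decode=True)  # undoes the transfer encoding, e.g. base64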
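If you want to try the decoding step without a mailbox, here is a rough sketch that builds a small multipart message in memory and pulls the PDF part back out. It assumes Python 3. The helper names `extract_pdf_parts` and `build_sample_message` are made up for this example, and the sample bytes are only a PDF-looking header, not a real document. It also selects the part by content type instead of a fixed index, which may or may not suit your emails.

```python
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def extract_pdf_parts(raw_message):
    """Return the decoded bytes of every application/pdf part (hypothetical helper)."""
    msg = message_from_bytes(raw_message)
    pdfs = []
    # walk() visits the message itself and every nested part.
    for part in msg.walk():
        if part.get_content_type() == "application/pdf":
            # decode=True undoes the Content-Transfer-Encoding and returns bytes.
            pdfs.append(part.get_payload(decode=True))
    return pdfs


def build_sample_message(pdf_bytes):
    """Build a raw multipart message with one text part and one PDF part (hypothetical helper)."""
    outer = MIMEMultipart()
    outer["Subject"] = "sample"
    outer.attach(MIMEText("see attachment", "plain"))
    # MIMEApplication base64-encodes the payload by default.
    outer.attach(MIMEApplication(pdf_bytes, _subtype="pdf"))
    return outer.as_bytes()


# Not a real PDF, just bytes that include non-ASCII values like the ones in the error.
sample_pdf = b"%PDF-1.4\n%\x83\x92\xfa\xfe\n1 0 obj\n"
extracted = extract_pdf_parts(build_sample_message(sample_pdf))
```

If the round trip works, `extracted` holds the original bytes unchanged, which would suggest that decoding is fine and any corruption happens later, when the data is written out.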

martinhill avatar Feb 26 '14 17:02 martinhill

@martinhill could you implement a keyword for extracting the pdf data and send me a pull request please?

memcmp avatar Feb 28 '14 08:02 memcmp

@bogensberger I tried extracting pdf data in Python and I think the code should work fine as is, using decode=True. I suspect the problem is to do with Create File.

I will send you a pull request for a different issue, though. I found a problem with Gmail when the email arrives after the Open Mailbox keyword has been executed.

@winko can you run your second case with debugging on and attach the log?

martinhill avatar Feb 28 '14 14:02 martinhill

Thanks for your investigations!

Here comes my debug log:

20140228 17:05:02.201 - INFO - +----- START KW: ${content-type} = ImapLibrary.Get Multipart Content Type [ ]
20140228 17:05:02.201 - INFO - ${content-type} = application/pdf
20140228 17:05:02.201 - INFO - +----- END KW: ${content-type} = ImapLibrary.Get Multipart Content Type (0)
20140228 17:05:02.202 - INFO - +----- START KW: ${payload} = ImapLibrary.Get Multipart Payload [ decode=True ]
20140228 17:05:02.297 - INFO - ${payload} = %PDF-1.4
%\x83\x92\xfa\xfe
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
3 0 obj
<<
/CreationDate (D:20030618124400)
/Author (\xfe\xff K a t j a   M \xf6 l l e r)
/Keywords ()
/Subject...
20140228 17:05:02.298 - INFO - +----- END KW: ${payload} = ImapLibrary.Get Multipart Payload (1)
20140228 17:05:02.298 - INFO - +----- START KW: OperatingSystem.Create File [ blablabla.pdf | ${payload} ]
20140228 17:05:02.317 - FAIL - UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 11: ordinal not in range(128)
20140228 17:05:02.318 - DEBUG - Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\robot\libraries\OperatingSystem.py", line 611, in create_file
    path = self._write_to_file(path, content, encoding, 'w')
  File "C:\Python27\lib\site-packages\robot\libraries\OperatingSystem.py", line 630, in _write_to_file
    f.write(content.encode(encoding))
20140228 17:05:02.318 - INFO - +----- END KW: OperatingSystem.Create File (20)

winko avatar Feb 28 '14 16:02 winko

@winko It seems robot framework can't always handle binary data such as PDF. Looking at the robot package libraries/OperatingSystem.py I found:

    def _write_to_file(self, path, content, encoding, mode):
        path = self._absnorm(path)
        parent = os.path.dirname(path)
        if not os.path.exists(parent):
            os.makedirs(parent)
        f = open(path, mode+'b')
        try:
            f.write(content.encode(encoding))
        finally:
            f.close()
        return path

I tested out writing pdf to file like this:

    f.write(pdf)

...and it worked. The PDF is binary data, so it should not be encoded the way the Create File keyword encodes its content. I can only suggest writing your own keyword to save your PDF data. All you need to do is clone the code from OperatingSystem.py and remove the encoding step. Alternatively, find a way to write your test that doesn't save the PDF to a file.
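Such a keyword could look roughly like the sketch below. The class name `BinaryFileKeywords` and the method name are invented for this example and are not part of Robot Framework or ImapLibrary. The sketch also assumes that `Get Multipart Payload` with `decode=True` hands the keyword a byte string, which is what the log above suggests but which I have not verified for every setup.

```python
import os


class BinaryFileKeywords(object):
    """Hypothetical Robot Framework keyword library for saving binary payloads."""

    def create_binary_file_from_payload(self, path, content):
        """Write `content` to `path` unchanged and return the absolute path."""
        path = os.path.abspath(os.path.normpath(path))
        parent = os.path.dirname(path)
        if not os.path.exists(parent):
            os.makedirs(parent)
        # Refuse text so that nothing gets silently encoded on the way out.
        if not isinstance(content, (bytes, bytearray)):
            raise TypeError("content must be a byte string, got %s"
                            % type(content).__name__)
        # Binary mode and no .encode() call: this is the step Create File gets wrong for PDFs.
        with open(path, "wb") as f:
            f.write(content)
        return path
```

Saved as `BinaryFileKeywords.py` next to the test suite, it could be imported with `Library    BinaryFileKeywords.py` and called as `Create Binary File From Payload    blablabla.pdf    ${payload}`. Later Robot Framework releases also added a `Create Binary File` keyword to the OperatingSystem library, so a custom keyword may not be needed there.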

Martin

martinhill avatar Mar 02 '14 04:03 martinhill