email2pdf
Unicode enhancement
The main part of this pull request enhances the Unicode support. The changes mainly consist of using the correct .encode() and .decode() calls. You might be surprised by the function _decodetxt(), which I have taken from my own project. The main reason is that payload.get_payload(decode=True) simply does not return Unicode. It's an "it's not a bug, it's a feature" thing: the maintainer of the Python email module believes it has to be handled this way for backwards compatibility. So "decode=True" is switched off, and the _decodetxt() function returns UTF-8.
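For readers following along, here is a minimal sketch of the general idea, not the _decodetxt() function from the pull request. It assumes the goal is to turn the bytes from get_payload(decode=True) into a str using the charset declared on the part. The helper name decode_text_payload and the latin-1 fallback are assumptions made for illustration.

```python
from email import message_from_bytes
from email.message import Message


def decode_text_payload(part: Message, fallback: str = "latin-1") -> str:
    """Return the body of a non-multipart part as str (hypothetical helper)."""
    # decode=True undoes base64/quoted-printable and yields bytes, not str.
    raw = part.get_payload(decode=True)
    if raw is None:  # multipart containers have no payload of their own
        return ""
    charset = part.get_content_charset() or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: decode leniently rather than crash.
        return raw.decode(fallback, errors="replace")


# Hand-written sample message, not taken from the thread.
raw_mail = (
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Gr=FC=DFe\r\n"
)
msg = message_from_bytes(raw_mail)
text = decode_text_payload(msg)
```

The fallback is deliberately lenient: latin-1 maps every byte to a character, so the second decode cannot raise, at the cost of possibly showing the wrong characters.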
The other, smaller change is an "--overwrite" command line option, so that email2pdf does not stop when the output file already exists.
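As a rough illustration of what such a flag could look like, here is a sketch using argparse. It is not the code from the pull request; the --output-file option name and the helper check_output_path are assumptions chosen for the example.

```python
import argparse
import os
import tempfile


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="sketch of an --overwrite flag")
    parser.add_argument("--output-file", required=True)  # assumed option name
    parser.add_argument("--overwrite", action="store_true",
                        help="replace the output file if it already exists")
    return parser


def check_output_path(path: str, overwrite: bool) -> None:
    """Raise unless the path is free or overwriting was requested."""
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(f"{path} already exists; pass --overwrite to replace it")


with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "out.pdf")
    open(existing, "wb").close()  # simulate a PDF left over from an earlier run

    args = build_parser().parse_args(["--output-file", existing, "--overwrite"])
    check_output_path(args.output_file, args.overwrite)  # accepted: flag given

    try:
        check_output_path(existing, overwrite=False)
        refused = False
    except FileExistsError:
        refused = True  # without the flag the existing file is protected
```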
Last but not least, I had to remove the wkhtmltopdf options '--load-error-handling' and '--load-media-error-handling' in line 351. These do not exist in the wkhtmltopdf shipped with Ubuntu 14.04.
Horst,
Thanks for your pull request. A few thoughts:
- Regarding the --overwrite option: great idea. I've never needed it so hadn't added one, but I can see that it would be useful to some folk besides you. Once I've cleaned up the commit a bit and added some unit tests for it, I'll merge this into email2pdf.
- Regarding the removal of --load-error-handling and --load-media-error-handling: I don't think I'd like to remove these; they are crucial to working around issues with some types of broken emails, for example ones with broken embedded images. However, I have realised that I should be modelling those with unit tests, which I'm not (my unit test suite still passes despite your removal of these flags). I've opened issue #91 to represent that. I would suggest that you install the latest version directly from the wkhtmltopdf website as per the install instructions to avoid the error due to missing flags. It's not ideal, I know, but the version in 14.04 is very old.
- As far as the Unicode support goes: this is the one that confuses me. You say payload.get_payload(decode=True) doesn't return Unicode, but it's been my experience that it does. The Python documentation seems to concur; it says here that it returns a string when is_multipart is False, and I think all Python strings in 3+ are Unicode. Could you please help me by being a bit more specific about what your _decodetxt() function is working around? I can't figure out from your code what you are doing. Do you have any references to the problem on the web? I think the best way to illustrate the issue would be to create a failing test which your code fixes; can you help me figure out how I would do that?
Thanks for your interest - much appreciated!
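One possible shape for such a test is sketched below. It is not from the email2pdf test suite; the sample message and the class name PayloadTypeTest are made up for illustration. It assumes Python 3, where get_payload(decode=True) on a non-multipart message returns bytes, so any code that treats the result as UTF-8 text breaks on a message labelled iso-8859-1.

```python
import unittest
from email import message_from_bytes

# Hand-written sample: an 8bit body containing the byte 0xfc ("ü" in iso-8859-1).
RAW = (
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"M\xfcller\r\n"
)


class PayloadTypeTest(unittest.TestCase):
    def test_decoded_payload_is_bytes(self):
        payload = message_from_bytes(RAW).get_payload(decode=True)
        self.assertIsInstance(payload, bytes)

    def test_assuming_utf8_fails(self):
        payload = message_from_bytes(RAW).get_payload(decode=True)
        with self.assertRaises(UnicodeDecodeError):
            payload.decode("utf-8")

    def test_declared_charset_gives_text(self):
        msg = message_from_bytes(RAW)
        payload = msg.get_payload(decode=True)
        text = payload.decode(msg.get_content_charset())
        self.assertEqual(text.strip(), "Müller")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(PayloadTypeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```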
On 20.09.2015 at 20:32, Andrew Ferrier wrote:
Regarding the removal of the |--load-error-handling|: I don't think I'd like to remove these; they are crucial to working around issues with some types of broken emails, for example ones with broken embedded images. However, I have realised that I should be modelling those with unit tests, which I'm not (my unit test suite still passes despite your removal of these flags). I've opened issue #91 <https://github.com/andrewferrier/email2pdf/issues/91> to represent that. I would suggest that you install the latest version directly from the wkhtmltopdf website as per the install instructions <https://github.com/andrewferrier/email2pdf#debianubuntu> to avoid that issue. It's not ideal, I know, but the version in 14.04 is very old.
Yes, I agree. I'm writing an encrypting email gateway, where I am currently adding encrypted PDF emails. After I made the pull request I installed my software on the production server (also Ubuntu 14.04). Unfortunately the Ubuntu version of wkhtmltopdf needs an installed X server, which is unacceptable on a server. So I had to change to a newer wkhtmltopdf package, and |--load-error-handling| reappeared.
As far as the Unicode support goes: this is the one that confuses me. You say |payload.get_payload(decode=True)| doesn't return Unicode, but it's been my experience that it does. The Python documentation seems to concur; it says here <https://docs.python.org/3/library/email.message.html#email.message.Message.get_payload> that it returns a string when |is_multipart| is False, and I think all Python strings in 3+ are Unicode. Could you please help me by being a bit more specific about what your _decodetxt() function is working around? I can't figure out from your code what you are doing. Do you have any references to the problem on the web? I think the best way to illustrate the issue would be to create a failing test which your code fixes; can you help me figure out how I would do that?
This is a really difficult thing and it cost me a lot of time. See http://bugs.python.org/issue18271 for more information. The code in _decodetxt is more or less the original decode function, except that it ensures that it always delivers Unicode.
The topic is also mentioned here: https://github.com/andrewferrier/email2pdf/issues/34
Now one of the .eml files I ran the script on also produced:
Traceback (most recent call last):
  File "/usr/bin/email2pdf", line 733, in call_main
    (warning_pending, mostly_hide_warnings) = main(argv, syslog_handler, syserr_handler)
  File "/usr/bin/email2pdf", line 109, in main
    input_data = get_input_data(args)
  File "/usr/bin/email2pdf", line 261, in get_input_data
    data = input_handle.read()
  File "/home/user1/.virtualenvs/email2pdf_env/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 755: invalid start byte
and I'm trying to work out the most efficient way to fix it.
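The traceback suggests the file is being read in text mode with a single codec (UTF-8 here), which fails as soon as the file contains a raw non-UTF-8 byte such as 0xfc. One way around this, sketched below under that assumption, is to read the file as bytes and let the email package apply each part's declared charset. This is not a patch for email2pdf; the function name read_message and the sample message are made up for illustration.

```python
import email
import email.policy
import os
import tempfile

# Hand-written sample with a raw 0xfc byte in an 8bit, windows-1252 body.
RAW = (
    b"Subject: test\r\n"
    b"Content-Type: text/plain; charset=windows-1252\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Gr\xfc\xdfe\r\n"
)


def read_message(path):
    """Parse an .eml file from bytes so no codec is forced on the whole file."""
    with open(path, "rb") as handle:
        return email.message_from_binary_file(handle, policy=email.policy.default)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.eml")
    with open(path, "wb") as handle:
        handle.write(RAW)

    # Reading the whole file as UTF-8 text fails on the 0xfc byte.
    try:
        with open(path, encoding="utf-8") as handle:
            handle.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Parsing from bytes defers decoding to the part's own charset.
    message = read_message(path)
    body = message.get_content()
```

With this approach a single command-line encoding option no longer has to fit every part of every message, since each part is decoded according to its own Content-Type header.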
I ran email2pdf with a bash script on a collection of files in a directory and noticed that some of the resulting PDFs had correct Unicode while others were garbled. Then I added --encoding to the command, and depending on its value, the files that came out right and the files that came out wrong swapped around.
The contents of the directory looks like this:
$ ls
1.eml 2.eml 2.pdf 3.eml 4.eml 5.eml 6.eml
So digging deeper:
$ grep -r charset .
./2.eml: charset="iso-8859-1"
./6.eml:Content-Type: text/plain; charset=windows-1252; format=flowed
./1.eml: charset="iso-8859-1"
./1.eml: charset="iso-8859-1"
./1.eml:charset=3Diso-8859-1">
./5.eml:Content-Type: text/plain; charset=windows-1252; format=flowed
./3.eml:Content-Type: text/plain; charset=UTF-8
./4.eml:Content-Type: text/plain; charset=windows-1252; format=flowed
What a mess. Does this script read these values? I think Thunderbird does when you print to PDF. Now I'm even wondering whether it is wkhtmltopdf, not email2pdf, that is not reading the charset.
Actually, I tried running the script on the .eml files individually, first grepping the charset out of each .eml file, and that still did not help! Printing to a PDF file from Thunderbird does work, and who knows what magic Thunderbird is doing to find out the encoding.
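To check what a message actually declares without grep, the standard library can walk the MIME parts and report each part's charset. The sketch below only reads the declared values; it says nothing about what email2pdf or Thunderbird do internally. The function name describe_parts and the sample message are made up for illustration.

```python
from email import message_from_bytes


def describe_parts(raw: bytes):
    """Return (content type, declared charset) for every leaf MIME part."""
    message = message_from_bytes(raw)
    rows = []
    for part in message.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        # get_content_charset() returns None when no charset parameter is present.
        rows.append((part.get_content_type(), part.get_content_charset()))
    return rows


# Hand-written sample: one message whose two parts declare different charsets.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b"\r\n'
    b"\r\n"
    b"--b\r\n"
    b"Content-Type: text/plain; charset=windows-1252; format=flowed\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Gr\xfc\xdfe\r\n"
    b"--b\r\n"
    b'Content-Type: text/html; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<p>Gr=FC=DFe</p>\r\n"
    b"--b--\r\n"
)
rows = describe_parts(SAMPLE)
```

Because the charset is a property of each part rather than of the file, a single value grepped from the file and applied to the whole message may not match every part.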
https://stackoverflow.com/questions/2281646/whats-the-difference-between-encoding-and-charset
There is even more to it:
https://stackoverflow.com/questions/39235436/python-auto-detect-email-content-encoding
$ grep -r Content-Transfer-Encoding .
./2.eml:Content-Transfer-Encoding: quoted-printable
./6.eml:Content-Transfer-Encoding: 8bit
./1.eml:Content-Transfer-Encoding: quoted-printable
./1.eml:Content-Transfer-Encoding: quoted-printable
./5.eml:Content-Transfer-Encoding: 8bit
./3.eml:Content-Transfer-Encoding: quoted-printable
./4.eml:Content-Transfer-Encoding: 8bit
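The two grep results describe two separate layers: Content-Transfer-Encoding says how the bytes were made safe for transport, and charset says how those bytes map to characters. A small sketch with the standard quopri module shows the two steps in order. The sample strings are made up, except for the charset=3Diso-8859-1"> fragment, which is the quoted-printable form of an HTML attribute seen in the grep output above.

```python
import quopri

# Layer 1: undo the transfer encoding (quoted-printable) to get the raw bytes.
transfer_decoded = quopri.decodestring(b"M=FCller")

# Layer 2: interpret those bytes using the declared charset.
text = transfer_decoded.decode("iso-8859-1")

# "=3D" is just a quoted-printable "=", so this grep hit is an HTML attribute
# inside an encoded body, not a MIME header.
meta = quopri.decodestring(b'charset=3Diso-8859-1">')

# With "8bit" there is no transfer encoding to undo; only the charset step applies.
eightbit_text = b"M\xfcller".decode("windows-1252")
```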
I give up for now and urgently need to get the task done, so I will use Thunderbird and its GUI manually for all the files, but maybe one day someone will post a solution or fix. I don't envy @andrewferrier's task of working out all these encodings, charsets and Content-Transfer-Encodings.
Apart from the unicode issue the script seems to work perfectly.