email2pdf

Unicode enhancement

Open gpgmailencrypt opened this issue 9 years ago • 3 comments

The main part of this pull request enhances the Unicode support. The changes mainly consist of using the correct .encode() and .decode() calls. You might be surprised by the function _decodetxt(), which I have taken from my own project. The main reason is that payload.get_payload(decode=True) simply does not return Unicode. That's the "it's not a bug, it's a feature" thing: the maintainer of the Python email module believes it has to be handled this way for backwards compatibility. So "decode=True" is switched off and the _decodetxt() function returns UTF-8 instead.
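For readers unfamiliar with the behaviour being described, here is a minimal sketch of the general idea. It is not the _decodetxt() function from the pull request; the helper name decode_text_part and the latin-1 fallback are assumptions made for illustration only.

```python
from email import message_from_bytes
from email.message import Message


def decode_text_part(part: Message, fallback: str = "latin-1") -> str:
    """Return the body of a non-multipart MIME part as str.

    Hypothetical helper for illustration; not the pull request's code.
    """
    # decode=True undoes the Content-Transfer-Encoding (base64,
    # quoted-printable) and hands back raw bytes, not text.
    raw = part.get_payload(decode=True)
    if raw is None:  # multipart containers have no payload of their own
        return ""
    # Use the charset declared in the Content-Type header if there is one.
    charset = part.get_content_charset() or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: fall back rather than crash.
        # The result may contain wrong characters, but it is always str.
        return raw.decode(fallback, errors="replace")


sample = (
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Gr=FC=DFe\r\n"
)
text = decode_text_part(message_from_bytes(sample))
```

The fallback is a design choice, not a requirement: latin-1 maps every byte to a character, so the helper never raises, at the cost of possibly showing wrong characters for mislabelled mail.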

The other, smaller change is an "--overwrite" command-line option, so that email2pdf does not stop when the output file already exists.
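As a rough illustration of how such a flag could be wired up, the sketch below uses argparse. Only the --overwrite name comes from this pull request; the --output-file option, the function names and the error message are assumptions, and this is not the code that was submitted.

```python
import argparse
import os
import tempfile


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, cut-down parser; not email2pdf's real option set.
    parser = argparse.ArgumentParser(prog="email2pdf-sketch")
    parser.add_argument("-o", "--output-file", required=True)
    parser.add_argument(
        "--overwrite",
        action="store_true",
        help="replace the output file if it already exists",
    )
    return parser


def check_output(path: str, overwrite: bool) -> None:
    """Refuse to clobber an existing file unless overwrite is set."""
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(
            f"Output file {path} already exists; pass --overwrite to replace it"
        )


with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "out.pdf")
    open(existing, "wb").close()  # simulate a previous run's output
    args = build_parser().parse_args(
        ["--output-file", existing, "--overwrite"]
    )
    check_output(args.output_file, args.overwrite)  # allowed: flag is set
```

Using action="store_true" keeps the default behaviour (stop when the file exists) unchanged for anyone who does not pass the flag.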

Last but not least, I had to remove the wkhtmltopdf options '--load-error-handling' and '--load-media-error-handling' in line 351. These do not exist in Ubuntu 14.04.

gpgmailencrypt, Sep 20 '15 14:09

Horst,

Thanks for your pull request. A few thoughts:

  • Regarding the --overwrite option: great idea, I've never needed it so hadn't added one, but I can see that would be useful to some folk besides you. Once I've cleaned up the commit a bit and added some unit tests for it, I'll merge this into email2pdf.
  • Regarding the removal of --load-error-handling and --load-media-error-handling: I don't think I'd like to remove these; they are crucial to working around issues with some types of broken emails, for example ones with broken embedded images. However, I have realised that I should be modelling those with unit tests, which I'm not (my unit test suite still passes despite your removal of these flags). I've opened issue #91 to represent that. I would suggest that you install the latest version directly from the wkhtmltopdf website as per the install instructions to avoid the error due to missing flags. It's not ideal, I know, but the version in 14.04 is very old.
  • As far as the Unicode support goes, this is the one that confuses me. You say payload.get_payload(decode=True) doesn't return Unicode, but it's been my experience that it does. The Python documentation seems to concur; it says that it returns a string when is_multipart is False, and I think all Python strings in 3+ are Unicode. Could you please help me by being a bit more specific about what your _decodetxt() function is working around? I can't figure out from your code what you are doing. Do you have any references to the problem on the web? I think the best way to illustrate the issue would be to create a failing test which your code fixes; can you help me figure out how I would do that?
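A test along the lines requested might start from something like the sketch below. It only pins down what the standard library does with a small hand-made message (the RAW bytes and the test names are made up for illustration); it does not exercise email2pdf itself, so it is a starting point rather than the failing test for this pull request.

```python
import unittest
from email import message_from_bytes

# Hand-made single-part message with two non-ASCII bytes in the body.
RAW = (
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Gr\xfc\xdfe\r\n"
)


class PayloadTypeTest(unittest.TestCase):
    def test_decode_true_returns_bytes(self):
        msg = message_from_bytes(RAW)
        payload = msg.get_payload(decode=True)
        # With decode=True the payload comes back as bytes, not str.
        self.assertIsInstance(payload, bytes)
        self.assertIn(b"\xfc", payload)

    def test_declared_charset_recovers_text(self):
        msg = message_from_bytes(RAW)
        payload = msg.get_payload(decode=True)
        # Decoding with the charset from the header yields the text.
        text = payload.decode(msg.get_content_charset())
        self.assertEqual(text.strip(), "Grüße")
```

Saved as a module, this can be run with python -m unittest; both tests pass on Python 3, which shows the bytes-versus-str distinction the two sides of the discussion are describing.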

Thanks for your interest - much appreciated!

andrewferrier, Sep 20 '15 18:09

On 20.09.2015 at 20:32, Andrew Ferrier wrote:

Regarding the |--overwrite| option: great idea, I've never needed
it so hadn't added one, but I can see that would be useful to some
folk besides you. Once I've cleaned up the commit a bit and added
some unit tests for it, I'll merge this into email2pdf.
Regarding the removal of the |--load-error-handling|: I don't
think I'd like to remove these; they are crucial to working around
issues with some types of broken emails, for example ones with
broken embedded images. However, I have realised that I should be
modelling those with unit tests, which I'm not (my unit test suite
still passes despite your removal of these flags). I've opened
issue #91 <https://github.com/andrewferrier/email2pdf/issues/91>
to represent that. I would suggest that you install the latest
version directly from the wkhtmltopdf website as per the install
instructions
<https://github.com/andrewferrier/email2pdf#debianubuntu> to avoid
that issue. It's not ideal, I know, but the version in 14.04 is
very old.

Yes, I agree. I'm writing an encrypting email gateway, where I am currently adding encrypted PDF emails. After I made the pull request I installed my software on the production server (also Ubuntu 14.04). Unfortunately the Ubuntu version of wkhtmltopdf needs an installed X server, which is unacceptable on a server. So I had to change to a newer wkhtmltopdf package, and the |--load-error-handling| option reappeared.

As far as the Unicode support goes; this is the one that confuses
me. You say |payload.get_payload(decode=True)| doesn't return
Unicode, but it's been my experience that it does. The Python
documentation seems to concur; it says here
<https://docs.python.org/3/library/email.message.html#email.message.Message.get_payload>
that it returns a string when |is_multipart| is False, and I think
all Python strings in 3+ are Unicode. Could you please help me by
being a bit more specific about what your _decodetxt() function is
working around? I can't figure out from your code what you are
doing. Do you have any references to the problem on the web? I
think the best way to illustrate the issue would be to create a
failing test which your code fixes; can you help me figure out how
I would do that?

This is a really difficult thing and it cost me a lot of time. See http://bugs.python.org/issue18271 for more information. The code in _decodetxt() is more or less the original decode function, except that it ensures it always delivers Unicode.

gpgmailencrypt, Sep 25 '15 06:09

The topic is also mentioned here: https://github.com/andrewferrier/email2pdf/issues/34

Now one of the .eml files I ran the script on also produced:

Traceback (most recent call last):
  File "/usr/bin/email2pdf", line 733, in call_main
    (warning_pending, mostly_hide_warnings) = main(argv, syslog_handler, syserr_handler)
  File "/usr/bin/email2pdf", line 109, in main
    input_data = get_input_data(args)
  File "/usr/bin/email2pdf", line 261, in get_input_data
    data = input_handle.read()
  File "/home/user1/.virtualenvs/email2pdf_env/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 755: invalid start byte

and I'm trying to work out the most efficient way to fix it.
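The traceback shows the file being read through a UTF-8 text codec, which fails on the first byte that is not valid UTF-8 (here 0xfc, which is "ü" in iso-8859-1 and windows-1252). One common way around this, sketched below, is to read the file in binary mode and let the email package apply the charset each part declares. The load_message helper and the sample message are assumptions for illustration; this is not how email2pdf currently reads its input.

```python
import os
import tempfile
from email import message_from_binary_file, policy


def load_message(path):
    """Parse an .eml file without decoding it as text first."""
    # Binary mode: no codec is applied while reading, so a stray 0xfc
    # byte cannot raise UnicodeDecodeError at this point.
    with open(path, "rb") as handle:
        return message_from_binary_file(handle, policy=policy.default)


# Made-up sample message containing the same kind of byte as the traceback.
raw = (
    b"Subject: test\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Gr\xfc\xdfe\r\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.eml")
    with open(path, "wb") as out:
        out.write(raw)
    message = load_message(path)
    # With policy.default, get_content() decodes a text part using the
    # charset from its Content-Type header and returns str.
    body = message.get_content()
```

With this approach no single encoding has to be chosen for the whole file, which matters when a directory mixes iso-8859-1, windows-1252 and UTF-8 messages.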

I ran email2pdf from a bash script on a collection of files in a directory and noticed that some of the resulting PDFs had correct Unicode while others were garbled. Then I added --encoding to the command, and depending on its value, different files came out right or wrong.

The contents of the directory looks like this:

$ ls
1.eml  2.eml  2.pdf  3.eml  4.eml  5.eml  6.eml

So digging deeper:

 $ grep -r charset .
./2.eml:	charset="iso-8859-1"
./6.eml:Content-Type: text/plain; charset=windows-1252; format=flowed
./1.eml:	charset="iso-8859-1"
./1.eml:	charset="iso-8859-1"
./1.eml:charset=3Diso-8859-1">
./5.eml:Content-Type: text/plain; charset=windows-1252; format=flowed
./3.eml:Content-Type: text/plain; charset=UTF-8
./4.eml:Content-Type: text/plain; charset=windows-1252; format=flowed

What a mess. Does the script read these values? I think Thunderbird does when you print to PDF. Now I'm even wondering whether it is wkhtmltopdf, not email2pdf, that is not reading the charset.

Actually, I tried running the script on the .eml files individually, first grepping the charset out of each .eml file, and that still did not help! Printing to a PDF file from Thunderbird does work, and who knows what magic Thunderbird is doing to find out the encoding.

https://stackoverflow.com/questions/2281646/whats-the-difference-between-encoding-and-charset

There is even more to it:

https://stackoverflow.com/questions/39235436/python-auto-detect-email-content-encoding

$ grep -r Content-Transfer-Encoding .
./2.eml:Content-Transfer-Encoding: quoted-printable
./6.eml:Content-Transfer-Encoding: 8bit
./1.eml:Content-Transfer-Encoding: quoted-printable
./1.eml:Content-Transfer-Encoding: quoted-printable
./5.eml:Content-Transfer-Encoding: 8bit
./3.eml:Content-Transfer-Encoding: quoted-printable
./4.eml:Content-Transfer-Encoding: 8bit
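The two grep runs can be reproduced per MIME part with Python's email package, which also shows which charset goes with which transfer encoding. This is only a sketch for inspecting messages; the describe_parts helper and the sample message are made up, and nothing here claims to be what email2pdf or Thunderbird does.

```python
from email import message_from_bytes


def describe_parts(raw: bytes):
    """Return (content_type, charset, transfer_encoding) per leaf part."""
    message = message_from_bytes(raw)
    rows = []
    for part in message.walk():
        if part.is_multipart():
            continue  # containers only group other parts
        rows.append((
            part.get_content_type(),
            part.get_content_charset(),  # None if no charset is declared
            str(part.get("Content-Transfer-Encoding", "7bit")).lower(),
        ))
    return rows


# Made-up two-part message mixing the charsets and encodings seen above.
sample = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b"\r\n'
    b"\r\n"
    b"--b\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Gr=FC=DFe\r\n"
    b"--b\r\n"
    b"Content-Type: text/html; charset=windows-1252\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"<p>Gr\xfc\xdfe</p>\r\n"
    b"--b--\r\n"
)
parts = describe_parts(sample)
```

Because the charset is a property of each part rather than of the file, a single command-line encoding setting cannot be right for every message in a mixed directory, which matches the behaviour reported above.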

I give up for now, as I urgently need to get the task done; I will use Thunderbird and its GUI manually for all the files, but maybe one day someone will post a solution or fix. I don't envy @andrewferrier's task of working out all these encodings, charsets and Content-Transfer-Encodings.

Apart from the Unicode issue the script seems to work perfectly.

aktivkohle, Mar 16 '21 14:03