pymilter icon indicating copy to clipboard operation
pymilter copied to clipboard

str vs bytes decisions

Open sdgathman opened this issue 8 years ago • 40 comments

It is fairly easy to chose between str and bytes in the C API, it is typically choosing s or y in the format strings for PyArg_Parse and Py_BuildValue. One exception is that there is no equivalent of z (allow None and convert to NULL) for y. This prevents simply making everything bytes, even though the C side does everything in 8-bits.

Obviously, body and replacebody are bytes. Pathname (from connect callback for unix socket) is recommended by C-API docs to be str, converted by a standard pathname converter (which handles unix vs windows, etc).

The header and addheader I made bytes, but then chgheader has to deal with passing None, so I can't simply use the y format, and making chgheader str while the other two are bytes would be inconsistent.

sdgathman avatar Nov 01 '16 20:11 sdgathman

After several months to reflect, I think header, addheader should be str to be consistent with chgheader. There may be useful things to do with encoding in the future. (internationalizing, for instance)

sdgathman avatar Mar 20 '17 04:03 sdgathman

i just happened to received some emails with non-ascii chars in the subject line without properly encoding, pymilter threw "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 8: invalid start byte".. I don't know where to look at because there was no traceback..:-(

william6502 avatar Aug 07 '18 12:08 william6502

Is this with python3?

sdgathman avatar Aug 07 '18 14:08 sdgathman

yea..using Python 3.5.1

william6502 avatar Aug 07 '18 17:08 william6502

i just happened to received some emails with non-ascii chars in the subject line without properly encoding, pymilter threw "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 8: invalid start byte".. I don't know where to look at because there was no traceback..:-(

Same here. How do I fix this? I'm using Python 3.6.8 on CentOS 7 and I get a similar error. Tested on both python36-pymilter-1.0.3-2.el7 rpm and on 1.0.4. Same code works with Python 2.7.5.

$ python Python 2.7.5 (default, Jun 20 2019, 20:27:34) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import chardet >>> val=b'\xc0\xb4\xd7\xd4' >>> val.decode() Traceback (most recent call last): File "", line 1, in UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 0: ordinal not in range(128) >>> val.decode('latin-1') u'\xc0\xb4\xd7\xd4' >>> chardet.detect(val) {'confidence': 0.7679697235616183, 'language': 'Russian', 'encoding': 'KOI8-R'} >>> val.decode('KOI8-R') u'\u044e\u2562\u0432\u0442'

$ python3 Python 3.6.8 (default, Apr 25 2019, 21:02:35) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import chardet >>> val=b'\xc0\xb4\xd7\xd4' >>> val.decode() Traceback (most recent call last): File "", line 1, in UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte >>> val.decode('latin-1') 'À´×Ô' >>> chardet.detect(val) {'encoding': 'KOI8-R', 'confidence': 0.7679697235616183} >>> val.decode('KOI8-R') 'ю╢вт'

apircalabu avatar Jun 27 '19 06:06 apircalabu

i fixed it in my pymilter fork and tested on a production server for a long time, so far so good.

william6502 avatar Jun 27 '19 06:06 william6502

Great, now testing, thanks.

apircalabu avatar Jun 27 '19 07:06 apircalabu

More important than testing for a long time is a test case to add to unit testing. Did you save the email that made it fail? Maybe just the Subject header field to add to a test email..

sdgathman avatar Jun 27 '19 19:06 sdgathman

The testing mail is pretty simple, just change an existing mail subject line to contain some utf-8 characters, then it will crash the milter. I didn't create a pull request as my patch require change in the milter to work with new header return (bytes) which will break a lot of existing milter. I'm using my patched version on a logistic company email server which mails tend to be from different countries/mail clients and language encoding, so far it works pretty well.

william6502 avatar Jun 27 '19 23:06 william6502

More important than testing for a long time is a test case to add to unit testing. Did you save the email that made it fail? Maybe just the Subject header field to add to a test email..

Perhaps a unit test against an email that has this string as Subject value? b'\xC0\xB4\xD7\xD4\x71\x71\x2E\x63\x6F\x6D\xB5\xC4\xCD\xCB\xD0\xC5'

apircalabu avatar Jun 27 '19 23:06 apircalabu

I'm having a slightly different experience with the UTF-8 characters in Subject header. In the case of UTF-8 bytes being present in the header, the milter does not crash. However when non-UTF8 (and by extension non-ascii) bytes are present it does crash.

So for example the byte 163 "£" symbol (8859-1) in a subject line will cause the milter to crash. However the equivalent in UTF-8 (0xC2 0xA3) is causing no problem.

If I use @william6502 's fork and return bytes then the milter does not crash. As it happens because Postfix tolerates these invalid bytes in Headers, we have to be able to ignore/handle them. At present our Milter cannot deal with these emails.

We are running Python 3.5.3

scandox avatar Jul 25 '19 14:07 scandox

I think what is needed is for the low level header callback to pass bytes. Then have another layer that can be hooked. I.e. the non-OO python callback is in Milter.init.py and should take bytes, converting via a supplied encoding before calling the OO header method. The encoding can be set to None for the OO header method to get bytes, and I can introduce an encoding exception method as well.

What do you think?

sdgathman avatar Jul 25 '19 21:07 sdgathman

Encoding exceptions in a callback are somewhat awkward. Hey, I think the encoding should be supplied via a decorator, since that works well for a lot of the other callback options!

sdgathman avatar Jul 25 '19 21:07 sdgathman

What about the other way? To make the low level addheader and chgheader take bytes, we need to null terminate the bytes in miltermodule, since the Python C API does not provide that function in python3. Or did I miss it?

Wait, isn't there an encoding that is essentially "hi bytes zero"? How do we make the C API use that? *** Goes to search docs...

sdgathman avatar Jul 25 '19 21:07 sdgathman

Well I would very much defer to your judgement. I don't know half as much as I should about how libmilter or pymilter work - except that they're a huge boon to me!

Your first suggestion sounds good if I understood it. If I understand the situation correctly there's probably a lot of code out there with a header callback that expects a string (including probably in my infra). If I could add a decorator that caused it to receive bytes instead that would be fantastic from a useability / compatibility point of view.

scandox avatar Jul 25 '19 22:07 scandox

In reviewing this long standing issue, I realized that chgheader is the only API making bytes for everything at the low level inconsistent. True, py3 offers no 'y' equivalent of 'z'. BUT, I can simply test for None, and alter the chgheader call accordingly. So I can make everything bytes. Then, we can patch up compatibility stuff in python.

sdgathman avatar Aug 10 '19 15:08 sdgathman

Bytes everywhere would be great. For what I do, and I suspect most common usage, I just want to get the whole email as bytes, and then use email.message_from_binary_file() in the eom() callback where everything interesting happens.

andredalle avatar Aug 19 '19 09:08 andredalle

@scandox Added test case that gets this traceback in python3: a01f598

ERROR: testHeader (testsample.BMSMilterTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/stuart/src/pymilter/testsample.py", line 33, in testHeader
    rc = ctx._feedFile(fp)
  File "/home/stuart/src/pymilter/Milter/testctx.py", line 259, in _feedFile
    msg = mime.message_from_file(fp)
  File "/home/stuart/src/pymilter/mime.py", line 310, in message_from_file
    msg = message_from_binary_file(fp,MimeMessage)
  File "/usr/lib64/python3.7/email/__init__.py", line 62, in message_from_binary_file
    return BytesParser(*args, **kws).parse(fp)
  File "/usr/lib64/python3.7/email/parser.py", line 110, in parse
    return self.parser.parse(fp, headersonly)
  File "/usr/lib64/python3.7/email/parser.py", line 54, in parse
    data = fp.read(8192)
  File "/usr/lib64/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 156: invalid start byte

Does that cover the problems in production? Do I need another test case?

sdgathman avatar Aug 20 '19 22:08 sdgathman

Nah, test case was not reading file as binary. Now it uncovers a bug in testctx.

sdgathman avatar Aug 20 '19 23:08 sdgathman

Once again, the built-in email package utterly fails me. >:-( While it seems to be able to parse the malformed test file with email.message_from_binary_file(fp), and the invalid header results in a header object, which (reasonably) does not convert to a str, there doesn't seem to be any way to get the actual bytes of the header. I need that to be able to pass bytes to callbacks in the test frameworks - if I am going to change the low level API.

sdgathman avatar Aug 27 '19 23:08 sdgathman

I may have to port mimetools from python2.7. :-(

sdgathman avatar Aug 28 '19 00:08 sdgathman

The screwup happens early in email.parser.BytesParser.parse:

  fp = TextIOWrapper(fp, encoding='ascii', errors='surrogateescape')

sdgathman avatar Aug 28 '19 00:08 sdgathman

surrogateescape saves the original bytes in a way that encode can recover. So, 4749f0f now produces this on the invalid header bytes test case:

ERROR: testHeader (testsample.BMSMilterTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/stuart/src/pymilter/testsample.py", line 33, in testHeader
    rc = ctx._feedFile(fp)
  File "/home/stuart/src/pymilter/Milter/testctx.py", line 287, in _feedFile
    rc = self._header(h,val)
  File "/home/stuart/src/pymilter/Milter/testctx.py", line 236, in _header
    return Milter.header_callback(self,fld,v)
  File "/home/stuart/src/pymilter/Milter/__init__.py", line 709, in header_callback
    return m.header(fld,s)
  File "/home/stuart/src/pymilter/sample.py", line 105, in header
    self.log('%s: %s' % (name,val))
  File "/home/stuart/src/pymilter/sample.py", line 27, in log
    for i in msg: print(i,end=None)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 9-12: surrogates not allowed

At this point, an application could recover the original bytes like I did in testctx.py:

b = s.encode(encoding='ascii',errors='surrogateescape')

But, with this test case added, I'll work on adding a decorator for header callback.

sdgathman avatar Aug 28 '19 01:08 sdgathman

Instead of a decorator, I think it is cleaner to add a header_bytes() method to Milter.Base, which is invoked on callback:

  def header_bytes(self,fld,val):
     s = val.decode(encoding='ascii',errors='surrogateescape')
     return self.header(fld,s)

Applications could then override header_bytes() to deal directly with bytes, or header() to get a surrogate escaped unicode str. Any objections? Suggestions for better name?

sdgathman avatar Dec 31 '19 23:12 sdgathman

Ok, that solution makes protocol_mask() too complicated, as it then has to deal with an alias for the header call back. How about we just pass the original bytes as a keyword arg which the app can ignore? No, that requires the encode even if not used. How about if apps just do:

class MyMilter(Milter.Base):
  def __init__():
    self.header_bytes = self.header
    ...

Is that clean enough to skip the surrogate encoding? Maybe a class decorator could do that?

sdgathman avatar Jan 01 '20 00:01 sdgathman

Next idea: have default header_bytes pass str decoded as 'ascii'. If that fails, pass the bytes instead. That counts as an api change - is it too late to make that change?

sdgathman avatar Jan 01 '20 01:01 sdgathman

Just a quick question since i stumbled upon this issue, too: Is the fix ready to be pushed to pypi.org as a new version?

JPT580 avatar Jan 30 '20 16:01 JPT580

@JPT580 I think @sdgathman is still cogitating on the best way to fix it. I think there's a lot of different considerations involved reviewing this thread. So not an easy decision.

We are using @william6502 's fork which works for our purposes but which might not suit all cases.

scandox avatar Jan 30 '20 16:01 scandox

@scandox Thanks for your reply! I will take a look into getting that fork running.

JPT580 avatar Jan 30 '20 16:01 JPT580

@JPT580

Just a quick question since i stumbled upon this issue, too: Is the fix ready to be pushed to pypi.org as a new version?

The current version on github has bytes available via surrogate encoding. It looks like pypi.org has 1.0.4. Let me check if that includes surrogate encoding.

Going forward, I want to have both header_bytes and header callbacks. The idea is for milter module to invoke header_bytes callback, which in turn invokes header callback with bytes decoded from ascii to str with surrogates. The tricky part is making sure the magic callback decorators still work.

You have always been able to hook the low level module callback to get bytes.

sdgathman avatar Jan 30 '20 17:01 sdgathman