pymilter
pymilter copied to clipboard
str vs bytes decisions
It is fairly easy to chose between str and bytes in the C API, it is typically choosing s or y in the format strings for PyArg_Parse and Py_BuildValue. One exception is that there is no equivalent of z (allow None and convert to NULL) for y. This prevents simply making everything bytes, even though the C side does everything in 8-bits.
Obviously, body and replacebody are bytes. Pathname (from connect callback for unix socket) is recommended by C-API docs to be str, converted by a standard pathname converter (which handles unix vs windows, etc).
The header and addheader I made bytes, but then chgheader has to deal with passing None, so I can't simply use the y format, and making chgheader str while the other two are bytes would be inconsistent.
After several months to reflect, I think header, addheader should be str to be consistent with chgheader. There may be useful things to do with encoding in the future. (internationalizing, for instance)
i just happened to received some emails with non-ascii chars in the subject line without properly encoding, pymilter threw "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 8: invalid start byte".. I don't know where to look at because there was no traceback..:-(
Is this with python3?
yea..using Python 3.5.1
i just happened to received some emails with non-ascii chars in the subject line without properly encoding, pymilter threw "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 8: invalid start byte".. I don't know where to look at because there was no traceback..:-(
Same here. How do I fix this? I'm using Python 3.6.8 on CentOS 7 and I get a similar error. Tested on both python36-pymilter-1.0.3-2.el7 rpm and on 1.0.4. Same code works with Python 2.7.5.
$ python
Python 2.7.5 (default, Jun 20 2019, 20:27:34)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import chardet
>>> val=b'\xc0\xb4\xd7\xd4'
>>> val.decode()
Traceback (most recent call last):
File "
$ python3
Python 3.6.8 (default, Apr 25 2019, 21:02:35)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import chardet
>>> val=b'\xc0\xb4\xd7\xd4'
>>> val.decode()
Traceback (most recent call last):
File "
i fixed it in my pymilter fork and tested on a production server for a long time, so far so good.
Great, now testing, thanks.
More important than testing for a long time is a test case to add to unit testing. Did you save the email that made it fail? Maybe just the Subject header field to add to a test email..
The testing mail is pretty simple, just change an existing mail subject line to contain some utf-8 characters, then it will crash the milter. I didn't create a pull request as my patch require change in the milter to work with new header return (bytes) which will break a lot of existing milter. I'm using my patched version on a logistic company email server which mails tend to be from different countries/mail clients and language encoding, so far it works pretty well.
More important than testing for a long time is a test case to add to unit testing. Did you save the email that made it fail? Maybe just the Subject header field to add to a test email..
Perhaps a unit test against an email that has this string as Subject value? b'\xC0\xB4\xD7\xD4\x71\x71\x2E\x63\x6F\x6D\xB5\xC4\xCD\xCB\xD0\xC5'
I'm having a slightly different experience with the UTF-8 characters in Subject header. In the case of UTF-8 bytes being present in the header, the milter does not crash. However when non-UTF8 (and by extension non-ascii) bytes are present it does crash.
So for example the byte 163 "£" symbol (8859-1) in a subject line will cause the milter to crash. However the equivalent in UTF-8 (0xC2 0xA3) is causing no problem.
If I use @william6502 's fork and return bytes then the milter does not crash. As it happens because Postfix tolerates these invalid bytes in Headers, we have to be able to ignore/handle them. At present our Milter cannot deal with these emails.
We are running Python 3.5.3
I think what is needed is for the low level header callback to pass bytes. Then have another layer that can be hooked. I.e. the non-OO python callback is in Milter.init.py and should take bytes, converting via a supplied encoding before calling the OO header method. The encoding can be set to None for the OO header method to get bytes, and I can introduce an encoding exception method as well.
What do you think?
Encoding exceptions in a callback are somewhat awkward. Hey, I think the encoding should be supplied via a decorator, since that works well for a lot of the other callback options!
What about the other way? To make the low level addheader and chgheader take bytes, we need to null terminate the bytes in miltermodule, since the Python C API does not provide that function in python3. Or did I miss it?
Wait, isn't there an encoding that is essentially "hi bytes zero"? How do we make the C API use that? *** Goes to search docs...
Well I would very much defer to your judgement. I don't know half as much as I should about how libmilter or pymilter work - except that they're a huge boon to me!
Your first suggestion sounds good if I understood it. If I understand the situation correctly there's probably a lot of code out there with a header callback that expects a string (including probably in my infra). If I could add a decorator that caused it to receive bytes instead that would be fantastic from a useability / compatibility point of view.
In reviewing this long standing issue, I realized that chgheader is the only API making bytes for everything at the low level inconsistent. True, py3 offers no 'y' equivalent of 'z'. BUT, I can simply test for None, and alter the chgheader call accordingly. So I can make everything bytes. Then, we can patch up compatibility stuff in python.
Bytes everywhere would be great. For what I do, and I suspect most common usage, I just want to get the whole email as bytes, and then use email.message_from_binary_file() in the eom() callback where everything interesting happens.
@scandox Added test case that gets this traceback in python3: a01f598
ERROR: testHeader (testsample.BMSMilterTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/stuart/src/pymilter/testsample.py", line 33, in testHeader
rc = ctx._feedFile(fp)
File "/home/stuart/src/pymilter/Milter/testctx.py", line 259, in _feedFile
msg = mime.message_from_file(fp)
File "/home/stuart/src/pymilter/mime.py", line 310, in message_from_file
msg = message_from_binary_file(fp,MimeMessage)
File "/usr/lib64/python3.7/email/__init__.py", line 62, in message_from_binary_file
return BytesParser(*args, **kws).parse(fp)
File "/usr/lib64/python3.7/email/parser.py", line 110, in parse
return self.parser.parse(fp, headersonly)
File "/usr/lib64/python3.7/email/parser.py", line 54, in parse
data = fp.read(8192)
File "/usr/lib64/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 156: invalid start byte
Does that cover the problems in production? Do I need another test case?
Nah, test case was not reading file as binary. Now it uncovers a bug in testctx.
Once again, the built-in email package utterly fails me. >:-( While it seems to be able to parse the malformed test file with email.message_from_binary_file(fp), and the invalid header results in a header object, which (reasonably) does not convert to a str, there doesn't seem to be any way to get the actual bytes of the header. I need that to be able to pass bytes to callbacks in the test frameworks - if I am going to change the low level API.
I may have to port mimetools from python2.7. :-(
The screwup happens early in email.parser.BytesParser.parse:
fp = TextIOWrapper(fp, encoding='ascii', errors='surrogateescape')
surrogateescape saves the original bytes in a way that encode can recover. So, 4749f0f now produces this on the invalid header bytes test case:
ERROR: testHeader (testsample.BMSMilterTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/stuart/src/pymilter/testsample.py", line 33, in testHeader
rc = ctx._feedFile(fp)
File "/home/stuart/src/pymilter/Milter/testctx.py", line 287, in _feedFile
rc = self._header(h,val)
File "/home/stuart/src/pymilter/Milter/testctx.py", line 236, in _header
return Milter.header_callback(self,fld,v)
File "/home/stuart/src/pymilter/Milter/__init__.py", line 709, in header_callback
return m.header(fld,s)
File "/home/stuart/src/pymilter/sample.py", line 105, in header
self.log('%s: %s' % (name,val))
File "/home/stuart/src/pymilter/sample.py", line 27, in log
for i in msg: print(i,end=None)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 9-12: surrogates not allowed
At this point, an application could recover the original bytes like I did in testctx.py:
b = s.encode(encoding='ascii',errors='surrogateescape')
But, with this test case added, I'll work on adding a decorator for header callback.
Instead of a decorator, I think it is cleaner to add a header_bytes() method to Milter.Base, which is invoked on callback:
def header_bytes(self,fld,val):
s = val.decode(encoding='ascii',errors='surrogateescape')
return self.header(fld,s)
Applications could then override header_bytes() to deal directly with bytes, or header() to get a surrogate escaped unicode str. Any objections? Suggestions for better name?
Ok, that solution makes protocol_mask() too complicated, as it then has to deal with an alias for the header call back. How about we just pass the original bytes as a keyword arg which the app can ignore? No, that requires the encode even if not used. How about if apps just do:
class MyMilter(Milter.Base):
def __init__():
self.header_bytes = self.header
...
Is that clean enough to skip the surrogate encoding? Maybe a class decorator could do that?
Next idea: have default header_bytes pass str decoded as 'ascii'. If that fails, pass the bytes instead. That counts as an api change - is it too late to make that change?
Just a quick question since i stumbled upon this issue, too: Is the fix ready to be pushed to pypi.org as a new version?
@JPT580 I think @sdgathman is still cogitating on the best way to fix it. I think there's a lot of different considerations involved reviewing this thread. So not an easy decision.
We are using @william6502 's fork which works for our purposes but which might not suit all cases.
@scandox Thanks for your reply! I will take a look into getting that fork running.
@JPT580
Just a quick question since i stumbled upon this issue, too: Is the fix ready to be pushed to pypi.org as a new version?
The current version on github has bytes available via surrogate encoding. It looks like pypi.org has 1.0.4. Let me check if that includes surrogate encoding.
Going forward, I want to have both header_bytes and header callbacks. The idea is for milter module to invoke header_bytes callback, which in turn invokes header callback with bytes decoded from ascii to str with surrogates. The tricky part is making sure the magic callback decorators still work.
You have always been able to hook the low level module callback to get bytes.