pymilter
html.parser is not a working replacement for sgmllib
The general advice has been to switch to lxml - which is also available for python2. However, my primary target system (centos7) has lxml for python3 only (in epel).
There are numerous ways around this:
- use a virtualenv
- file an EPEL bug
- contribute the package to EPEL
- wait until someone else undertakes one of the above
But none of these should be relevant to making design choices for the milter package.
Yeah, between porting to lxml and making html.parser work, lxml seems like the more efficient way to proceed. The requirement for any solution should be that SGMLFilter keeps the same API. That will preserve compatibility with existing milters.
Having read through the lxml docs, I find lxml unacceptable for milter applications. It apparently only knows how to build trees from SAX events and write the tree out again. What most milter applications need is the SAX events themselves. So lxml is out - back to the drawing board.
I solved this for now by porting sgmllib to python3 and including it in the python3 package. The long-term solution is to make xml.sax work, or to find a good SAX API library.
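For reference, here is a minimal sketch of what "SAX events without a tree" looks like with the standard library's xml.sax. The `EventRecorder` class and `record_events` helper are hypothetical names made up for illustration; they are not part of pymilter, and SGMLFilter's real handler names may differ.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Hypothetical handler: collects SAX events, never builds a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # attrs is a SAX attributes object; copy it into a plain dict.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # The parser may deliver one text run in several chunks.
        self.events.append(("data", content))


def record_events(markup):
    """Parse well-formed markup and return the list of events seen."""
    handler = EventRecorder()
    xml.sax.parseString(markup, handler)
    return handler.events


if __name__ == "__main__":
    for event in record_events(b'<p class="x">hi</p>'):
        print(event)
```

Note that xml.sax expects well-formed XML and raises `xml.sax.SAXParseException` otherwise, which is presumably a large part of what "making xml.sax work" for real-world HTML mail would involve.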
It looks like xml.parsers.expat might be able to do the job. There are wrappers to harden it against malicious XML, but when you are not building a tree, the risk is low anyway. We'll need more test cases for SGMLFilter that verify it a) doesn't crash and b) leaves content unchanged (except for overridden handlers).
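A rough sketch of the kind of round-trip test meant here, using xml.parsers.expat directly. `PassThroughFilter`, `RedactText` and `run_filter` are hypothetical stand-ins invented for this example; they do not reproduce SGMLFilter's actual API, and the sample input is deliberately simple, well-formed markup.

```python
import io
from xml.parsers import expat
from xml.sax.saxutils import escape, quoteattr


class PassThroughFilter:
    """Hypothetical stand-in for SGMLFilter: echoes each event to `out`."""

    def __init__(self, out):
        self.out = out
        self.parser = expat.ParserCreate()
        self.parser.StartElementHandler = self.start
        self.parser.EndElementHandler = self.end
        self.parser.CharacterDataHandler = self.data

    def start(self, name, attrs):
        # attrs is a dict of attribute names to values; re-quote on output.
        parts = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in attrs.items())
        self.out.write("<%s%s>" % (name, parts))

    def end(self, name):
        self.out.write("</%s>" % name)

    def data(self, text):
        # expat hands over decoded text, so escape it again when writing.
        self.out.write(escape(text))

    def feed(self, markup):
        self.parser.Parse(markup, True)


class RedactText(PassThroughFilter):
    """Overrides a single handler; everything else must pass through."""

    def data(self, text):
        self.out.write("*" * len(text))


def run_filter(filter_class, markup):
    out = io.StringIO()
    filter_class(out).feed(markup)
    return out.getvalue()


SAMPLE = '<div id="a"><p>Hello &amp; bye</p></div>'

if __name__ == "__main__":
    # b) content is unchanged when no handler is overridden
    assert run_filter(PassThroughFilter, SAMPLE) == SAMPLE
    # ...and only the overridden handler's output differs
    print(run_filter(RedactText, SAMPLE))
```

This only round-trips byte for byte on input that is already in the form the sketch writes out (double-quoted attributes, no empty-element tags, comments or declarations). expat raises `xml.parsers.expat.ExpatError` on malformed input, so the "doesn't crash" half of the test suite would need deliberately broken HTML samples as well.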