pymilter

html.parser is not a working replacement for sgmllib

Open sdgathman opened this issue 8 years ago • 5 comments

The general advice has been to switch to lxml - which is also available for python2. However, my primary target system (centos7) has lxml for python3 only (in epel).

sdgathman avatar Sep 26 '16 22:09 sdgathman

There are numerous ways around this:

  • use a virtualenv
  • file an EPEL bug
  • contribute the package to EPEL
  • wait until someone else undertakes one of the above

But none of these should be relevant to making design choices for the milter package.

whyscream avatar Sep 28 '16 08:09 whyscream

Yeah, between porting to lxml and making html.parser work, lxml seems like the more efficient way to proceed. The requirement for any solution should be that SGMLFilter keeps the same API. That will preserve compatibility with existing milters.

sdgathman avatar Sep 28 '16 15:09 sdgathman

Having read through the lxml docs, I find it unacceptable for milter applications. It apparently only knows how to build trees from SAX events and write the tree out again. What most milter applications need is the SAX events themselves. So lxml is out - back to the drawing board.

sdgathman avatar Sep 29 '16 03:09 sdgathman

I solved this for now by porting sgmllib to python3 and including it in the python3 package. The long-term solution is to make xml.sax work, or to find a good SAX API library.

sdgathman avatar Sep 29 '16 20:09 sdgathman
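
For readers unfamiliar with the distinction drawn above between SAX events and tree building, here is a minimal sketch of event-style parsing with the standard library's xml.sax. The names `EventLogger` and `collect_events` are made up for illustration and are not part of pymilter. It only handles well-formed XML, which is probably part of why "making xml.sax work" on real mail content is non-trivial.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler: records parse events in order, builds no tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # attrs is a mapping-like object; copy it into a plain dict
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # The parser may deliver one text run in several chunks
        self.events.append(("data", content))


def collect_events(markup):
    """Parse a well-formed XML string and return the list of events."""
    handler = EventLogger()
    xml.sax.parseString(markup.encode("utf-8"), handler)
    return handler.events
```

Each callback fires as the parser reaches that point in the input, so a filter can react to a tag or rewrite text without holding the whole document in memory.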

It looks like xml.parsers.expat might be able to do the job. There are wrappers to harden it against malicious XML, but when you are not building a tree, the risk is low anyway. We'll need more test cases for SGMLFilter that verify it a) doesn't crash and b) leaves content unchanged (except for overridden handlers).

sdgathman avatar Oct 02 '16 02:10 sdgathman
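
A rough sketch of the expat idea and of the "leaves content unchanged" test, under loud assumptions: `PassThroughFilter` and `UpperText` are hypothetical names, and this is not SGMLFilter's actual API. Only the `xml.parsers.expat` and `xml.sax.saxutils` calls are real. Note that expat is a strict XML parser, so malformed HTML raises `ExpatError` instead of being tolerated the way sgmllib tolerated it.

```python
import xml.parsers.expat
from xml.sax.saxutils import escape, quoteattr


class PassThroughFilter:
    """Hypothetical filter: re-emits each parse event as markup, no tree."""

    def __init__(self):
        self.out = []
        self.parser = xml.parsers.expat.ParserCreate()
        # Bound methods, so subclass overrides are picked up automatically
        self.parser.StartElementHandler = self.start
        self.parser.EndElementHandler = self.end
        self.parser.CharacterDataHandler = self.data

    def start(self, name, attrs):
        rendered = "".join(
            " %s=%s" % (key, quoteattr(value)) for key, value in attrs.items()
        )
        self.out.append("<%s%s>" % (name, rendered))

    def end(self, name):
        self.out.append("</%s>" % name)

    def data(self, text):
        # Re-escape &, < and > so the output is still well-formed
        self.out.append(escape(text))

    def feed(self, markup):
        """Parse the whole document in one call and return the output."""
        self.parser.Parse(markup, True)
        return "".join(self.out)


class UpperText(PassThroughFilter):
    """Example override: change text content, leave tags alone."""

    def data(self, text):
        self.out.append(escape(text.upper()))
```

A test in the spirit of (b) feeds a document through `PassThroughFilter` and compares the result to the input, then checks that `UpperText` changes only the text. This simplified version does not preserve comments, doctype declarations, or the `<br/>` short form, so a real replacement would need more handlers and more test cases.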