python-webencodings
Support getstate and setstate on IncrementalEncoder/Decoder
Python 3 introduces a getstate/setstate method pair on the incremental encoders and decoders. It would be nice to expose this, even if only on Py3.
This would be trivial for IncrementalEncoder and probably possible for IncrementalDecoder, but why is it useful?
https://github.com/gsnedders/html5lib-python/commit/d214d0dc930fd62ac1cbe719d80b9fdcb92a50ae uses it for changing encoding while parsing, which is needed to be compliant with HTML. It's hard to get quite the right behaviour without it.
webencodings.IncrementalDecoder looks for a BOM at the beginning of the input and picks the encoding to use based on that. Does that make sense in html5lib’s context of changing encodings while parsing?
If you remove BOM stuff, webencodings.IncrementalDecoder(encoding, errors) is just a wrapper for encoding.codec_info.incrementaldecoder(errors), which does implement getstate/setstate.
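For reference, here is a minimal sketch of what getstate/setstate look like on a plain standard-library incremental decoder. It uses only the `codecs` module, and the sample input and variable names are made up for illustration.

```python
import codecs

# A stdlib incremental decoder, the kind of object that
# encoding.codec_info.incrementaldecoder(errors) would return.
decoder = codecs.getincrementaldecoder("utf-8")("strict")

data = "é".encode("utf-8")  # two bytes: b"\xc3\xa9"

# Feed only the first byte; the decoder buffers it and emits nothing yet.
first = decoder.decode(data[:1])

# Snapshot the decoder: a (buffered_bytes, extra_state) tuple.
state = decoder.getstate()

# Finish decoding the character.
second = decoder.decode(data[1:])

# Rewind to the snapshot and decode the same tail again.
decoder.setstate(state)
replay = decoder.decode(data[1:])
```

After the rewind the decoder behaves as if the second byte had never been fed, which is the property needed for re-reading input after an encoding change.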
html5lib could use webencodings.lookup only to get the right labels, ignore the rest of webencodings, and use Python’s APIs for the actual decoding.
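A rough sketch of that approach, not html5lib's actual code: `make_incremental_decoder` is a hypothetical helper name, and the fallback to `codecs.lookup` is an assumption added only so the snippet runs without webencodings installed (the stdlib does not know every web label).

```python
import codecs


def make_incremental_decoder(label, errors="strict"):
    """Resolve a label, then hand decoding to Python's own codec machinery."""
    try:
        import webencodings
    except ImportError:
        # Assumption for illustration: fall back to the stdlib lookup,
        # which recognises fewer labels than webencodings does.
        codec_info = codecs.lookup(label)
    else:
        encoding = webencodings.lookup(label)
        if encoding is None:
            raise LookupError("unknown encoding label: %r" % label)
        codec_info = encoding.codec_info
    # A plain stdlib decoder: no BOM sniffing, getstate/setstate available.
    return codec_info.incrementaldecoder(errors)


decoder = make_incremental_decoder("utf-8")
text = decoder.decode(b"caf\xc3\xa9", final=True)
```

The returned object bypasses webencodings.IncrementalDecoder entirely, so any BOM handling would have to be done by the caller.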
Yeah, I guess.