w3lib
Python's gb18030 decoder is not the same as w3c's
https://www.w3.org/TR/encoding/#gb18030-decoder specifies a single-byte special case 0x80 → U+20AC for gbk compatibility, but Python's decoder does not perform this translation.
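A minimal sketch of the discrepancy, using only Python's built-in `gb18030` codec. The variable names are illustrative, and `expected_by_spec` simply restates what the linked spec section says; it is not something Python produces.

```python
# Illustration only: feed the lone byte 0x80 to Python's built-in decoder.
data = b"\x80"

try:
    decoded = data.decode("gb18030")
except UnicodeDecodeError as exc:
    # Python treats 0x80 as an invalid or incomplete sequence.
    decoded = None
    print("Python rejects the lone byte:", exc.reason)

# What the W3C/WHATWG gb18030 decoder is specified to return for this input.
expected_by_spec = "\u20ac"

# Python only knows the two-byte GB18030 form of the euro sign.
two_byte_euro = expected_by_spec.encode("gb18030")
print(two_byte_euro)
```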
@HyperHCl , I'm not sure this is the right place to report this decoding issue. Have you submitted the issue to the Python Core developers?
Well, it's fairly clear that Python upstream will not accept this issue: they usually try to support the original national standard, not a w3c/whatwg web standard. Python's codecs are quite pedantic, cf. ftfy's "sloppy" encodings. To Python, this problem is just the world doing things The Wrong Way, but for codecs to be useful on the web, people have to make them as wrong as the rest of the world.
@HyperHCl , I see. But where does this fit into w3lib?
By Googling for "whatwg encoding python" I found an implementation for that standard called webencodings. ~~I haven't actually verified how well it works (or whether it works at all) though.~~ Uh oops... It only provides a table of aliases that still points to Python's `windows-1252` and `gb18030`. Sounds like time to reinvent the wheel: say, `w3lib.codecs` or just a separate `w3codecs`.
Implementations for each codec in question:
- Single-byte Windows code pages: should be similar to ftfy's sloppy codecs.
- `gb18030` and `gbk` can be wrappers around Python's fast, native one:
  - `gb18030` decoder:
    - in valid GBK/GB18030 text, a `0x80` in lead-byte position can only be that single-byte euro sign, so it can be rewritten to `b'\xA2\xE3'` before decoding. A blanket `inputbytes.replace(b'\x80', b'\xA2\xE3')` is not safe, though, because `0x80` is also a legal trail byte (and `bytes.maketrans` cannot map one byte to two). The same property may be used to construct a stream decoder and finally a complete one. Alternatively,
    - wrap an error handler that handles `0x80` and carries on.
  - `gbk` encoder:
    - use a `gb18030` encoder wrapper that raises on seeing a four-byte GB18030 sequence. Alternatively:
    - an error handler around the `gbk` encoder that handles `u'\u20AC'` → `b'\x80'`
- Haven't looked into other MBCS's yet.
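The error-handler alternative for both directions might look roughly like the sketch below. The handler names (`w3c-euro-decode`, `w3c-euro-encode`) and the two helper functions are hypothetical, invented here for illustration; only `codecs.register_error` and the built-in `gb18030`/`gbk` codecs are real. It assumes the native decoder reports a stray `0x80` as an error starting at that byte, which is what CPython appears to do.

```python
import codecs

EURO = "\u20ac"


def _euro_decode_errors(exc):
    """Decode a rejected 0x80 as U+20AC; re-raise every other error."""
    if isinstance(exc, UnicodeDecodeError) and exc.object[exc.start] == 0x80:
        # Consume exactly one byte and resume decoding right after it.
        return EURO, exc.start + 1
    raise exc


def _euro_encode_errors(exc):
    """Encode U+20AC as the single byte 0x80; re-raise every other error."""
    if isinstance(exc, UnicodeEncodeError) and exc.object[exc.start] == EURO:
        # Encode error handlers may return bytes directly in Python 3.
        return b"\x80", exc.start + 1
    raise exc


# Hypothetical handler names, registered once at import time.
codecs.register_error("w3c-euro-decode", _euro_decode_errors)
codecs.register_error("w3c-euro-encode", _euro_encode_errors)


def decode_gb18030_web(data: bytes) -> str:
    """gb18030 decoding plus the web's single-byte 0x80 euro special case."""
    return data.decode("gb18030", errors="w3c-euro-decode")


def encode_gbk_web(text: str) -> bytes:
    """gbk encoding that emits 0x80 for the euro sign."""
    return text.encode("gbk", errors="w3c-euro-encode")
```

Because the handler only runs where the native decoder already rejects the input, a `0x80` that is a legitimate trail byte (as in `b'\x81\x80'`) is left alone. This sketch covers only the euro special case and does not claim to match the web standard in any other respect.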
Since this thread is labeled as discussion...
I think many Python web applications face this problem.
That is, since Python codecs follow the `unicode.org` spec, each developer has to invent their own way to support the web's 'sloppy' encodings.
ftfy solves part of the problem, but just creating codecs that follow `encoding.spec.whatwg` seems the obvious solution, and in fact the ftfy author himself, @rspeer, proposed including them in the stdlib:
https://mail.python.org/pipermail/python-ideas/2018-January/048583.html
But aside from the stdlib discussion, I couldn't find any other third-party libraries or popular solutions, nor any document or evidence saying that it isn't worth doing, if that is the case. (At least w3lib doesn't do anything about it.)
What are people thinking and doing?