
Python's gb18030 decoder is not the same as w3c's

Open HyperHCl opened this issue 7 years ago • 5 comments

https://www.w3.org/TR/encoding/#gb18030-decoder specifies a single-byte special case 0x80 → U+20AC for gbk compatibility, but Python's decoder does not perform this translation.
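A minimal sketch of the mismatch, assuming a stock CPython 3 with its built-in `gb18030` codec. The variable names are only for illustration.

```python
# The W3C/WHATWG gb18030 decoder maps a lone 0x80 byte to U+20AC (euro sign).
# Python's built-in codec treats the same byte as a decoding error instead.
data = b"\x80"

try:
    text = data.decode("gb18030")
except UnicodeDecodeError as exc:
    text = None
    print("Python rejects the byte:", exc.reason)

# Python does know the euro sign in GB18030, but only as the
# two-byte sequence A2 E3, never as the single byte 0x80.
euro_bytes = "\u20ac".encode("gb18030")
print(euro_bytes)
```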

HyperHCl avatar Oct 09 '16 06:10 HyperHCl

@HyperHCl, I'm not sure this is the right place to report this decoding issue. Have you submitted it to the Python core developers?

redapple avatar Nov 10 '16 10:11 redapple

Well, it's fairly clear that Python upstream will not accept this issue: they usually try to support the original national standard, not a W3C/WHATWG web standard. Python's codecs are quite pedantic; cf. ftfy's "sloppy" encodings. To Python, this problem is just the world doing things The Wrong Way, but for the codecs to be useful on the web, they have to be as wrong as the rest of the world.

HyperHCl avatar Nov 10 '16 14:11 HyperHCl

@HyperHCl, I see. But where does this fit into w3lib?

redapple avatar Nov 14 '16 16:11 redapple

By Googling for "whatwg encoding python" I found an implementation for that standard called webencodings. ~~I haven't actually verified how well it works (or whether it works at all) though.~~ Uh oops... It only provides a table of aliases that still points to Python's windows-1252 and gb18030. Sounds like time to invent a wheel -- say, w3lib.codecs or just a separate w3codecs.

Implementations for each codec in question:

  • Single-byte windows code pages: should be similar to ftfy's sloppy codecs.
  • gb18030 and gbk can be wrappers around Python's fast, native one:
    • gb18030 decoder:
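      • as valid GBK/GB18030 text never uses 0x80 as a lead byte (it only appears as that single-byte euro sign or as the trail byte of a two-byte sequence), a lone 0x80 can be rewritten to b'\xA2\xE3' before decoding. Note that bytes.translate() can't do this, since it maps one byte to exactly one byte, and a blind inputbytes.replace(b'\x80', b'\xA2\xE3') would also hit trail bytes. The same property may be used to construct a stream decoder and finally a complete one. Alternatively,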
      • wrap an error handler that handles 0x80 and carries on.
    • gbk encoder:
      • wrap the gb18030 encoder so that it raises on seeing a four-byte GB18030 sequence. Alternatively:
      • an error handler around the gbk encoder that handles u'\u20AC' → b'\x80'
  • Haven't looked into other MBCS's yet.
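The error-handler variants above could be sketched roughly as follows. This is only an illustration, not anything w3lib ships: the handler name "whatwg-gb" and the helper functions are made up here, and it assumes CPython 3, where an encoding error handler may return bytes.

```python
import codecs

EURO = "\u20ac"


def whatwg_gb_errors(exc):
    """Hypothetical handler: patch only the 0x80 <-> U+20AC special case."""
    if isinstance(exc, UnicodeDecodeError):
        # The built-in decoder reports the failure at the offending lead
        # byte, so a 0x80 consumed as a trail byte never reaches this point.
        if exc.object[exc.start] == 0x80:
            return EURO, exc.start + 1
    elif isinstance(exc, UnicodeEncodeError):
        # Python's gbk codec cannot encode the euro sign at all.
        if exc.object[exc.start] == EURO:
            return b"\x80", exc.start + 1
    # Anything else stays an error, as with errors="strict".
    raise exc


codecs.register_error("whatwg-gb", whatwg_gb_errors)


def decode_gb18030(data):
    """Decode with Python's native codec plus the single-byte euro case."""
    return data.decode("gb18030", errors="whatwg-gb")


def encode_gbk(text):
    """Encode with Python's native gbk codec plus the single-byte euro case."""
    return text.encode("gbk", errors="whatwg-gb")
```

With this in place, `decode_gb18030(b"a\x80b")` yields the euro sign between the two letters, while the heavy lifting is still done by the fast native codec. The sketch only covers one-shot decoding and encoding of complete input; a stream decoder would need more care.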

HyperHCl avatar Nov 14 '16 16:11 HyperHCl

Since this thread is labeled as discussion...

I think many Python web applications face this problem.

That is, since Python's codecs follow the unicode.org specs, each developer has to invent their own way to support the web's 'sloppy' encodings.

ftfy solves part of the problem, but just creating codecs that follow encoding.spec.whatwg.org seems the obvious solution, and the ftfy author himself, @rspeer, actually proposed including them in the stdlib. https://mail.python.org/pipermail/python-ideas/2018-January/048583.html

But aside from the stdlib discussion, I couldn't find any other third-party libraries or popular solutions, nor any document or evidence saying it isn't worth doing, if that is the case. (At least w3lib doesn't do anything about it.)

What are people thinking and doing?

openandclose avatar May 12 '20 19:05 openandclose