pytidylib
UnicodeDecodeError with call to tidy_document
This may seem a bit simplistic, but I couldn't figure out how to reproduce this manually, so maybe you have an idea how to fix the following traceback.
Traceback (most recent call last):
File "_ctypes/callbacks.c", line 314, in 'calling callback function'
File "/home/vagrant/src/vendor/src/pytidylib/tidylib/sink.py", line 79, in put_byte
write_func(byte.decode('utf-8'))
File "/home/vagrant/env/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError
Sorry for the odd formatting, but that's all I got from a Celery task that calls tidy_document: https://github.com/mozilla/kuma/blob/8693de789413ef81e74da6d1f02aa39421eb611b/kuma/wiki/helpers.py#L92-L95
Others seem to have a similar problem and have worked around it: https://github.com/1flow/python-ftr/blob/90a2108c5ee005f1bf66dbe8cce68f2b7051b839/ftr/extractor.py#L146-L154
Do you know what's causing this?
Sorry, I lost track of this. I do have an idea of what could be happening, but it'd be easier with a sample input document that triggers the error, since of course none of the existing tests catch it.
@countergram The bytes in question are 0xc3 0xa9, which is the UTF-8 encoding of é. For some reason it stumbles over them, though. Here's a better traceback:
Traceback (most recent call last):
File "_ctypes/callbacks.c", line 314, in 'calling callback function'
File "/home/vagrant/src/vendor/src/pytidylib/tidylib/sink.py", line 79, in put_byte
write_func(byte.decode('utf-8'))
File "/home/vagrant/env/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data
Traceback (most recent call last):
File "_ctypes/callbacks.c", line 314, in 'calling callback function'
File "/home/vagrant/src/vendor/src/pytidylib/tidylib/sink.py", line 79, in put_byte
write_func(byte.decode('utf-8'))
File "/home/vagrant/env/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte
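The two errors above are what you would expect if each byte of a multi-byte UTF-8 sequence is decoded on its own, as the `byte.decode('utf-8')` line in the traceback suggests. The following is a minimal sketch of that behavior and of one possible workaround using the standard library's incremental decoder. It does not use pytidylib's actual sink code; the variable names are illustrative only.

```python
import codecs

# The two bytes from the tracebacks: the UTF-8 encoding of "é" (U+00E9).
data = b"\xc3\xa9"

# Decoding the complete sequence works.
whole = data.decode("utf-8")

# Decoding one byte at a time fails for both bytes: the first is an
# incomplete sequence, the second is a continuation byte with no lead byte.
errors = []
for i in range(len(data)):
    single = data[i:i + 1]
    try:
        single.decode("utf-8")
    except UnicodeDecodeError as exc:
        errors.append(exc.reason)

# An incremental decoder buffers incomplete sequences between calls,
# so feeding it one byte at a time still yields the full character.
decoder = codecs.getincrementaldecoder("utf-8")()
incremental = "".join(decoder.decode(data[i:i + 1]) for i in range(len(data)))

print(repr(whole), errors, repr(incremental))
```

Whether an incremental decoder is the right fix for the sink callback is an assumption; it only shows that the bytes themselves are valid UTF-8 when taken together.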
FWIW, this also happens outside of Vagrant, in production.
Ignoring those characters "fixed" the issue for me in local testing: https://gist.github.com/jezdez/579e6a30d85c2ced042a
Have it too
Yes, I've had it too since it migrated into Debian, where it breaks the rawdog RSS feed reader.
Traceback (most recent call last):
File "_ctypes/callbacks.c", line 314, in 'calling callback function'
File "/usr/lib/python2.7/dist-packages/tidylib/sink.py", line 79, in put_byte
write_func(byte.decode('utf-8'))
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte
The cause of this for rawdog was that libtidy 0.99 used ASCII as its default encoding, and libtidy 5 uses UTF-8. rawdog relied on the default and didn't expect to get a UTF-8 encoded result. I fixed this in rawdog 2.22 by explicitly specifying the input and output encodings, so it works the same way on all versions. So it's not pytidylib's fault; it's a tidy bug.
I'm not sure the problem the original poster is seeing is the same thing, though...
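For anyone hitting the rawdog variant of this, here is a rough sketch of what "explicitly specifying the input and output encodings" could look like from Python. `input-encoding` and `output-encoding` are HTML Tidy configuration options; the helper names `with_explicit_encoding` and `tidy_utf8` are hypothetical, and whether a given pytidylib version honors or overrides these options is an assumption you would need to check.

```python
def with_explicit_encoding(options=None, encoding="utf8"):
    """Return a copy of `options` with tidy's encodings set explicitly.

    Hypothetical helper: it only builds the options dict, so the result
    does not depend on which default encoding the installed libtidy uses.
    """
    merged = dict(options or {})
    merged["input-encoding"] = encoding
    merged["output-encoding"] = encoding
    return merged


def tidy_utf8(html, options=None):
    """Run tidy_document with explicit encodings (needs pytidylib and libtidy)."""
    # Imported here so the helper above can be used without tidylib installed.
    from tidylib import tidy_document
    return tidy_document(html, options=with_explicit_encoding(options))


print(with_explicit_encoding({"numeric-entities": 1}))
```

`tidy_utf8` returns the same `(document, errors)` tuple as `tidy_document`; the caller's own options are copied, not modified.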