lupa
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Hi all, we need some help with an issue we've been seeing in our lupa-related code:
File "lupa/_lupa.pyx", line 308, in lupa._lupa.LuaRuntime.execute
File "lupa/_lupa.pyx", line 1324, in lupa._lupa.run_lua
File "lupa/_lupa.pyx", line 1333, in lupa._lupa.call_lua
File "lupa/_lupa.pyx", line 1358, in lupa._lupa.execute_lua_call
File "lupa/_lupa.pyx", line 281, in lupa._lupa.LuaRuntime.reraise_on_exception
File "lupa/_lupa.pyx", line 1496, in lupa._lupa.py_call_with_gil
File "lupa/_lupa.pyx", line 1459, in lupa._lupa.call_python
File "lupa/_lupa.pyx", line 1144, in lupa._lupa.py_from_lua
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
From what we have seen, this happens when the Lua code calls a Python function after having used string.sub on a string containing ’ (UTF-8 bytes 226, 128, 153) or ” (UTF-8 bytes 226, 128, 157).
I've built a minimal repo that reproduces the issue: https://github.com/Step2Web/lupa-encoding-issue
We'd greatly appreciate some help resolving this. Thank you in advance, and please let me know if there's any other information you need.
Quick update: I'm now convinced that this is actually a bug in how Lua handles Unicode characters. By using sub, we're splitting a multi-byte Unicode character, which breaks decoding in Python later on.
Minimal reproduction: lua.eval('("’"):sub(1,1)')
’ is a compound character, represented in UTF-8 by multiple bytes (226, 128, 153).
string.sub ignores compounds and works on individual bytes, so it returns a single byte that is not valid UTF-8 on its own (neither 226 nor 153 is valid by itself).
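The byte-level behaviour can be checked in plain Python, without lupa. This is only a sketch: it assumes that slicing a Python bytes object is a fair stand-in for what string.sub does on the Lua side, and try_decode is a hypothetical helper name used for illustration.

```python
# Plain-Python illustration (no lupa involved) of byte-wise slicing
# applied to a multi-byte UTF-8 character.
s = "\u2019"             # right single quotation mark
raw = s.encode("utf-8")
print(list(raw))         # [226, 128, 153]


def try_decode(chunk):
    """Return the decoded text, or the UnicodeDecodeError message on failure."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)


print(try_decode(raw[:1]))   # first byte only: decoding fails
print(try_decode(raw[1:]))   # slice starting at byte 0x80: decoding fails
print(try_decode(raw))       # all three bytes: decodes to the original character
```

The slice that starts at the second byte fails on byte 0x80 in position 0, which is the same byte and position reported in the traceback above.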
You can set encoding=None when creating the LuaRuntime to disable decoding and get bytes instead.
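With decoding disabled, it is up to the Python side to decide how the received bytes are turned into text. A minimal sketch of a tolerant decoder follows; to_text is a hypothetical helper, not part of lupa, and UTF-8 is assumed as the intended encoding.

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes leniently; return any other value unchanged."""
    if isinstance(value, bytes):
        # errors="replace" substitutes U+FFFD for undecodable bytes
        # instead of raising UnicodeDecodeError.
        return value.decode(encoding, errors="replace")
    return value


print(to_text(b"\xe2\x80\x99"))  # complete character decodes normally
print(to_text(b"\x80"))          # stray continuation byte becomes U+FFFD
```

Whether replacing broken bytes is acceptable depends on the application; keeping the raw bytes, or fixing the Lua code so that it does not cut inside a character, may be the better choice.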