lupa UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Hi all, we need some help with an issue we've been seeing in our lupa-related code:

  File "lupa/_lupa.pyx", line 308, in lupa._lupa.LuaRuntime.execute
  File "lupa/_lupa.pyx", line 1324, in lupa._lupa.run_lua
  File "lupa/_lupa.pyx", line 1333, in lupa._lupa.call_lua
  File "lupa/_lupa.pyx", line 1358, in lupa._lupa.execute_lua_call
  File "lupa/_lupa.pyx", line 281, in lupa._lupa.LuaRuntime.reraise_on_exception
  File "lupa/_lupa.pyx", line 1496, in lupa._lupa.py_call_with_gil
  File "lupa/_lupa.pyx", line 1459, in lupa._lupa.call_python
  File "lupa/_lupa.pyx", line 1144, in lupa._lupa.py_from_lua
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

From what we have seen, this happens when the Lua code calls a Python function after having used string.sub on a string containing "’" (byte 153) or "”" (byte 226).

I've built a minimal repo that reproduces the issue: https://github.com/Step2Web/lupa-encoding-issue

We'd greatly appreciate some help in how to resolve this. Thank you in advance and please let me know if there's any other information you'll need.

Aug 21 '22 22:08 Step2Web

Quick update: I'm now convinced that this is actually a bug in how Lua handles Unicode characters. By using sub, we split a Unicode character in the middle, which breaks decoding in Python later on.

Aug 22 '22 19:08 Step2Web

Minimal reproduction: lua.eval('("’"):sub(1,1)')

Sep 17 '22 20:09 Le0Developer

’ is a compound character, represented by multiple bytes in UTF-8.

string.sub ignores compounds and takes only the literal bytes, which by themselves are not valid UTF-8 (neither byte 153 nor byte 226 is valid on its own).
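
To illustrate the point about bytes, here is a small pure-Python sketch that does not use lupa at all. It assumes that slicing a Python bytes object is a fair stand-in for what string.sub does on the Lua side; try_decode is a hypothetical helper name used only for this example.

```python
def try_decode(chunk: bytes) -> str:
    """Decode as UTF-8, or report why decoding failed."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


# U+2019 (the right single quotation mark) is three bytes in UTF-8.
raw = "’".encode("utf-8")
print(list(raw))            # [226, 128, 153]

# Taking one byte at a time mimics a byte-based substring operation.
print(try_decode(raw[0:1]))  # lead byte alone: sequence is incomplete
print(try_decode(raw[1:2]))  # continuation byte alone: invalid start byte
print(try_decode(raw))       # the full three bytes decode fine
```

The middle byte is 0x80 (128), which matches the "can't decode byte 0x80 in position 0: invalid start byte" message in the traceback.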

You can set encoding=None when creating the LuaRuntime to disable decoding and get bytes instead.
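
A minimal sketch of that suggestion, assuming lupa is installed. The helper names first_byte_via_lua and decode_lua_bytes are made up for this example, and replacing undecodable bytes with U+FFFD is just one possible way to handle the result, not something lupa does for you.

```python
def decode_lua_bytes(raw: bytes) -> str:
    """Decode bytes coming back from Lua, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def first_byte_via_lua(text: str) -> bytes:
    """Hypothetical helper: run string.sub(text, 1, 1) in Lua and return raw bytes."""
    from lupa import LuaRuntime  # third-party; imported lazily on purpose

    lua = LuaRuntime(encoding=None)  # disable automatic decoding of Lua strings
    lua_sub = lua.eval("string.sub")
    # With encoding=None, pass bytes so the value reaches Lua as a Lua string.
    return lua_sub(text.encode("utf-8"), 1, 1)
```

Calling decode_lua_bytes(first_byte_via_lua("’")) should then give a replacement character instead of raising UnicodeDecodeError, since the decoding step is now under your control.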

Sep 17 '22 20:09 Le0Developer