BYTEA column value decoding breaks preview and select. "'utf-8' codec can't decode byte 0xe6 in position 0: invalid continuation byte"
Bug description
Preview and SELECT functionality break in the Superset UI on columns of BYTEA type.
Database error
Failed to execute query '15' - 'SELECT
addressid,
addressline1,
city,
stateprovinceid,
postalcode,
spatiallocation, -- <- BYTEA type triggering the error
rowguid,
modifieddate
from address
limit 1
;': 'utf-8' codec can't decode byte 0xe6 in position 0: invalid continuation byte
The same query completes without error in a Python console (the following is run inside an already running Superset worker container):
>>> connection = engine.connect()
>>> q = 'select * from adventureworks.person.address limit 1'
>>> rows = connection.execute(text(q)).fetchall()
>>> rows
[(1, '1970 Napa Ct.', None, 'Bothell', 79, '98011', b'\xe6\x10\x00\x00\x01\x0c\xae\x8b\xfc(\xbc\xe4G@g\xa8\x91\x89\x89\x8a^\xc0', UUID('9aadcb0d-36cf-483f-84d8-585c2d4ec6e9'), datetime.datetime(2007, 12, 4, 0, 0))]
Table def in Postgres:
addressid INTEGER
addressline1 VARCHAR
addressline2 VARCHAR
city VARCHAR
stateprovinceid INTEGER
postalcode VARCHAR
spatiallocation BYTEA <- including this column in select causes error
rowguid UUID
modifieddate TIMESTAMP
That same table def as seen through Trino (presto):
addressid INTEGER
addressline1 VARCHAR
addressline2 VARCHAR
city VARCHAR
stateprovinceid INTEGER
postalcode VARCHAR
spatiallocation VARBINARY <- including this column in select causes the same error
rowguid NULL
modifieddate TIMESTAMP
Superset version 4.0.0 (deployed via the official Helm chart)
Per Chrome's Developer Tools > Network, the error is returned by the following call:
POST api/v1/sqllab/execute/
500 INTERNAL SERVER ERROR
Payload:
{"client_id":"kc-rU1IK7","database_id":2,"json":true,"runAsync":false,"schema":"public","sql":"SELECT\n *\nfrom table_with_bytea \n;","sql_editor_id":"1","tab":"Untitled Query 1","tmp_table_name":"","select_as_cta":false,"ctas_method":"TABLE","queryLimit":1000,"expand_data":true}
Response:
{
"errors": [
{
"message": "Failed to execute query '22' - 'SELECT\n *\nfrom table_with_bytea \n;': 'utf-8' codec can't decode byte 0xbe in position 2: invalid start byte"
}
]
}
Expected Behavior
Superset should not try to decode values of binary types for display purposes. Returning a stringified hex representation is probably most meaningful.
For display, showing the first ~32 characters of the hex representation of the value would be acceptable. Alternatively, replace it with a placeholder like <binary data>.
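A minimal sketch of that idea (a hypothetical helper, not actual Superset code): hex-encode the raw bytes and truncate for display, instead of ever attempting a UTF-8 decode.

```python
# Hypothetical helper (not Superset code): render a binary cell as a
# truncated hex preview instead of attempting a UTF-8 decode.
def display_binary(value: bytes, max_hex_chars: int = 32) -> str:
    hex_str = value.hex()
    if len(hex_str) > max_hex_chars:
        return f"0x{hex_str[:max_hex_chars]}…"
    return f"0x{hex_str}"

print(display_binary(b"\xde\xad\xbe\xef"))  # → 0xdeadbeef
```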
How to reproduce the bug
In Postgres:
create table table_with_bytea
(
working varchar(10),
breaking bytea
);
insert into table_with_bytea
values ('asdf', '\xDEADBEEF'::bytea)
;
-- run this in SuperSet:
select * from table_with_bytea;
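The inserted value is what makes this repro work: in plain Python, independent of Superset, 0xde 0xad parses as a valid two-byte UTF-8 sequence, but 0xbe at position 2 is a bare continuation byte, so the decode fails right there.

```python
value = b"\xde\xad\xbe\xef"  # the '\xDEADBEEF'::bytea payload from above

try:
    value.decode("utf-8")
except UnicodeDecodeError as exc:
    # the same 0xbe / "position 2" failure the API response reports
    print(exc)
```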
Screenshots/recordings
Superset version
4.0.0
Python version
I don't know
Node version
I don't know
Browser
Chrome
Additional context
official 4.0.0 images from docker hub apache/superset
Checklist
- [X] I have searched Superset docs and Slack and didn't find a solution to my problem.
- [X] I have searched the GitHub issue tracker and didn't find a similar bug report.
- [ ] I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.
I checked the typical places in the code where .decode( could be used and cannot replicate this issue in a Python console (running within the Superset worker container v4.0.0).
My setup:
from sqlalchemy import create_engine
from sqlalchemy.schema import Table, MetaData
from sqlalchemy.sql.expression import select, text
engine = create_engine('trino://user@trino:80/adventureworks')
connection = engine.connect()
cursor = connection.execute(text("SELECT * FROM adventureworks.public.table_with_bytea"))
data = cursor.fetchall() # base.py:937 fetch_data():
I mocked inputs as best I could for the following code entry points where I see decoding may be occurring explicitly or implicitly:
execute_sql_statements
sql_lab.py:503 result_set = execute_sql_statement(
return SupersetResultSet(data, ..) <- initing this with data from above did NOT produce error
result_set.py:103 SupersetResultSet
sql_lab.py:551 data, selected_columns, all_columns, expanded_columns = _serialize_and_expand_data(
result_set, db_engine_spec, use_arrow_data, expand_data
) # <-- initing this with data from above did NOT produce error
execution_context_convertor = ExecutionContextConvertor()
ExecuteSqlCommand.run()
# initing this with data from above did NOT produce error
Not sure at this point where that stray .decode('utf-8') error could be coming from, except for one other place I did not check:
the built-in Celery task payload deserialization code. I cannot immediately find where you set up the default result serializer for Celery in your codebase. If it's JSON, and the encoder you use is not the same custom one as in utils, this decode exception may be coming from the Celery task result deserializer layer somewhere...
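To illustrate the serializer theory, here is a sketch with the stdlib json module (not Superset's actual encoder): stdlib json raises TypeError on bytes, while simplejson (which appears in the traceback later in this thread) tries to UTF-8-decode them; a default= hook that hex-encodes bytes avoids the decode entirely.

```python
import json

# Hypothetical fallback (not Superset's actual encoder): hex-encode raw
# bytes rather than letting the serializer attempt a UTF-8 decode.
def bytes_safe_default(obj):
    if isinstance(obj, (bytes, bytearray)):
        return obj.hex()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

row = {"working": "asdf", "breaking": b"\xde\xad\xbe\xef"}
print(json.dumps(row, default=bytes_safe_default))
# → {"working": "asdf", "breaking": "deadbeef"}
```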
Do you have a stack trace? Does it get swallowed?
For the record, I tried to recreate this on docker-compose with these steps:
# bash into the docker
$ docker-compose exec db bash
# fire up psql
$ PGPASSWORD=superset psql -U superset superset
psql> create table test (col BYTEA);
psql> insert into test values ('test');
but that worked...
FWIW, I've seen a similar issue with other connectors as well, specifically Trino and VARBINARY. I don't have steps to reproduce, but I agree with @dvdotsenko's theory on it happening somewhere when deserializing the result set.
@mistercrunch The values in the column matter, in the above example:
insert into table_with_bytea
values ('asdf', '\xDEADBEEF'::bytea)
;
I believe it's the actual byte value of the row that matters, not an issue with all BYTEA columns; since 'test' is valid UTF-8, it doesn't have any issues. I've seen VARBINARY columns in Trino render fine, and others fail with almost the exact same error message as the OP's.
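This value-dependence is easy to confirm in plain Python, independent of Superset: only byte strings that happen to be valid UTF-8 survive a blind decode.

```python
# Only byte strings that happen to be valid UTF-8 survive .decode('utf-8').
def is_utf8(b: bytes) -> bool:
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_utf8(b"test"))              # → True  (the psql repro that "worked")
print(is_utf8(b"\xde\xad\xbe\xef"))  # → False (the failing bytea payload)
```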
I think it might be related to this execution path:
The SQL Lab run function calls into this zlib_decompress function and, depending on the types / a msgpack config, it will attempt to decode.
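A minimal illustration of that path (plain zlib plus a decode flag; this is not Superset's actual zlib_decompress signature): decompression itself is byte-safe, and only the optional decode step raises.

```python
import zlib

payload = zlib.compress(b"\xde\xad\xbe\xef")  # stand-in for a cached result blob

raw = zlib.decompress(payload)
assert raw == b"\xde\xad\xbe\xef"  # round-trips fine as bytes

def maybe_decode(data, decode):
    # mirrors the "depending on the types / a msgpack config" branch above
    return data.decode("utf-8") if decode else data

maybe_decode(raw, decode=False)  # fine: returns the raw bytes
try:
    maybe_decode(raw, decode=True)  # the blind decode is what blows up
except UnicodeDecodeError:
    pass
```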
I think I found some previous issues that point to a potential fix
https://github.com/apache/superset/issues/8084 which mentions https://github.com/apache/superset/pull/5121/files fixed only sync queries and not async ones.
The link is broken, but I think they were referring to setting encoding=None on this line https://github.com/apache/superset/blob/master/superset/sql_lab.py#L353
Having a stacktrace would help
Is there a better way to find the stacktrace? This is all I can find so far:
superset 2024-04-25 13:19:44,743:INFO:superset.sql_lab:Query 129: Storing results in results backend, key: 6a059c1d-5242-48f1-b602-2a33d66fcdce
superset <ip> - - [25/Apr/2024:13:19:44 +0000] "POST /api/v1/sqllab/execute/ HTTP/1.1" 500 1137 "https://<redacted>/sqllab/"
I'm just deploying Superset via the Helm chart with very few modifications.
Oh maybe it gets swallowed...
I have this problem with Trino and varbinary columns.
This is what I see in logs:
'utf-8' codec can't decode byte 0xd0 in position 1: invalid continuation byte
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "/usr/local/lib/python3.10/site-packages/flask_appbuilder/security/decorators.py", line 95, in wraps
return f(self, *args, **kwargs)
File "/app/superset/views/base_api.py", line 127, in wraps
raise ex
File "/app/superset/views/base_api.py", line 121, in wraps
duration, response = time_function(f, self, *args, **kwargs)
File "/app/superset/utils/core.py", line 1463, in time_function
response = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/flask_appbuilder/api/__init__.py", line 182, in wraps
return f(self, *args, **kwargs)
File "/app/superset/utils/log.py", line 255, in wrapper
value = f(*args, **kwargs)
File "/app/superset/sqllab/api.py", line 345, in get_results
json.dumps(
File "/usr/local/lib/python3.10/site-packages/simplejson/__init__.py", line 412, in dumps
**kw).encode(obj)
File "/usr/local/lib/python3.10/site-packages/simplejson/encoder.py", line 296, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python3.10/site-packages/simplejson/encoder.py", line 378, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 1: invalid continuation byte
2024-04-29 14:51:09,329:ERROR:superset.views.base:'utf-8' codec can't decode byte 0xd0 in position 1: invalid continuation byte
(same traceback as above)
I was about to submit a patch and found a solution similar to what I was going to put together here: https://github.com/apache/superset/blob/master/superset/utils/core.py#L481-L485
I even noticed that this should, at the very least, catch it: https://github.com/apache/superset/blob/master/superset/utils/core.py#L506-L512
Then I thought "maybe @derbysh is on an older build" but saw in git blame that the solutions above have been in place forever.
@derbysh if you have a setup where you can reproduce & debug, it'd be great to know the type of the variable that trips things up here. I'm guessing it looks like a str to Python (not bytes), but won't UTF-8-decode properly.
Seems like:
- maybe the driver isn't doing the right thing (maybe marking this as bytes would be the right thing for the driver to do?)
- maybe we should call pessimistic_json_iso_dttm_ser from superset/sqllab/api.py line 145
@derbysh there's some relevant information in https://github.com/apache/superset/pull/28266, where I note that I could use more input on whether that patch fixes your issue. If you're able to patch your instance with the PR, it may fix the issue and fill those cells with a useful "Unserialize type({TYPE})" message in the UI. I'm curious as to whether we have a custom type here, say a Trino.VarBinary, or whether it's a simple str that's not utf8-encodable.
In any case, we should handle both these cases gracefully.
https://github.com/apache/superset/pull/28266 should fix the main issue here, though we could re-open a more specific issue around handling more specific types in specific database engines.