
BYTEA column value decoding breaks preview and select. "'utf-8' codec can't decode byte 0xe6 in position 0: invalid continuation byte"

Open dvdotsenko opened this issue 1 year ago • 10 comments

Bug description

Preview and select functionality breaks in the Superset UI on columns of BYTEA type.

Database error
Failed to execute query '15' - 'SELECT 
  addressid,
  addressline1,
  city,
  stateprovinceid,
  postalcode,
  spatiallocation, -- <- BYTEA type triggering the error
  rowguid,
  modifieddate
from address 
limit 1
;': 'utf-8' codec can't decode byte 0xe6 in position 0: invalid continuation byte

The same query completes without error in a Python console (the following is run inside an already running Superset worker container):

>>> connection = engine.connect()
>>> q = 'select * from adventureworks.person.address limit 1'
>>> rows = connection.execute(text(q)).fetchall()
>>> rows
[(1, '1970 Napa Ct.', None, 'Bothell', 79, '98011', b'\xe6\x10\x00\x00\x01\x0c\xae\x8b\xfc(\xbc\xe4G@g\xa8\x91\x89\x89\x8a^\xc0', UUID('9aadcb0d-36cf-483f-84d8-585c2d4ec6e9'), datetime.datetime(2007, 12, 4, 0, 0))]
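The mismatch makes sense once you look at the raw value: the driver returns the BYTEA column as `bytes`, and forcing a UTF-8 decode on those bytes reproduces the exact error Superset reports. A minimal sketch (the byte string is the start of the `spatiallocation` value above):

```python
# The BYTEA value comes back from the driver as raw bytes; forcing a UTF-8
# decode on it reproduces the exact error message from the bug report.
raw = b'\xe6\x10\x00\x00\x01\x0c'  # leading bytes of the spatiallocation value

try:
    raw.decode('utf-8')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)  # 'utf-8' codec can't decode byte 0xe6 in position 0: invalid continuation byte
```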

Table def in Postgres:

addressid INTEGER
addressline1 VARCHAR
addressline2 VARCHAR
city VARCHAR
stateprovinceid INTEGER
postalcode VARCHAR
spatiallocation BYTEA <- including this column in select causes error
rowguid UUID
modifieddate TIMESTAMP

That same table def as seen through Trino (presto):

addressid INTEGER
addressline1 VARCHAR
addressline2 VARCHAR
city VARCHAR
stateprovinceid INTEGER
postalcode VARCHAR
spatiallocation VARBINARY <- including this column in select causes the same error.
rowguid NULL
modifieddate TIMESTAMP

Superset version 4.0.0 (deployed via the official Helm chart)

Per Chrome's Developer Tools > Network, the error is returned by the following call:

POST api/v1/sqllab/execute/

500 INTERNAL SERVER ERROR

Payload:

{"client_id":"kc-rU1IK7","database_id":2,"json":true,"runAsync":false,"schema":"public","sql":"SELECT\n  *\nfrom table_with_bytea \n;","sql_editor_id":"1","tab":"Untitled Query 1","tmp_table_name":"","select_as_cta":false,"ctas_method":"TABLE","queryLimit":1000,"expand_data":true}

Response:

{
    "errors": [
        {
            "message": "Failed to execute query '22' - 'SELECT\n  *\nfrom table_with_bytea \n;': 'utf-8' codec can't decode byte 0xbe in position 2: invalid start byte"
        }
    ]
}

Expected Behavior

Superset should not try to decode values of binary types for display purposes. Returning stringified hex is probably most meaningful.

For display purposes, showing the first ~32 characters of the hex representation of the value is acceptable. Alternatively, replace it with a placeholder like <binary data>.
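The suggested display behavior could look something like the following sketch; `display_binary` is a hypothetical helper for illustration, not an existing Superset function:

```python
def display_binary(value: bytes, max_hex_chars: int = 32) -> str:
    """Render raw bytes as a truncated hex string for display purposes
    instead of attempting a UTF-8 decode (hypothetical helper)."""
    hex_repr = value.hex()
    if len(hex_repr) > max_hex_chars:
        return hex_repr[:max_hex_chars] + "..."
    return hex_repr

print(display_binary(b"\xde\xad\xbe\xef"))  # deadbeef
```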

How to reproduce the bug

In Postgres:

create table table_with_bytea
(
    working  varchar(10),
    breaking bytea
);

insert into table_with_bytea
values ('asdf', '\xDEADBEEF'::bytea)
;

-- run this in SuperSet:
select * from table_with_bytea;

Screenshots/recordings

Screenshot 2024-04-11 at 16 38 48

Superset version

4.0.0

Python version

I don't know

Node version

I don't know

Browser

Chrome

Additional context

official 4.0.0 images from docker hub apache/superset

Checklist

  • [X] I have searched Superset docs and Slack and didn't find a solution to my problem.
  • [X] I have searched the GitHub issue tracker and didn't find a similar bug report.
  • [ ] I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.

dvdotsenko avatar Apr 11 '24 23:04 dvdotsenko

I checked the typical places in the code where .decode( could be used and cannot replicate this issue in a Python console (running within a Superset worker container, v4.0.0).

My setup

from sqlalchemy import create_engine
from sqlalchemy.schema import Table, MetaData
from sqlalchemy.sql.expression import select, text

engine = create_engine('trino://user@trino:80/adventureworks')
connection = engine.connect()

cursor = connection.execute(text("SELECT * FROM adventureworks.public.table_with_bytea"))
data = cursor.fetchall() # base.py:937 fetch_data(): 

I mocked inputs as best I could for the following code entry points, where I see decoding may be occurring explicitly or implicitly:

execute_sql_statements
sql_lab.py:503  result_set = execute_sql_statement(
    return SupersetResultSet(data, ..) <- initializing this with the data from above did NOT produce the error

result_set.py:103 SupersetResultSet

sql_lab.py:551  data, selected_columns, all_columns, expanded_columns = _serialize_and_expand_data(
        result_set, db_engine_spec, use_arrow_data, expand_data
    ) # <-- initializing this with the data from above did NOT produce the error

execution_context_convertor = ExecutionContextConvertor()
  ExecuteSqlCommand.run()
# initializing this with the data from above did NOT produce the error

Not sure at this point where that stray .decode('utf-8') error could be coming from, except for one other place I did not check:

the built-in Celery task payload deserialization code. I cannot immediately find where the default result serializer for Celery is set up in your codebase. If it's JSON, and the encoder used is not the same custom one as in utils, this decode exception may be coming from the Celery task result deserializer layer somewhere...

dvdotsenko avatar Apr 12 '24 16:04 dvdotsenko

Do you have a stack trace? Does it get swallowed?

For the record, I tried to recreate this on docker-compose with these steps:

# bash into the docker
$ docker-compose exec db bash

# fire up psql
$ PGPASSWORD=superset psql -U superset superset

psql> create table test (col BYTEA);
psql> insert into test values ('test');

but that worked... Screenshot 2024-04-17 at 4 27 43 PM

mistercrunch avatar Apr 17 '24 23:04 mistercrunch

FWIW, I've seen a similar issue with other connectors as well, specifically Trino and VARBINARY. I don't have steps to reproduce, but I agree with @dvdotsenko's theory on it happening somewhere when deserializing the result set.

@mistercrunch The values in the column matter, in the above example:

insert into table_with_bytea
values ('asdf', '\xDEADBEEF'::bytea)
;

I believe it's the actual value of the byte row, not an issue with all BYTEA columns; since 'test' is UTF-8-compatible, it doesn't cause any issues. I've seen some VARBINARY columns in Trino render fine, and others fail with almost exactly the same error message as the OP's.
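The value-dependence is easy to confirm: the decode only fails for byte values that happen not to be valid UTF-8, which is why the 'test' row renders fine while the '\xDEADBEEF' row breaks.

```python
# Bytes that happen to be valid UTF-8 decode cleanly...
assert b"test".decode("utf-8") == "test"   # the 'test' row from the repro above

# ...while arbitrary binary data does not.
try:
    b"\xde\xad\xbe\xef".decode("utf-8")    # the '\xDEADBEEF' row
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)  # 'utf-8' codec can't decode byte 0xbe in position 2: invalid start byte
```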

I think it might be related to this execution path:

The SQL Lab run function calls into this zlib_decompress function, and depending on the types / the msgpack config it will attempt to decode.

whitleykeith avatar Apr 18 '24 19:04 whitleykeith

I think I found some previous issues that point to a potential fix

https://github.com/apache/superset/issues/8084, which mentions that https://github.com/apache/superset/pull/5121/files fixed only sync queries and not async ones.

The link is broken, but I think they were referring to setting encoding=None on this line https://github.com/apache/superset/blob/master/superset/sql_lab.py#L353

whitleykeith avatar Apr 24 '24 19:04 whitleykeith

Having a stacktrace would help

mistercrunch avatar Apr 25 '24 00:04 mistercrunch

Is there a better way to find the stacktrace? This is all I can find so far:

superset 2024-04-25 13:19:44,743:INFO:superset.sql_lab:Query 129: Storing results in results backend, key: 6a059c1d-5242-48f1-b602-2a33d66fcdce
superset <ip> - - [25/Apr/2024:13:19:44 +0000] "POST /api/v1/sqllab/execute/ HTTP/1.1" 500 1137 "https://<redacted>/sqllab/"

I'm just deploying Superset via the Helm chart with very few modifications.

whitleykeith avatar Apr 25 '24 13:04 whitleykeith

Oh maybe it gets swallowed...

mistercrunch avatar Apr 25 '24 19:04 mistercrunch

I have this problem with Trino and varbinary columns. This is what I see in logs:

'utf-8' codec can't decode byte 0xd0 in position 1: invalid continuation byte
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/local/lib/python3.10/site-packages/flask_appbuilder/security/decorators.py", line 95, in wraps
    return f(self, *args, **kwargs)
  File "/app/superset/views/base_api.py", line 127, in wraps
    raise ex
  File "/app/superset/views/base_api.py", line 121, in wraps
    duration, response = time_function(f, self, *args, **kwargs)
  File "/app/superset/utils/core.py", line 1463, in time_function
    response = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/flask_appbuilder/api/__init__.py", line 182, in wraps
    return f(self, *args, **kwargs)
  File "/app/superset/utils/log.py", line 255, in wrapper
    value = f(*args, **kwargs)
  File "/app/superset/sqllab/api.py", line 345, in get_results
    json.dumps(
  File "/usr/local/lib/python3.10/site-packages/simplejson/__init__.py", line 412, in dumps
    **kw).encode(obj)
  File "/usr/local/lib/python3.10/site-packages/simplejson/encoder.py", line 296, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.10/site-packages/simplejson/encoder.py", line 378, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 1: invalid continuation byte
2024-04-29 14:51:09,329:ERROR:superset.views.base:'utf-8' codec can't decode byte 0xd0 in position 1: invalid continuation byte
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/local/lib/python3.10/site-packages/flask_appbuilder/security/decorators.py", line 95, in wraps
    return f(self, *args, **kwargs)
  File "/app/superset/views/base_api.py", line 127, in wraps
    raise ex
  File "/app/superset/views/base_api.py", line 121, in wraps
    duration, response = time_function(f, self, *args, **kwargs)
  File "/app/superset/utils/core.py", line 1463, in time_function
    response = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/flask_appbuilder/api/__init__.py", line 182, in wraps
    return f(self, *args, **kwargs)
  File "/app/superset/utils/log.py", line 255, in wrapper
    value = f(*args, **kwargs)
  File "/app/superset/sqllab/api.py", line 345, in get_results
    json.dumps(
  File "/usr/local/lib/python3.10/site-packages/simplejson/__init__.py", line 412, in dumps
    **kw).encode(obj)
  File "/usr/local/lib/python3.10/site-packages/simplejson/encoder.py", line 296, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.10/site-packages/simplejson/encoder.py", line 378, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 1: invalid continuation byte
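The trace shows the failure surfacing inside simplejson's `iterencode`, which attempts to UTF-8-decode byte strings while encoding. For contrast, the stdlib `json` module routes unsupported types through its `default=` hook instead, so a fallback that hex-encodes bytes sidesteps the crash there. A sketch of that idea (`bytes_fallback_ser` is a hypothetical handler for illustration, not Superset's actual serializer, and simplejson's behavior differs in that it attempts the decode before consulting `default`):

```python
import json

def bytes_fallback_ser(obj):
    """Hypothetical default= hook: render bytes as hex instead of letting the
    encoder attempt (and fail) a UTF-8 decode."""
    if isinstance(obj, (bytes, bytearray)):
        return obj.hex()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

row = {"working": "asdf", "breaking": b"\xde\xad\xbe\xef"}
print(json.dumps(row, default=bytes_fallback_ser))
# {"working": "asdf", "breaking": "deadbeef"}
```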

derbysh avatar Apr 29 '24 14:04 derbysh

I was about to submit a patch and found a solution similar to what I was going to put together here: https://github.com/apache/superset/blob/master/superset/utils/core.py#L481-L485

And I even noticed that this should, at the very least, catch it: https://github.com/apache/superset/blob/master/superset/utils/core.py#L506-L512

Then I thought "maybe @derbysh is on an older build" but saw in git blame that the solutions above have been in place forever.

@derbysh if you have a setup where you can reproduce & debug, it'd be great to know the type of the variable that trips things here. I'm guessing it looks like a str to Python (not bytes), but won't utf-8 decode properly.

Seems like:

  • maybe the driver isn't doing the right thing (maybe marking this as bytes would be the right thing for the driver to do?)
  • maybe we should call pessimistic_json_iso_dttm_ser from superset/sqllab/api.py line 145

mistercrunch avatar Apr 29 '24 16:04 mistercrunch

@derbysh there's some relevant information in https://github.com/apache/superset/pull/28266 where I state how I could use more input, or whether even that patch fixes your issue. If you're able to patch your instance with the PR, it may fix the issue and fill those cells with useful "Unserialize type({TYPE})" information in the UI. I'm curious as to whether we have a custom type here, say a Trino.VarBinary, or whether it's a simple str that's not utf8-encodable.

In any case, we should handle both these cases gracefully.

mistercrunch avatar Apr 29 '24 19:04 mistercrunch

https://github.com/apache/superset/pull/28266 should fix the main issue here, though we could re-open a more specific issue around handling more specific types in specific database engines.

mistercrunch avatar May 06 '24 19:05 mistercrunch