SNOW-720936: Arrays with nulls return invalid JSON for null values
- What version of Python are you using?
❯ python --version --version
Python 3.10.9 (main, Dec 6 2022, 18:44:57) [GCC 11.3.0]
- What operating system and processor architecture are you using?
❯ python -c 'import platform; print(platform.platform())'
Linux-5.15.86-x86_64-with-glibc2.35
- What are the component versions in the environment (
pip freeze)?
❯ poetry show snowflake-sqlalchemy
name : snowflake-sqlalchemy
version : 1.4.4
description : Snowflake SQLAlchemy Dialect
❯ poetry show snowflake-connector-python
name : snowflake-connector-python
version : 2.9.0
description : Snowflake Connector for Python
- What did you do?
In [19]: import sqlalchemy as sa
In [20]: con = sa.create_engine(...)
In [21]: [(row,)] = con.execute('select array_construct(1,2,null)')
In [22]: row
Out[22]: '[\n 1,\n 2,\n undefined\n]'
In [23]: import json
In [24]: json.loads(row)
-
What did you expect to see?
I think I'd expect to see
null(and thenNonefrom the result of the parse done byjson.loads).nullis a valid JSON value whileundefinedis not. -
Can you set logging to DEBUG and collect the logs?
I don't think doing this helps, as the query succeeds.
Interestingly, if I instead spell NULL as parse_json('null') then valid JSON is returned:
In [20]: [(row,)] = con.execute("select array_construct(1,2,parse_json('null'))")
In [21]: row
Out[21]: '[\n 1,\n 2,\n null\n]'
In [22]: import json
In [23]: json.loads(row)
Out[23]: [1, 2, None]
I guess this is how the nuance of SQL NULL versus the null VARIANT value manifests.
It seems like undefined is being co-opted as the value for SQL's NULL inside JSON, which means it's probably hardwired into snowflake itself.
In that case, the only way to handle this would likely be in the Python client. In theory, you can do this by subclassing JSONDecoder and providing a custom scan_once attribute.
Here's a Python implementation of scan_once.
It looks like #1379 would address this issue.
@cpcloud thank you for your very deep triage and super meaningful bug report! ❤️
Your timing couldn't be any more spot on 😄 As you have already found we have recently found the very same issue (hence #1379 ) . It's actually caused by sparse arrays.
It seems like undefined is being co-opted as the value for SQL's NULL inside JSON, which means it's probably hardwired into snowflake itself.
So null means nothing, while undefined means missing value. In a sparse array usually missing means the default/most common value I believe. So they are not the same.
In that case, the only way to handle this would likely be in the Python client. In theory, you can do this by subclassing JSONDecoder and providing a custom scan_once attribute.
👀 #1371 #1379
I'd say that passing null to array_construct should NOT be interpreted as undefined (or at least there should be an easier way to add null to an array than the workaround you found). The only other workaround that we found was to use use parse_json on a json string instead of using array_construct. We are double-checking with someone from the server-side team (although this behavior is documented here). Please stay tuned!
@sfc-gh-mkeller I have done some prototyping with ibis against #1371 and the experience is much improved: arrays, objects and variants are usable with #1371 and extremely difficult to use without. #1379 would also work and is probably the way to go, but I'd be happy with either of those.
If anyone comes to the issue wondering how to workaround this, we've implemented full support for decoding array, object and variant types over in the ibis project and its Snowflake backend
Here's an example of using those types:
In [7]: con = ibis.connect(os.environ["SNOWFLAKE_URL"])
In [8]: con.tables
Out[8]:
Tables
------
- array_types
- awards_players
- batting
- diamonds
- functional_alltypes
- json_t
- map
- struct
- win
In [9]: con.tables.array_types
Out[9]:
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ x ┃ y ┃ z ┃ grouper ┃ scalar_column ┃ multi_dim ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<json> │ array<json> │ array<json> │ string │ float64 │ array<json> │
├──────────────────────┼──────────────────────┼──────────────────────┼─────────┼───────────────┼────────────────────────┤
│ [1, 2, ... +1] │ ['a', 'b', ... +1] │ [1, 2, ... +1] │ a │ 1.0 │ [[...], [...], ... +1] │
│ [4, 5] │ ['d', 'e'] │ [4, 5] │ a │ 2.0 │ [] │
│ [6, None] │ ['f', None] │ [6, None] │ a │ 3.0 │ [None, [...], ... +1] │
│ [None, 1, ... +1] │ [None, 'a', ... +1] │ [] │ b │ 4.0 │ [[...], [...], ... +2] │
│ [4, None, ... +2] │ ['d', None, ... +2] │ [4, None, ... +2] │ c │ 6.0 │ [[...]] │
│ [2, None, ... +1] │ ['b', None, ... +1] │ ∅ │ b │ 5.0 │ ∅ │
└──────────────────────┴──────────────────────┴──────────────────────┴─────────┴───────────────┴────────────────────────┘
In [10]: con.tables.array_types.x[0]
Out[10]:
┏━━━━━━━━━━━━━━━━━━┓
┃ ArrayIndex(x, 0) ┃
┡━━━━━━━━━━━━━━━━━━┩
│ json │
├──────────────────┤
│ 1.0 │
│ 4.0 │
│ 6.0 │
│ ∅ │
│ 4.0 │
│ 2.0 │
└──────────────────┘
In [11]: con.tables.array_types.x[0].cast("int").sum()
Out[11]: 17
Assigning to @sfc-gh-mkeller to update post internal convos. This is more of a feature request due to how the backend has been implemented so identifying it as such