snowflake-connector-python icon indicating copy to clipboard operation
snowflake-connector-python copied to clipboard

SNOW-720936: Arrays with nulls return invalid JSON for null values

Open cpcloud opened this issue 3 years ago • 8 comments

  1. What version of Python are you using?
❯ python --version --version
Python 3.10.9 (main, Dec  6 2022, 18:44:57) [GCC 11.3.0]
  1. What operating system and processor architecture are you using?
❯ python -c 'import platform; print(platform.platform())'
Linux-5.15.86-x86_64-with-glibc2.35
  1. What are the component versions in the environment (pip freeze)?
❯ poetry show snowflake-sqlalchemy
 name         : snowflake-sqlalchemy
 version      : 1.4.4
 description  : Snowflake SQLAlchemy Dialect

❯ poetry show snowflake-connector-python
 name         : snowflake-connector-python
 version      : 2.9.0
 description  : Snowflake Connector for Python
  1. What did you do?
In [19]: import sqlalchemy as sa

In [20]: con = sa.create_engine(...)

In [21]: [(row,)] = con.execute('select array_construct(1,2,null)')

In [22]: row
Out[22]: '[\n  1,\n  2,\n  undefined\n]'

In [23]: import json

In [24]: json.loads(row)
  1. What did you expect to see?

    I think I'd expect to see null (and then None from the result of the parse done by json.loads). null is a valid JSON value while undefined is not.

  2. Can you set logging to DEBUG and collect the logs?

    I don't think doing this helps, as the query succeeds.

cpcloud avatar Jan 04 '23 15:01 cpcloud

Interestingly, if I instead spell NULL as parse_json('null') then valid JSON is returned:

In [20]: [(row,)] = con.execute("select array_construct(1,2,parse_json('null'))")

In [21]: row
Out[21]: '[\n  1,\n  2,\n  null\n]'

In [22]: import json

In [23]: json.loads(row)
Out[23]: [1, 2, None]

cpcloud avatar Jan 04 '23 15:01 cpcloud

I guess this is how the nuance of SQL NULL versus the null VARIANT value manifests.

cpcloud avatar Jan 04 '23 15:01 cpcloud

It seems like undefined is being co-opted as the value for SQL's NULL inside JSON, which means it's probably hardwired into snowflake itself.

In that case, the only way to handle this would likely be in the Python client. In theory, you can do this by subclassing JSONDecoder and providing a custom scan_once attribute.

Here's a Python implementation of scan_once.

cpcloud avatar Jan 04 '23 16:01 cpcloud

It looks like #1379 would address this issue.

cpcloud avatar Jan 04 '23 16:01 cpcloud

@cpcloud thank you for your very deep triage and super meaningful bug report! ❤️

Your timing couldn't be any more spot on 😄 As you have already found we have recently found the very same issue (hence #1379 ) . It's actually caused by sparse arrays.

It seems like undefined is being co-opted as the value for SQL's NULL inside JSON, which means it's probably hardwired into snowflake itself.

So null means nothing, while undefined means missing value. In a sparse array usually missing means the default/most common value I believe. So they are not the same.

In that case, the only way to handle this would likely be in the Python client. In theory, you can do this by subclassing JSONDecoder and providing a custom scan_once attribute.

👀 #1371 #1379

I'd say that passing null to array_construct should NOT be interpreted as undefined (or at least there should be an easier way to add null to an array than the workaround you found). The only other workaround that we found was to use use parse_json on a json string instead of using array_construct. We are double-checking with someone from the server-side team (although this behavior is documented here). Please stay tuned!

sfc-gh-mkeller avatar Jan 04 '23 20:01 sfc-gh-mkeller

@sfc-gh-mkeller I have done some prototyping with ibis against #1371 and the experience is much improved: arrays, objects and variants are usable with #1371 and extremely difficult to use without. #1379 would also work and is probably the way to go, but I'd be happy with either of those.

cpcloud avatar Jan 17 '23 15:01 cpcloud

If anyone comes to the issue wondering how to workaround this, we've implemented full support for decoding array, object and variant types over in the ibis project and its Snowflake backend

Here's an example of using those types:

In [7]: con = ibis.connect(os.environ["SNOWFLAKE_URL"])

In [8]: con.tables
Out[8]:
Tables
------
- array_types
- awards_players
- batting
- diamonds
- functional_alltypes
- json_t
- map
- struct
- win

In [9]: con.tables.array_types
Out[9]:
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ x                    ┃ y                    ┃ z                    ┃ grouper ┃ scalar_column ┃ multi_dim              ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<json>          │ array<json>          │ array<json>          │ string  │ float64       │ array<json>            │
├──────────────────────┼──────────────────────┼──────────────────────┼─────────┼───────────────┼────────────────────────┤
│ [1, 2, ... +1]       │ ['a', 'b', ... +1]   │ [1, 2, ... +1]       │ a       │           1.0 │ [[...], [...], ... +1] │
│ [4, 5]               │ ['d', 'e']           │ [4, 5]               │ a       │           2.0 │ []                     │
│ [6, None]            │ ['f', None]          │ [6, None]            │ a       │           3.0 │ [None, [...], ... +1]  │
│ [None, 1, ... +1]    │ [None, 'a', ... +1]  │ []                   │ b       │           4.0 │ [[...], [...], ... +2] │
│ [4, None, ... +2]    │ ['d', None, ... +2]  │ [4, None, ... +2]    │ c       │           6.0 │ [[...]]                │
│ [2, None, ... +1]    │ ['b', None, ... +1]  │ ∅                    │ b       │           5.0 │ ∅                      │
└──────────────────────┴──────────────────────┴──────────────────────┴─────────┴───────────────┴────────────────────────┘

In [10]: con.tables.array_types.x[0]
Out[10]:
┏━━━━━━━━━━━━━━━━━━┓
┃ ArrayIndex(x, 0) ┃
┡━━━━━━━━━━━━━━━━━━┩
│ json             │
├──────────────────┤
│ 1.0              │
│ 4.0              │
│ 6.0              │
│ ∅                │
│ 4.0              │
│ 2.0              │
└──────────────────┘

In [11]: con.tables.array_types.x[0].cast("int").sum()
Out[11]: 17

cpcloud avatar Jan 25 '23 21:01 cpcloud

Assigning to @sfc-gh-mkeller to update post internal convos. This is more of a feature request due to how the backend has been implemented so identifying it as such

sfc-gh-achandrasekaran avatar Mar 29 '23 21:03 sfc-gh-achandrasekaran