ibis icon indicating copy to clipboard operation
ibis copied to clipboard

fix: disallow empty struct

Open chloeh13q opened this issue 1 year ago • 1 comments

Description of changes

Disallow empty struct in type inference. Pyarrow allows the construction of empty structs but duckdb doesn't.

I think what the user in #8460 was attempting to create was an empty value in a struct column that has defined fields. You can accomplish this with the following:

>>> import ibis
>>>
>>> ibis.options.interactive = True
>>>
>>> t = ibis.memtable(
...             [
...                 {
...                     "year": "2024",
...                     "period": "M01",
...                     "periodName": "January",
...                     "latest": "true",
...                     "value": "3.7",
...                     "footnotes": None,
...                 }
...             ],
...             schema={"year": "string", "period": "string", "periodName": "string", "latest": "string", "value": "string", "footnotes": "struct<key: string>"}
...         )
>>> t
┏━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┓
┃ year   ┃ period ┃ periodName ┃ latest ┃ value  ┃ footnotes ┃
┡━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━┩
│ string │ string │ string     │ string │ string │ struct<k… │
├────────┼────────┼────────────┼────────┼────────┼───────────┤
│ 2024   │ M01    │ January    │ true   │ 3.7    │ NULL      │
└────────┴────────┴────────────┴────────┴────────┴───────────┘

I do want to note here that ibis interprets {} as a struct whose keys point to empty values rather than a struct that is empty:

>>> t = ibis.memtable(
...             [
...                 {
...                     "year": "2024",
...                     "period": "M01",
...                     "periodName": "January",
...                     "latest": "true",
...                     "value": "3.7",
...                     "footnotes": {},
...                 }
...             ],
...             schema={"year": "string", "period": "string", "periodName": "string", "latest": "string", "value": "string", "footnotes": "struct<key: string>"}
...         )
>>> t
┏━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ year   ┃ period ┃ periodName ┃ latest ┃ value  ┃ footnotes            ┃
┡━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ string │ string │ string     │ string │ string │ struct<key: string>  │
├────────┼────────┼────────────┼────────┼────────┼──────────────────────┤
│ 2024   │ M01    │ January    │ true   │ 3.7    │ {'key': None}        │
└────────┴────────┴────────────┴────────┴────────┴──────────────────────┘

Issues closed

#8460

chloeh13q avatar Apr 03 '24 07:04 chloeh13q

It seems to me like {} does not properly follow the schema of "struct<key: string>", and so ibis.literal({}, type="struct<key: string>") should just fail. The user's input is malformed, and they will need to deal with it external to ibis. Or, perhaps they could use a map<string, string> type instead, which properly represents a datatype with a variable number of keys.

IDK, maybe we could make ibis lenient, but it feels like a slippery slope. If we allow this, then it feels like we should make it so that ibis.struct({"a": "foo"}).cast("struct<a: string, b:int>") to succeed, and result in {"a": "foo", "b": None} (ie if you cast something to a struct, anytime a field isn't present, fill it in with NULL). This feels OK to me, but I don't think backends have good builtin support for this, we would have to add workarounds all over the place.

NickCrews avatar Apr 03 '24 18:04 NickCrews