ibis
ibis copied to clipboard
feat(bigquery): `ArrayValue.as_table(offset_name: str | None)` as a table-valued function
Is your feature request related to a problem?
It's difficult for me to do some things that I'm used to doing in BigQuery, such as UNNEST([...]) on a bunch of literal values or a generated array, similar to how memtable worked in Ibis 6.x.
Also, the current implementation of unnest() doesn't work on deeply nested arrays of structs of arrays in the current implementation. An alternative API closer to how BQ semantics work for unnest would be really useful, IMO.
Describe the solution you'd like
As an alternative to the current unnest() semantics, which transforms an ArrayColumn into XColumn, I'd like to see ArrayValue.as_table() which works similarly to UNNEST in BigQuery, where it's primarily used in a FROM clause, often in a correlated join but not always.
I suspect the "treat this unnested array as a table with/without offsets" could be doable in other engines besides BigQuery, especially if correlated joins are ignored for now.
What version of ibis are you running?
7.1.0
What backend(s) are you using, if any?
BigQuery
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Edit: Updated to as_table to reflect similar function in StructValue.
There's a few potential use cases embedded in this request that would be unlocked. I use the following randomly-generated newline-delimited JSON file to demonstrate these use cases:
https://gist.github.com/tswast/f27c1a6082c54150e6353c9f6a2bd423
import ibis
bq = ibis.bigquery.connect(project_id="swast-scratch")
table = bq.table("swast-scratch.ibis7781_array_as_table.doubly_nested")
include offset so that ordering can be maintained post-unnest
It's not possible to express that I want to save the original order, such as with the WITH OFFSET AS clause for UNNEST in BigQuery. With the current version of unnest, it only returns a single column. I suppose it would be possible to add a with_offset parameter here, but then the return type might have to be a struct column or maybe a tuple of two columns.
SELECT
event_order_id, event
FROM `swast-scratch.ibis7781_array_as_table.doubly_nested` t0
CROSS JOIN
UNNEST(t0.event_sequence) AS event
WITH OFFSET AS event_order_id
ORDER BY
customer_id, day, flag, event_order_id
Proposed interface:
events = table.event_sequence.as_table(offset_name="event_order_id")
joined = table.cross_join(events)
result = joined[
events["event_order_id"], events["event"]
].order_by([
ibis.asc(table.customer_id),
ibis.asc(table.day),
ibis.asc(table.flag),
ibis.asc(events.event_order_id),
])
unnest deeply nested columns
The following should be possible, but currently fails.
level_1 = table.event_sequence.unnest()
level_2 = level_1["data"].unnest()
print(bq.compile(level_2))
bq.execute(level_2)
Fails with BadRequest: 400 Syntax error: Expected ")" but got keyword UNNEST at [3:114], and it generates the following SQL:
SELECT
IF(pos = pos_2, `data`, NULL) AS `data`
FROM `swast-scratch`.ibis7781_array_as_table.doubly_nested AS t0, UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(UNNEST(t0.`event_sequence`).`data`)) - 1)) AS pos
CROSS JOIN UNNEST(UNNEST(t0.`event_sequence`).`data`) AS `data` WITH OFFSET AS pos_2
WHERE
pos = pos_2
OR (
pos > (
ARRAY_LENGTH(UNNEST(t0.`event_sequence`).`data`) - 1
)
AND pos_2 = (
ARRAY_LENGTH(UNNEST(t0.`event_sequence`).`data`) - 1
)
)
I believe representing the array as a table expression would allow for a more direct translation to BigQuery SQL.
Proposed interface:
events = table.event_sequence.as_table()
level_1 = table.cross_join(events)[
table["customer_id"], events["data"], events["timestamp"]
]
data = level_1["data"].as_table()
level_2 = level_1.cross_join(data)[
level_1["customer_id"], level_1["timestamp"], data["key"], data["value"]
]
This would generate SQL like the following:
SELECT
customer_id,
event_timestamp,
data.key AS `data_key`,
data.value AS `data_value`
FROM (
SELECT
customer_id,
event.timestamp AS `event_timestamp`,
DATA
FROM
`swast-scratch.ibis7781_array_as_table.doubly_nested` t0
CROSS JOIN
UNNEST(t0.event_sequence) AS event ) t1
CROSS JOIN
UNNEST(t1.data) AS data
Or more simply, all in one cross join:
events = table.event_sequence.as_table()
data = events["data"].as_table()
result = table.cross_join(events, data)[
table["customer_id"], events["timestamp"], data["key"], data["value"]
]
Which would generate the following SQL:
SELECT
customer_id,
event.timestamp as `event_timestamp`,
data.key as `data_key`,
data.value as `data_value`
FROM `swast-scratch.ibis7781_array_as_table.doubly_nested` t0
CROSS JOIN
UNNEST(t0.event_sequence) AS event
CROSS JOIN
UNNEST(event.data) AS data
using array literals as an alternative to memtable for cases where we explicitly want to embed the data in SQL
Previously (ibis 6.x), using memtable would generate SQL like the following:
SELECT
t0.*
FROM UNNEST(
ARRAY<STRUCT<`Column One` INT64, `Column 2` STRING>>[
STRUCT(1, 'hello'),
STRUCT(2, 'world'),
STRUCT(3, '!')
]) AS t0
Now memtable loads the data to a temporary table in the anonymous dataset, which is more flexible with regards to larger input sizes, but less efficient for very small arrays like this one. With the proposed API, we should be able to generate this SQL again with the following syntax:
array = ibis.array([
ibis.struct({"Column 1": 1, "Column 2": "hello"}),
ibis.struct({"Column 1": 2, "Column 2": "world"}),
ibis.struct({"Column 1": 3, "Column 2": "!"}),
])
table = array.as_table()
Note: this currently fails with unnest(), as there's no table expression in the tree:
bq.execute(array.unnest())
File /opt/miniconda3/envs/scratch/lib/python3.10/site-packages/ibis/expr/types/generic.py:1174, in Value.as_table(self)
1168 if len(roots) > 1:
1169 raise com.RelationError(
1170 f"Cannot convert {type(self)} expression "
1171 "involving multiple base table references "
1172 "to a projection"
1173 )
-> 1174 table = roots[0].to_expr()
1175 return table.select(self)
IndexError: list index out of range
One more use case:
keep rows after unnest with empty arrays
It's not always desired to eliminate rows where there are no values in the array. As seen in https://github.com/ibis-project/ibis/pull/7590, the default in pandas and snowflake is to preserve these empty arrays as NULL.
By treating arrays as tables, it's possible to keep rows with empty arrays by doing a LEFT JOIN instead of a CROSS JOIN.
SELECT
customer_id, day, event.timestamp as `event_timestamp`
FROM `swast-scratch.ibis7781_array_as_table.doubly_nested` t0
LEFT JOIN
UNNEST(t0.event_sequence) AS event
Returns 284 rows, compared to 268 with CROSS JOIN.
Proposed interface:
events = table.event_sequence.as_table()
joined = table.left_join(events)
result = joined[
table["customer_id"],
events["event"]["timestamp"],
]
array = ibis.array([ ibis.struct({"Column 1": 1, "Column 2": "hello"}), ibis.struct({"Column 1": 2, "Column 2": "world"}), ibis.struct({"Column 1": 3, "Column 2": "!"}), ]) table = array.as_table()Note: this currently fails with
unnest(), as there's no table expression in the tree:bq.execute(array.unnest())File /opt/miniconda3/envs/scratch/lib/python3.10/site-packages/ibis/expr/types/generic.py:1174, in Value.as_table(self) 1168 if len(roots) > 1: 1169 raise com.RelationError( 1170 f"Cannot convert {type(self)} expression " 1171 "involving multiple base table references " 1172 "to a projection" 1173 ) -> 1174 table = roots[0].to_expr() 1175 return table.select(self) IndexError: list index out of range
This shouldn't be a problem anymore (tried with both TES and main) since unbound scalar expressions are turned into a dummy table:
In [7]: array.as_table()
Out[7]:
DummyTable
ArrayColumn() array<struct<Column 1: int8, Column 2: string>>
I think this use case should be supported.
Let me know if there is more to do here, especially if this is the only blocker for you to migrate then I can prioritize this.
@kszucs Thanks for following up this issue. You're right. The bq.execute(array.unnest()) issue has been fixed.
However, we still needs other user cases to be supported. Here, @tswast is trying to propose a new API to fill the gaps of current unnest function. Below are my understanding from the BigQuery SQL aspect. Hope that can help you understand the issue as well.
Taken the array_types table from the test_unnest_no_nulls ibis test as example, ibis.to_sql(array_types.x.unnest()) can generate the BigQuery SQL as below:
SELECT
IF(pos = pos_2, `x`, NULL) AS `x`
FROM `bigframes-dev`.ibis_gbq_testing.array_types AS t0
CROSS JOIN UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(t0.`x`)) - 1)) AS pos
CROSS JOIN UNNEST(t0.`x`) AS `x` WITH OFFSET AS pos_2
WHERE
pos = pos_2
OR (
pos > (
ARRAY_LENGTH(t0.`x`) - 1
) AND pos_2 = (
ARRAY_LENGTH(t0.`x`) - 1
)
)
The above SQL is generated by sqlglot by converting EXPLODE to UNNEST (related ibis commit). The SQL is a little bit nagging and can be simplified as below for better explaining:
SELECT x
FROM `bigframes-dev`.ibis_gbq_testing.array_types AS t0
CROSS JOIN UNNEST(t0.`x`) AS `x`
The mapped SQL for the proposed user case would be:
- Case 1: include offset so that ordering can be maintained post-unnest
The expected SQL would be (adding
WITH OFFSET AS pos):
SELECT x, pos
FROM `bigframes-dev`.ibis_gbq_testing.array_types AS t0
CROSS JOIN UNNEST(t0.`x`) AS `x`
WITH OFFSET AS pos
-
Case 2: unnest deeply nested columns Skip here as
array_typesdoes not have nested columns. -
Case 3: using array literals as an alternative to memtable for cases where we explicitly want to embed the data in SQL The issue is gone in the
unnest()API, but still need to be supported in the `as_table() API. -
Case 4: keep rows after unnest with empty arrays The expected SQL would be (switching
CROSS JOINtoLEFT JOIN):
SELECT x
FROM `bigframes-dev`.ibis_gbq_testing.array_types AS t0
LEFT JOIN UNNEST(t0.`x`) AS `x`
As I asked in the Zulip, I am trying to implement the proposed API but need some helps. Thanks for your patience here!