ibis
ibis copied to clipboard
feat(bigquery): Include `offset` parameter to unnest
Is your feature request related to a problem?
It's not possible to express that I want to save the original order, such as with the WITH OFFSET AS clause for UNNEST in BigQuery. With the current version of unnest, it only returns a single column. I suppose it would be possible to add a with_offset parameter here, but then the return type might have to be a struct column or maybe a tuple of two columns.
This issue has the same request as #7781
Describe the solution you'd like
I'd like to add an offset parameter to the unnest function. When offset is True, the compiler can returns a sge.Posexplode() rather than sge.Explode() node. Then, sqlglot can handle the SQL translation as expected. Here is the draft PR: https://github.com/ibis-project/ibis/pull/8354/files
With that change, the translated SQL looks good to me:
In [1]: import ibis
...: bq = ibis.bigquery.connect(project_id="bigframes-dev")
...: table = bq.table("bigframes-dev.ibis_gbq_testing.array_types")
...: ibis.to_sql(table.z.unnest(offset=True), dialect="bigquery")
Out[1]:
SELECT
IF(pos = pos_2, `z`, NULL) AS `z`,
IF(pos = pos_2, pos_2, NULL) AS pos_2
FROM `bigframes-dev`.`ibis_gbq_testing`.`array_types` AS `t0`
CROSS JOIN UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(`t0`.`z`)) - 1)) AS pos
CROSS JOIN UNNEST(`t0`.`z`) AS `z` WITH OFFSET AS pos_2
WHERE
pos = pos_2
OR (
pos > (
ARRAY_LENGTH(`t0`.`z`) - 1
) AND pos_2 = (
ARRAY_LENGTH(`t0`.`z`) - 1
)
)
However, the ibis node type is not correct. The ibis assumes this node only returns a single value and ignore the pos_2 here. Here is the repro code:
In [3]: type(table.z.unnest(offset=True))
Out[3]: ibis.expr.types.numeric.FloatingColumn
In [7]: table.select(["scalar_column", table.z.unnest(offset=True)])
Out[7]:
r0 := DatabaseTable: array_types
x array<!int64>
y array<!string>
z array<!float64>
grouper string
scalar_column float64
Project[r0]
scalar_column: r0.scalar_column
z: Unnest(r0.z, offset=True)
In [10]: table.select(["scalar_column", table.z.unnest(offset=True)]).pos_2
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[10], line 1
----> 1 table.select(["scalar_column", table.z.unnest(offset=True)]).pos_2
File ~/src/forkibis2/ibis/expr/types/relations.py:859, in Table.__getattr__(self, key)
854 hint = common_typos[key]
855 raise AttributeError(
856 f"{type(self).__name__} object has no attribute {key!r}, did you mean {hint!r}"
857 )
--> 859 raise AttributeError(f"'Table' object has no attribute {key!r}")
AttributeError: 'Table' object has no attribute 'pos_2'
Any suggestions to correct the change so that ibis can return pos_2 as the offset?
What version of ibis are you running?
8.0.0
What backend(s) are you using, if any?
BigQuery
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Let's work through the API design first before jumping into compiled SQL.
The problem we need to solve to expose offset functionality may or may not have anything to do with the details of the compiled SQL.
Thinking about this some more I think this functionality warrants a new API.
I'll propose one: ArrayValue.enumerate(start=0).
This API would be analogous to Python's built-in function enumerate which accepts an Iterable[T] and returns Iterable[tuple[int, T]].
We'd do something similar, so the full signature of enumerate would be (using pseudocode):
# for ArrayValue[T]
def enumerate(self, start: int = 0) -> dt.StructValue[index: int, value: T]:
...
Thoughts?
We should probably experiment with two different flavors of backends (BigQuery and DuckDB) to determine feasibility of compilation here.
While the proposed enumerate is familier to Python users, it is important to consider how this will translate into SQL and work with database execution. Let's discuss a few concerns:
- Efficiency: If a user calls table.x.enumerate(0) and table.x.enumerate(1) in succession, could this generate two SQL queries to be executed?
- What's the equivalent API to
table.select([table["grouper"], table.x.unnest(offset=True)])? - Implement difficulty: I feel it's not easy to implement it (maybe I am too new to ibis expression tree). Currently, ibis expression tree presents
table.arr.unnest()ASSELECT EXPLODE(arr) FROM tableand letexplode_to_unnestsqlglot.transformer to compile it intoSELECT UNNEST(arr) FROM table CROSS JOIN UNNEST(arr)(not exactly but equivalent).
In case you didn't see that, @tswast also proposed a new API at #7781 calling ArrayTable.as_table(offset) if we want to keep current unnest API unchanged.
If we want to continue the implementation for unnest(offset=True) API, I am trying to find a solution to return a struct in sqlglot as you commented at #8354
The solution is asking ibis compiler returns SELECT (SELECT AS STRUCT POSEXPLODE(arr)) FROM table and sqlglot can convert it into SELECT (SELECT AS STRUCT x, pos) FROM table FROM table CROSS JOIN UNNEST(x). But sqlglot does not work as expected, and I am trying to thinking about the solution.
driveby-ing: def enumerate(arr: ArrayValue) feels very close to def zip(arr: ArrayValue, *others: ArrayValue), and I have wanted zip before but did workarounds.
@NickCrews We do have array zip, not sure if you've seen that: https://ibis-project.org/reference/expression-collections#ibis.expr.types.arrays.ArrayValue.zip
@chelsea-lin I'll try to address your concerns here and provide a path forward:
Efficiency: If a user calls table.x.enumerate(0) and table.x.enumerate(1) in succession, could this generate two SQL queries to be executed?
Yes. Is that different from any other expression that takes input arguments?
What's the equivalent API to table.select([table["grouper"], table.x.unnest(offset=True)])?
table.select("grouper", table.x.enumerate())
Implement difficulty: I feel it's not easy to implement it (maybe I am too new to ibis expression tree). Currently, ibis expression tree presents table.arr.unnest() AS SELECT EXPLODE(arr) FROM table and let explode_to_unnest sqlglot.transformer to compile it into SELECT UNNEST(arr) FROM table CROSS JOIN UNNEST(arr)(not exactly but equivalent).
Start by implementing the enumerate API and don't worry about sqlglot until you can produce an Ibis expression.
Here's the approach I would take:
- Define a new
ops.Valuesubclass (class Enumerate(Value)) inibis/expr/operations/arrays.py - Define an
enumeratemethod onArrayValue(inibis/expr/types/arrays.py) that constructsops.Enumerate(...).to_expr() - Figure out the SQL syntax that corresponds to the semantics of the operation
- Figure out the sqlglot objects that correspond to that SQL syntax.
@cpcloud thanks! The implementation looks good overall. My remaining question is about ops.Enumerate.
It needs to return a dt.Struct and expect a sqlglot expression equaling to STRUCT(POSEXPLODE(arr)). Unfortunately, this cannot be directly converted to BigQuery syntax correctly (please check tobymao/sqlglot#2978). Also the issue is not reasonable for sqlglot because this is also not a correct spark syntax. This is preventing class Enumerate(Value) from producing the correct SQL. I am not familiar with spark sql, so do you have any thoughts here?
Also, if we cannot find an equal sqlglot expression, could using class Enumerate(Relation) be a better solution?
@cpcloud Trying to simple my previous question. If sqlglot cannot return a struct including arrays' items and positions, which dtype ops.Enumerate can return to respect the two columns from the generated SQL?
Ok, after experimenting in #8429, I don't think enumerate implemented like this is possible.
I'm going to take a different approach, and experiment with adding a Table.unnest(column) method instead, that will be the "tabular" unnest implementation for Ibis.
Thanks @cpcloud for prioritizing the offset parameter! While working on the Table.unnest(column) implementation, could you please also consider another feature requests to add preserved_empty parameter. This would ensure empty arrays are retained as null after the unnest operation.
As for the implementation, we can replace sqlglot expression from EXPLODE(x) into EXPLODE_OUTER(x), or POSEXPLODE(x) into POSEXPLODE_OUTER(x) if offset=True sets.
Thanks again here!
@cpcloud We've found a workaround solution (reliant on the feature in #8892). This code effectively unnests two columns sequentially, includes an offset ID, and handles empty arrays as null. Thanks to @TrevorBergeron for the helpful ideas!
import ibis
bq = ibis.bigquery.connect(project_id="bigframes-dev")
table = bq.table("bigframes-dev.ibis_gbq_testing.array_types")
offset_array_id = "offset_array_0"
offset_array = ibis.generate_array(
0, ibis.greatest(
0,
ibis.least(table.x.length() - 1, table.y.length() - 1)
)).name(offset_array_id)
table_w_offset_array = table.select(offset_array, table.x, table.y)
unnest_offset_id = "unnest_offset_0"
unnest_offset = table_w_offset_array[offset_array_id].unest().name(unnest_offset_id)
table_w_offset = table_w_offset_array.select(unnest_offset, table.x, table.y)
table_w_unnest = table_w_offset.select(
table_w_offset[unnest_offset_id].name(unnest_offset_id),
table_w_offset.x[table_w_offset[unnest_offset_id]].name("x"),
table_w_offset.y[table_w_offset[unnest_offset_id]].name("y"),
)
This was closed by #9423 (Table.unnest), which supports an offset parameter.