ibis feat(bigquery): Include `offset` parameter to unnest

Is your feature request related to a problem?

It's not possible to express that I want to save the original order, such as with the WITH OFFSET AS clause for UNNEST in BigQuery. With the current version of unnest, it only returns a single column. I suppose it would be possible to add a with_offset parameter here, but then the return type might have to be a struct column or maybe a tuple of two columns.

This issue has the same request as #7781

Describe the solution you'd like

I'd like to add an offset parameter to the unnest function. When offset is True, the compiler can returns a sge.Posexplode() rather than sge.Explode() node. Then, sqlglot can handle the SQL translation as expected. Here is the draft PR: https://github.com/ibis-project/ibis/pull/8354/files With that change, the translated SQL looks good to me:

In [1]: import ibis
   ...: bq = ibis.bigquery.connect(project_id="bigframes-dev")
   ...: table = bq.table("bigframes-dev.ibis_gbq_testing.array_types")
   ...: ibis.to_sql(table.z.unnest(offset=True), dialect="bigquery")
Out[1]: 
SELECT
  IF(pos = pos_2, `z`, NULL) AS `z`,
  IF(pos = pos_2, pos_2, NULL) AS pos_2
FROM `bigframes-dev`.`ibis_gbq_testing`.`array_types` AS `t0`
CROSS JOIN UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(`t0`.`z`)) - 1)) AS pos
CROSS JOIN UNNEST(`t0`.`z`) AS `z` WITH OFFSET AS pos_2
WHERE
  pos = pos_2
  OR (
    pos > (
      ARRAY_LENGTH(`t0`.`z`) - 1
    ) AND pos_2 = (
      ARRAY_LENGTH(`t0`.`z`) - 1
    )
  )

However, the ibis node type is not correct. The ibis assumes this node only returns a single value and ignore the pos_2 here. Here is the repro code:

In [3]: type(table.z.unnest(offset=True))
Out[3]: ibis.expr.types.numeric.FloatingColumn

In [7]: table.select(["scalar_column", table.z.unnest(offset=True)])
Out[7]: 
r0 := DatabaseTable: array_types
  x             array<!int64>
  y             array<!string>
  z             array<!float64>
  grouper       string
  scalar_column float64

Project[r0]
  scalar_column: r0.scalar_column
  z:             Unnest(r0.z, offset=True)

In [10]: table.select(["scalar_column", table.z.unnest(offset=True)]).pos_2
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[10], line 1
----> 1 table.select(["scalar_column", table.z.unnest(offset=True)]).pos_2

File ~/src/forkibis2/ibis/expr/types/relations.py:859, in Table.__getattr__(self, key)
    854     hint = common_typos[key]
    855     raise AttributeError(
    856         f"{type(self).__name__} object has no attribute {key!r}, did you mean {hint!r}"
    857     )
--> 859 raise AttributeError(f"'Table' object has no attribute {key!r}")

AttributeError: 'Table' object has no attribute 'pos_2'

Any suggestions to correct the change so that ibis can return pos_2 as the offset?

What version of ibis are you running?

8.0.0

What backend(s) are you using, if any?

BigQuery

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Feb 14 '24 18:02 chelsea-lin

Let's work through the API design first before jumping into compiled SQL.

The problem we need to solve to expose offset functionality may or may not have anything to do with the details of the compiled SQL.

Feb 15 '24 15:02 cpcloud

Thinking about this some more I think this functionality warrants a new API.

I'll propose one: ArrayValue.enumerate(start=0).

This API would be analogous to Python's built-in function enumerate which accepts an Iterable[T] and returns Iterable[tuple[int, T]].

We'd do something similar, so the full signature of enumerate would be (using pseudocode):

# for ArrayValue[T]
def enumerate(self, start: int = 0) -> dt.StructValue[index: int, value: T]:
    ...

Thoughts?

We should probably experiment with two different flavors of backends (BigQuery and DuckDB) to determine feasibility of compilation here.

Feb 15 '24 15:02 cpcloud

While the proposed enumerate is familier to Python users, it is important to consider how this will translate into SQL and work with database execution. Let's discuss a few concerns:

Efficiency: If a user calls table.x.enumerate(0) and table.x.enumerate(1) in succession, could this generate two SQL queries to be executed?
What's the equivalent API to table.select([table["grouper"], table.x.unnest(offset=True)])?
Implement difficulty: I feel it's not easy to implement it (maybe I am too new to ibis expression tree). Currently, ibis expression tree presents table.arr.unnest() AS SELECT EXPLODE(arr) FROM table and let explode_to_unnest sqlglot.transformer to compile it into SELECT UNNEST(arr) FROM table CROSS JOIN UNNEST(arr)(not exactly but equivalent).

Feb 15 '24 18:02 chelsea-lin

In case you didn't see that, @tswast also proposed a new API at #7781 calling ArrayTable.as_table(offset) if we want to keep current unnest API unchanged.

Feb 15 '24 18:02 chelsea-lin

If we want to continue the implementation for unnest(offset=True) API, I am trying to find a solution to return a struct in sqlglot as you commented at #8354 The solution is asking ibis compiler returns SELECT (SELECT AS STRUCT POSEXPLODE(arr)) FROM table and sqlglot can convert it into SELECT (SELECT AS STRUCT x, pos) FROM table FROM table CROSS JOIN UNNEST(x). But sqlglot does not work as expected, and I am trying to thinking about the solution.

Feb 15 '24 18:02 chelsea-lin

driveby-ing: def enumerate(arr: ArrayValue) feels very close to def zip(arr: ArrayValue, *others: ArrayValue), and I have wanted zip before but did workarounds.

Feb 15 '24 23:02 NickCrews

@NickCrews We do have array zip, not sure if you've seen that: https://ibis-project.org/reference/expression-collections#ibis.expr.types.arrays.ArrayValue.zip

Feb 16 '24 00:02 cpcloud

@chelsea-lin I'll try to address your concerns here and provide a path forward:

Efficiency: If a user calls table.x.enumerate(0) and table.x.enumerate(1) in succession, could this generate two SQL queries to be executed?

Yes. Is that different from any other expression that takes input arguments?

What's the equivalent API to table.select([table["grouper"], table.x.unnest(offset=True)])?

table.select("grouper", table.x.enumerate())

Implement difficulty: I feel it's not easy to implement it (maybe I am too new to ibis expression tree). Currently, ibis expression tree presents table.arr.unnest() AS SELECT EXPLODE(arr) FROM table and let explode_to_unnest sqlglot.transformer to compile it into SELECT UNNEST(arr) FROM table CROSS JOIN UNNEST(arr)(not exactly but equivalent).

Start by implementing the enumerate API and don't worry about sqlglot until you can produce an Ibis expression.

Here's the approach I would take:

Define a new ops.Value subclass (class Enumerate(Value)) in ibis/expr/operations/arrays.py
Define an enumerate method on ArrayValue (in ibis/expr/types/arrays.py) that constructs ops.Enumerate(...).to_expr()
Figure out the SQL syntax that corresponds to the semantics of the operation
Figure out the sqlglot objects that correspond to that SQL syntax.

Feb 21 '24 18:02 cpcloud

@cpcloud thanks! The implementation looks good overall. My remaining question is about ops.Enumerate. It needs to return a dt.Struct and expect a sqlglot expression equaling to STRUCT(POSEXPLODE(arr)). Unfortunately, this cannot be directly converted to BigQuery syntax correctly (please check tobymao/sqlglot#2978). Also the issue is not reasonable for sqlglot because this is also not a correct spark syntax. This is preventing class Enumerate(Value) from producing the correct SQL. I am not familiar with spark sql, so do you have any thoughts here? Also, if we cannot find an equal sqlglot expression, could using class Enumerate(Relation) be a better solution?

Feb 22 '24 00:02 chelsea-lin

@cpcloud Trying to simple my previous question. If sqlglot cannot return a struct including arrays' items and positions, which dtype ops.Enumerate can return to respect the two columns from the generated SQL?

Feb 22 '24 18:02 chelsea-lin

Ok, after experimenting in #8429, I don't think enumerate implemented like this is possible.

I'm going to take a different approach, and experiment with adding a Table.unnest(column) method instead, that will be the "tabular" unnest implementation for Ibis.

Feb 22 '24 19:02 cpcloud

Thanks @cpcloud for prioritizing the offset parameter! While working on the Table.unnest(column) implementation, could you please also consider another feature requests to add preserved_empty parameter. This would ensure empty arrays are retained as null after the unnest operation. As for the implementation, we can replace sqlglot expression from EXPLODE(x) into EXPLODE_OUTER(x), or POSEXPLODE(x) into POSEXPLODE_OUTER(x) if offset=True sets. Thanks again here!

Feb 22 '24 23:02 chelsea-lin

@cpcloud We've found a workaround solution (reliant on the feature in #8892). This code effectively unnests two columns sequentially, includes an offset ID, and handles empty arrays as null. Thanks to @TrevorBergeron for the helpful ideas!

import ibis
bq = ibis.bigquery.connect(project_id="bigframes-dev")
table = bq.table("bigframes-dev.ibis_gbq_testing.array_types")

offset_array_id = "offset_array_0"
offset_array = ibis.generate_array(
    0, ibis.greatest(
        0, 
        ibis.least(table.x.length() - 1, table.y.length() - 1)
    )).name(offset_array_id)
table_w_offset_array = table.select(offset_array, table.x, table.y)

unnest_offset_id = "unnest_offset_0"
unnest_offset = table_w_offset_array[offset_array_id].unest().name(unnest_offset_id)
table_w_offset = table_w_offset_array.select(unnest_offset, table.x, table.y)

table_w_unnest = table_w_offset.select(
    table_w_offset[unnest_offset_id].name(unnest_offset_id),
    table_w_offset.x[table_w_offset[unnest_offset_id]].name("x"),
    table_w_offset.y[table_w_offset[unnest_offset_id]].name("y"),
)

Apr 04 '24 17:04 chelsea-lin

This was closed by #9423 (Table.unnest), which supports an offset parameter.

Jul 26 '24 15:07 cpcloud