ibis icon indicating copy to clipboard operation
ibis copied to clipboard

refactor(api): restrict arbitrary input nesting

Open kszucs opened this issue 1 year ago • 22 comments

Previously we allowed arbitrary nesting of input expressions for various API methods like table.join, table.group_by, etc. While this was backward compatible with 8.0.0 it also exposes additional ambiguity which can be confusing for users and difficult to reason about.

This change restricts the nesting of input expressions to a "single level" of nesting with the exception of the first positional argument which can be a list of expressions, but only if there are no more positional arguments. Examples:

t.select([t.foo, t.bar])  # OK
t.select(t.foo, t.bar)    # OK
t.select([t.foo], t.bar)  # Error
t.select(t.foo, name=t.bar)  # OK
t.select([t.foo], name=t.bar)  # OK

Aims to resolve https://github.com/ibis-project/ibis/issues/7819 and https://github.com/ibis-project/ibis/issues/8304, depends on https://github.com/ibis-project/ibis/pull/8884

kszucs avatar Apr 09 '24 12:04 kszucs

There are some ambiguities around the first arguments which can be a list of expressions:

table.select(["a", ["b"]])

# constructs

r0 := UnboundTable: table
  a int8
  b int16
  c int32
  d int64
  e float32
  f float64
  g string
  h boolean
  i timestamp
  j date
  k time

Project[r0]
  a:      r0.a
  ('b',): ['b']

a is interepreted as a column whereas ["b"] interpreted as an array literal.

kszucs avatar Apr 09 '24 12:04 kszucs

Apparently we also have a backend test case using a mapping as a single positional argument.

kszucs avatar Apr 09 '24 12:04 kszucs

Why can't we keep

t.select([t.foo], t.bar)

?

It doesn't seem meaningfully different from allowing

t.select([t.foo], bar=t.bar)

cpcloud avatar Apr 12 '24 11:04 cpcloud

If we allow:

t.select([t.foo], t.bar)

then should we allow

t.select([t.foo], [t.bar])

as well?

kszucs avatar Apr 12 '24 13:04 kszucs

I think so. It's really things like t.select([[...]], [[...]]) that should be disallowed.

cpcloud avatar Apr 12 '24 13:04 cpcloud

My preference would be to either provide args as variadics or as a single list-like argument. Even then we have confusion when to treat list-like inputs as literals vs value list.

kszucs avatar Apr 12 '24 14:04 kszucs

We talked today during triage and decided that at minimum we want to remove the support for arbitrarily nested expressions (as Phillip noted above).

I actually agree with the initial proposal at the top (only accept lists of column refs if select is a 1-arg function, otherwise treat it as variadic). Coincidentally this matches how polars select works, so there's some ecosystem-consistency there. The trick with doing this as a breaking change though is that previously working code may instead silently be treated as an array literal.

With this PR, what happens here? Does this error, or does it silently coerce these to array literals?

t.select(["a", "b"], ["c", "d"])  # is this two array literals or an error?

t.select("a", "b", ["c", "d"])   # is ["c", "d"] an array literal, or does this error?

Either way, I think we definitely want to forbid the case that phillip brought up before 9.0. I'm not opposed to the full changes here either, I think they make things more consistent, but we can always do that later IMO in 10.0.

jcrist avatar Apr 15 '24 18:04 jcrist

t.select(["a", "b"], ["c", "d"])  # is this two array literals or an error?

yes, two array literals, not erroring

t.select("a", "b", ["c", "d"])   # is ["c", "d"] an array literal, or does this error?

yes, an array literal, not erroring

kszucs avatar Apr 17 '24 11:04 kszucs

I would prefer deciding what should be the ideal long term behaviour then extend that to be backward compatible and either raise a deprecation warning or provide a config option to keep the previous behaviour.

kszucs avatar Apr 17 '24 14:04 kszucs

Let's keep the full changeset here, and make sure to have both a BREAKING CHANGE bit as well as documentation on the behavior in the relevant APIs (select and mutate).

cpcloud avatar Apr 17 '24 14:04 cpcloud

Another thing I'm not sure about is whether we should do the dereferencing here or not? Like should t.bind(totally_different_table.col) raise?

kszucs avatar Apr 19 '24 12:04 kszucs

Seems like it might keep bind more flexible and easier to maintain if we dereference elsewhere.

cpcloud avatar Apr 19 '24 12:04 cpcloud

Actually it could help code consolidation. I can try it after the dereferencer improvement gets merged:

  • #8992

kszucs avatar Apr 19 '24 12:04 kszucs

We should probably kill destructure() now that proper alternatives (lift) are available.

cpcloud avatar Apr 25 '24 17:04 cpcloud

I'm going to rip out destructure in a separate PR.

cpcloud avatar Apr 25 '24 17:04 cpcloud

I'm going to rip out destructure in a separate PR.

Updated the test case to expect an input error rather than an integrity error. We should update unwrap_aliases() to raise ibis input error instead of integrity error too.

kszucs avatar Apr 26 '24 07:04 kszucs

One confusing case:

In [4]: table = ibis.table({"a": "int", "b": "string"})

In [5]: table.select(["a", ["b"]])
Out[5]:
r0 := UnboundTable: unbound_table_1
  a int64
  b string

Project[r0]
  a:      r0.a
  ('b',): ['b']

kszucs avatar Apr 26 '24 17:04 kszucs

@NickCrews Can you review this again? This is a remaining blocker for the 9.0 release which we'd like to get out early next week.

cpcloud avatar Apr 28 '24 13:04 cpcloud

I just took a long look.

Most things look great, a few nits, but the one substantial thing is the t.bind([1,"a"]) behavior. I have an inline comment above. Can someone fill me in on the context there a little more? I don't think this is a blocker for me.

NickCrews avatar Apr 29 '24 20:04 NickCrews

@NickCrews Thanks for doing a deep dive on tests. There were indeed a few cases that weren't tested and also the lambda _: 2 case wasn't yet implemented. Appreciate the feedback!

cpcloud avatar Apr 29 '24 20:04 cpcloud

Cool this looks great then!

NickCrews avatar Apr 30 '24 04:04 NickCrews

@NickCrews Can you approve the PR?

cpcloud avatar Apr 30 '24 11:04 cpcloud

Sorry I was on the mobile app where the approval status isn't even shown so i didn't know there was anything to do

NickCrews avatar Apr 30 '24 14:04 NickCrews