ibis
ibis copied to clipboard
feat: `x.isin(<contains NULLs>)` should be able to return False
Is your feature request related to a problem?
Related to https://github.com/ibis-project/ibis/issues/8079. In that issue, we settled on a defined behavior. Unfortunately, I think we settled on an unintuitive one, I think mostly due to that being the default behavior in most SQL engines.
For example, currently in pseudocode 1 in [2, 3, NULL] returns NULL, but I would expect it to return False.
In general, I expect that iff x is null, then x.isin(
There is one possible explanation where 1 in [2,3, NULL] should be equivalent to 1=2 OR 1=3 OR 1=NULL, and following that explanation, the current implementation makes sense.
IDK, are there other examples where ibis diverges from normal SQL semantics? Are there other examples where ibis follows SQL semantics, when there is some alternative that would be sane/better for some reason?
At the least, I just wanted to flag this, and have there be a public issue where we wrote down our reasoning and decision.
import ibis
def _drop_nulls(val):
if isinstance(val, ibis.Column):
return val.name("foo").as_table().filter(ibis._.foo.notnull()).foo
if isinstance(val, ibis.ir.ArrayValue):
return val.filter(ibis._.notnull())
arr = ibis.literal(val)
return _drop_nulls(arr)
def isin(needle, haystack):
return needle.isin(_drop_nulls(haystack))
def notin(needle, haystack):
return needle.notin(_drop_nulls(haystack))
t = ibis.memtable({"a": [1, 2, None], "b": [1, 3, None]})
t.mutate(
a_in_b=t.a.isin(t.b),
a_in_b_expected=isin(t.a, t.b),
a_notin_b=t.a.notin(t.b),
a_notin_b_expected=notin(t.a, t.b),
a_in_const=t.a.isin([1, 3, None]),
a_in_const_expected=isin(t.a, [1, 3, None]),
)
┏━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ a ┃ b ┃ a_in_b ┃ a_in_b_expected ┃ a_notin_b ┃ a_notin_b_expected ┃ a_in_const ┃ a_in_const_expected ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ float64 │ float64 │ boolean │ boolean │ boolean │ boolean │ boolean │ boolean │
├─────────┼─────────┼─────────┼─────────────────┼───────────┼────────────────────┼────────────┼─────────────────────┤
│ 1.0 │ 1.0 │ True │ True │ False │ False │ True │ True │
│ 2.0 │ 3.0 │ NULL │ False │ NULL │ True │ NULL │ False │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
└─────────┴─────────┴─────────┴─────────────────┴───────────┴────────────────────┴────────────┴─────────────────────┘
pandas does it's own thing and treats NULLs like their own distinct value, no different from any other, so that is PARTIALLY what I would expect.
import pandas as pd
df = pd.DataFrame({"a": [1, 2, None], "b": [1, 3, None]})
df["a_in_b"] = df.a.isin(df.b)
df["a_notin_b"] = ~df.a.isin(df.b)
print(df.to_markdown())
| a | b | a_in_b | a_notin_b | |
|---|---|---|---|---|
| 0 | 1 | 1 | True | False |
| 1 | 2 | 3 | False | True |
| 2 | nan | nan | True | False |
polars has exactly the semantics I would expect.
import polars as pl
df = pl.DataFrame({"a": [1, 2, None], "b": [1, 3, None]})
df = df.with_columns(
a_in_b=pl.col("a").is_in(pl.col("b")),
a_notin_b=~pl.col("a").is_in(pl.col("b")),
)
print(df)
┌──────┬──────┬────────┬───────────┐
│ a ┆ b ┆ a_in_b ┆ a_notin_b │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ bool ┆ bool │
╞══════╪══════╪════════╪═══════════╡
│ 1 ┆ 1 ┆ true ┆ false │
│ 2 ┆ 3 ┆ false ┆ true │
│ null ┆ null ┆ null ┆ null │
└──────┴──────┴────────┴───────────┘
What is the motivation behind your request?
I am trying to do something like
existing = conn.table("my_table") # id column can be null
new = conn.read_parquet(...)
not_seen_before = new.filter(_.id.notin(existing.id))
conn.insert("my_table", not_seen_before)
but this currently results in not_seen_before always being empty, where I want it to sometimes have some rows.
Describe the solution you'd like
I think something like the implementation that I have in the repro.
What version of ibis are you running?
main
What backend(s) are you using, if any?
duckdb, but it should be
Code of Conduct
- [x] I agree to follow this project's Code of Conduct