ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat: `x.isin(<contains NULLs>)` should be able to return False

Open NickCrews opened this issue 7 months ago • 0 comments

Is your feature request related to a problem?

Related to https://github.com/ibis-project/ibis/issues/8079. In that issue, we settled on a defined behavior. Unfortunately, I think we settled on an unintuitive one, I think mostly due to that being the default behavior in most SQL engines.

For example, currently in pseudocode 1 in [2, 3, NULL] returns NULL, but I would expect it to return False. In general, I expect that iff x is null, then x.isin() to be null. The current behavior makes it so that some nonnull values get turned into null when sent through .isin().

There is one possible explanation where 1 in [2,3, NULL] should be equivalent to 1=2 OR 1=3 OR 1=NULL, and following that explanation, the current implementation makes sense.

IDK, are there other examples where ibis diverges from normal SQL semantics? Are there other examples where ibis follows SQL semantics, when there is some alternative that would be sane/better for some reason?

At the least, I just wanted to flag this, and have there be a public issue where we wrote down our reasoning and decision.

import ibis


def _drop_nulls(val):
    if isinstance(val, ibis.Column):
        return val.name("foo").as_table().filter(ibis._.foo.notnull()).foo
    if isinstance(val, ibis.ir.ArrayValue):
        return val.filter(ibis._.notnull())
    arr = ibis.literal(val)
    return _drop_nulls(arr)


def isin(needle, haystack):
    return needle.isin(_drop_nulls(haystack))


def notin(needle, haystack):
    return needle.notin(_drop_nulls(haystack))


t = ibis.memtable({"a": [1, 2, None], "b": [1, 3, None]})
t.mutate(
    a_in_b=t.a.isin(t.b),
    a_in_b_expected=isin(t.a, t.b),
    a_notin_b=t.a.notin(t.b),
    a_notin_b_expected=notin(t.a, t.b),
    a_in_const=t.a.isin([1, 3, None]),
    a_in_const_expected=isin(t.a, [1, 3, None]),
)
┏━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ a       ┃ b       ┃ a_in_b  ┃ a_in_b_expected ┃ a_notin_b ┃ a_notin_b_expected ┃ a_in_const ┃ a_in_const_expected ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ float64 │ float64 │ boolean │ boolean         │ boolean   │ boolean            │ boolean    │ boolean             │
├─────────┼─────────┼─────────┼─────────────────┼───────────┼────────────────────┼────────────┼─────────────────────┤
│     1.0 │     1.0 │ True    │ True            │ False     │ False              │ True       │ True                │
│     2.0 │     3.0 │ NULL    │ False           │ NULL      │ True               │ NULL       │ False               │
│    NULL │    NULL │ NULL    │ NULL            │ NULL      │ NULL               │ NULL       │ NULL                │
└─────────┴─────────┴─────────┴─────────────────┴───────────┴────────────────────┴────────────┴─────────────────────┘

pandas does it's own thing and treats NULLs like their own distinct value, no different from any other, so that is PARTIALLY what I would expect.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, None], "b": [1, 3, None]})
df["a_in_b"] = df.a.isin(df.b)
df["a_notin_b"] = ~df.a.isin(df.b)
print(df.to_markdown())
a b a_in_b a_notin_b
0 1 1 True False
1 2 3 False True
2 nan nan True False

polars has exactly the semantics I would expect.

import polars as pl

df = pl.DataFrame({"a": [1, 2, None], "b": [1, 3, None]})
df = df.with_columns(
    a_in_b=pl.col("a").is_in(pl.col("b")),
    a_notin_b=~pl.col("a").is_in(pl.col("b")),
)
print(df)
┌──────┬──────┬────────┬───────────┐
│ a    ┆ b    ┆ a_in_b ┆ a_notin_b │
│ ---  ┆ ---  ┆ ---    ┆ ---       │
│ i64  ┆ i64  ┆ bool   ┆ bool      │
╞══════╪══════╪════════╪═══════════╡
│ 1    ┆ 1    ┆ true   ┆ false     │
│ 2    ┆ 3    ┆ false  ┆ true      │
│ null ┆ null ┆ null   ┆ null      │
└──────┴──────┴────────┴───────────┘

What is the motivation behind your request?

I am trying to do something like

existing = conn.table("my_table")  # id column can be null
new = conn.read_parquet(...)
not_seen_before = new.filter(_.id.notin(existing.id))
conn.insert("my_table", not_seen_before)

but this currently results in not_seen_before always being empty, where I want it to sometimes have some rows.

Describe the solution you'd like

I think something like the implementation that I have in the repro.

What version of ibis are you running?

main

What backend(s) are you using, if any?

duckdb, but it should be

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

NickCrews avatar Apr 24 '25 00:04 NickCrews