ibis bug: many Operations can't handle NULL optional arguments

What happened?

consider the implementation of

@public
class Substring(Value):
    arg: Value[dt.String]
    start: Value[dt.Integer]
    length: Optional[Value[dt.Integer]] = None

This allows you to leave out the length argument, meaning "until the end of the string". But this is only represented on the python side. If the length is evaluated at runtime to be NULL, then this errors on some backends, like postgres, or gives the wrong result on duckdb.

For example, ibis.literal("abcde").substr(2, ibis.literal(1).nullif(1)) results in NULL in duckdb, but I would expect it to be "cde".

During compilation, we are naive and do the NULL checking only on the python side:

def visit_Substring(self, op, *, arg, start, length):
        start += 1
        arg_length = self.f.length(arg)

        if length is None:
            return self.if_(
                start >= 1,
                self.f.substring(arg, start),
                self.f.substring(arg, start + arg_length),
            )
        return self.if_(
            start >= 1,
            self.f.substring(arg, start, length),
            self.f.substring(arg, start + arg_length, length),
        )

I discovered this when adding more tests to #8832 .

I think what we should do is make Substring more like

@public
class Substring(Value):
    arg: Value[dt.String]
    start: Value[dt.Integer]
    length:Value[dt.Integer] = ibis.null()

and then use sql CASE statements if we can't determine the nullness statically.

but I wasn't able to get this to work. Does this seem like the right direction? Any tips on what to do here?

What version of ibis are you using?

main

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Mar 30 '24 01:03 NickCrews

None indicates that the argument wasn't provided, NULL is something else. We should avoid conflating these two uses.

Mar 30 '24 11:03 cpcloud

The failures notwithstanding, the duckdb behavior looks correct to me in this case: null inputs produce null outputs, NULL doesn't mean "no argument".

The representation of "argument not provided" is unrelated to whether an input is NULL.

Mar 30 '24 11:03 cpcloud

hmmm, I think you are right. We are victims of the SQL standard here. I don't really like this footgun:

ibis.literal("abcde").substr(2, None) gives "cde", but
ibis.literal("abcde").substr(2, ibis.null()) gives NULL The only way to get around this would be if we did a length = length.fillnull(arg.length()) during compilation, but then this would be unexpected for more experienced SQL users, and less performant. Maybe just better documentation? This same thing exists for many other ops (eg StringFind with the optional start and end), but probably these are less used.

Mar 30 '24 22:03 NickCrews

I think there are other approaches to the problem. I'll state the problem so that we have it written somewhere.

Ibis uses None for two things: 1) the default value for optional arguments and 2) a value that users can use for SQL's NULL.

Users explicitly passing None expect it to behave like they explicitly passed ibis.null() or an equivalent such as ibis.NA.

Since None is being co-opted for use as the "no-argument" sentinel value, Ibis interprets it that way instead of as NULL, leading to a divergence in behavior between the user-facing meaning of None versus how we use it internally.

Apr 01 '24 13:04 cpcloud

One approach that will break user code, but is probably the right approach IMO is to use some other sentinel value to mean "argument wasn't provided".

Bit of a hack, but fairly common is to create a dummy object like NO_ARGUMENT = object(), set that to the default value wherever None is used, and then use arg is NO_ARGUMENT to check for whether an argument was priovided.

Apr 01 '24 13:04 cpcloud

I find the NO_ARGUMENT solution fairly good, I think it might be worth going through a deprecation cycle to get this right.

Apr 01 '24 17:04 NickCrews

wait, would this then mean that for ibis.literal("abcde").substr(2, None), the None gets cast to ibis.null() and the result is therefore NULL? I think this actually might be an even worse footgun than what we have now.

Apr 01 '24 17:04 NickCrews

I'm looking into tackling this one, and while experimenting on duckdb I noticed that:

when no arg is provided for the length, the default is going to the end. Currently this matches Ibis.

select substring('Hello', 2);
┌────────────────────────────┐
│ main.substring('Hello', 2) │
│          varchar           │
├────────────────────────────┤
│ ello                       │
└────────────────────────────┘

If you provide NULL you get a blank result/NULL which also matches the way Ibis is behaving.

select substring('Hello', 2, NULL);
┌──────────────────────────────────┐
│ main.substring('Hello', 2, NULL) │
│             varchar              │
├──────────────────────────────────┤
│                                  │
└──────────────────────────────────┘

Clearly you can't pass 'None' as an argument in DuckDB.

To answer:

wait, would this then mean that for ibis.literal("abcde").substr(2, None), the None gets cast to ibis.null() and the result is therefore NULL? I think this actually might be an even worse footgun than what we have now.

I think, that would be the behavior. Since if someone purposely passes None , to mean that would mean they want to pass NULL.

Unless I'm missing something here. But I agree that the way that the docs read here can be confusing. https://ibis-project.org/reference/expression-strings.html#ibis.expr.types.strings.StringValue.substr

It could be misunderstood as if None is explicitly passed that means that they'll get the entire remainder of the string.

I think this is something we can fix with docs, or more examples in the docstrings.

May 21 '24 17:05 ncclementi

Naty, your understanding of the above is correct.

I actually think the current length=None behavior makes sense, and is certainly easier to document than adding a new sentinel object (which programmatic users would have to import, and we would also have to expose the type of to properly type annotate a method). There are also some APIs that have None as an option that does mean something (e.g. keep=None in distinct).

If we do decide to go with a new sentinel value to represent no argument provided, we'd want to do it in a bit more elaborate way to properly support type annotations and mypy. This unfortunately a bit verbose, but is the recommended pattern for now.

import enum

# Other good names might be UNDEFINED, NO_ARGUMENT, or NO_VALUE.
# I have a slight preference for a name lacking an underscore, but
# no strong thoughts.
class UnsetType(enum.Enum):
    UNSET = "UNSET"

    # These methods aren't strictly needed, but make for nicer reprs of when
    # introspecting function signatures
    def __str__(self):
        return "UNSET"

    __repr__ = __str__


UNSET = UnsetType.UNSET


def my_function(x: int, y: int | UnsetType = UNSET) -> int:
    if y is not UNSET:
        return x + y
    return x + 1

Anyway, my preference would be to fix this by updating docstrings as needed, but I wouldn't be opposed to switching to a different sentinel for no argument. IF we do that, I'd vote we do it as a breaking change rather than trying to deprecate explicit None usage in those cases, since managing all that code sounds a bit tricky and would only probably matter to users using ibis in a programmatic manner.

May 22 '24 19:05 jcrist

I was looking into this again on the Ibis side and I think now I'm more inclined to think about the "UNSET" implementation, because from the user perspective, this behavior below is very confusing:

In [1]: import ibis

In [2]: ibis.options.interactive = True

In [3]: ibis.literal("abcde").substr(2)
Out[3]: 'cde'  

In [4]: ibis.literal("abcde").substr(2, None)
Out[4]: 'cde'   

In [5]: ibis.literal(1).nullif(1)
Out[5]: None
 
In [6]: ibis.literal("abcde").substr(2, ibis.literal(1).nullif(1))
Out[6]: None

I can't find a good way of explaining this behavior in the docs. I think switching to a new no argument definition is the way to go here. ~~I'll open a PR with this~~

EDIT: TIL that Out[5] is not a python None. Having the repr making this clear would solve this problem in my opinion.

May 24 '24 15:05 ncclementi

@NickCrews - from reading the above (and also talking to @ncclementi just now about her example above), I'm wondering if part of this stems from our interactive repr not distinguishing scalars from their evaluated results. Using Naty's example above:

This line reprs as None, but it isn't None (it's actually an ibis scalar value), we just display it that way.

In [5]: ibis.literal(1).nullif(1)
Out[5]: None

Then when it's passed to create a larger expression, you may expect it to behave the same as if None actually was passed in (reasonably so, since it was repr'd that way). This leads to an unexpected result, since the length arg isn't actually None, we just displayed it that way.

In [6]: ibis.literal("abcde").substr(2, ibis.literal(1).nullif(1))
Out[6]: None

If instead we updated the repr so that we better distinguished ibis values from their evaluated in-memory counterparts, perhaps this confusion would be avoided. For example, say cell 5 repr'd this way instead:

In [5]: ibis.literal(1).nullif(1)
Out[5]:
┌──────┐ 
│ NULL │
└──────┘

In this case it's clear that Out[5] isn't a None value, so there's less reason to believe it would behave the same as passing in length=None to the substr call. See #9247 for more discussion.

May 24 '24 18:05 jcrist

I agree that the repr is a bit confusing, but I think that is orthogonal.

I want passing a python None and a runtime NULL to behave the same. This is a little different from SQL, but I think it is more ergonomic, for example for some rows selecting the entire rest of the string, but for others limiting the length. This is actually a larger issue, for example if we did this then we would also want to change the behavior of StringValue.find(), Value.between(), and maybe others to match

To be concrete, I would like these to be the semantics:

import ibis

ibis.options.interactive = True

t = ibis.memtable({"a": ["abc", "def"]})
t = t.mutate(
    # I want ["bc", "ef"].
    t.a.substr(1).name("implicit"),
    # I want ["bc", "ef"]
    # Has some compile time optimization, generates the SQL SUBSTR(a, 1)
    t.a.substr(1, None).name("explicit_none"),
    # I want ["bc", "ef"]
    # Has some compile time optimization, generates the SQL SUBSTR(a, 1)
    t.a.substr(1, ibis.null()).name("explicit_null"),
    # I want ["b", "ef"]
    # Since we can't know the NULLness of length at compile time, we have to do
    # SUBSTR(a, 1, COALESCE(<length>, LENGTH(a))
    t.a.substr(1, t.a.contains("a").ifelse(1, None)).name("runtime"),
)
t

Would this be possible?

May 24 '24 19:05 NickCrews

Making null behave like a default value is not consistent with basically all of the rest of Ibis, so I don't think we can implement this behavior.

NULL and None can both behave like NULL, that's pretty much the only way I can see to make everything consistent.

May 24 '24 19:05 cpcloud

t.a.substr(1, ibis.null()).name("explicit_null"),
# I want ["b", "ef"]

This currently, as is, doesn't work, it'll error with TypeError because substr expects length: int | ir.IntegerValue | None = None,. But if it were to be changed to match @NickCrews asked behavior, that would be incompatible to what duckdb is doing when it gets a NULL passed. I think having a different behavior than the backend can be the root of a lot of problems.

May 25 '24 00:05 ncclementi

This PR https://github.com/ibis-project/ibis/pull/9265 was merged, and hopefully it brings more clarity to the original confusion

I'd be inclined to close this issue. Any reservations?

Jun 03 '24 15:06 ncclementi

ibis ibis copied to clipboard

bug: many Operations can't handle NULL optional arguments

What happened?

What version of ibis are you using?

What backend(s) are you using, if any?

Relevant log output

Code of Conduct

ibis
ibis copied to clipboard