ibis
ibis copied to clipboard
bug: many Operations can't handle NULL optional arguments
What happened?
consider the implementation of
@public
class Substring(Value):
arg: Value[dt.String]
start: Value[dt.Integer]
length: Optional[Value[dt.Integer]] = None
This allows you to leave out the length argument, meaning "until the end of the string". But this is only represented on the python side. If the length
is evaluated at runtime to be NULL, then this errors on some backends, like postgres, or gives the wrong result on duckdb.
For example, ibis.literal("abcde").substr(2, ibis.literal(1).nullif(1))
results in NULL in duckdb, but I would expect it to be "cde".
During compilation, we are naive and do the NULL checking only on the python side:
def visit_Substring(self, op, *, arg, start, length):
start += 1
arg_length = self.f.length(arg)
if length is None:
return self.if_(
start >= 1,
self.f.substring(arg, start),
self.f.substring(arg, start + arg_length),
)
return self.if_(
start >= 1,
self.f.substring(arg, start, length),
self.f.substring(arg, start + arg_length, length),
)
I discovered this when adding more tests to #8832 .
I think what we should do is make Substring more like
@public
class Substring(Value):
arg: Value[dt.String]
start: Value[dt.Integer]
length:Value[dt.Integer] = ibis.null()
and then use sql CASE statements if we can't determine the nullness statically.
but I wasn't able to get this to work. Does this seem like the right direction? Any tips on what to do here?
What version of ibis are you using?
main
What backend(s) are you using, if any?
No response
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
None indicates that the argument wasn't provided, NULL is something else. We should avoid conflating these two uses.
The failures notwithstanding, the duckdb behavior looks correct to me in this case: null inputs produce null outputs, NULL doesn't mean "no argument".
The representation of "argument not provided" is unrelated to whether an input is NULL.
hmmm, I think you are right. We are victims of the SQL standard here. I don't really like this footgun:
-
ibis.literal("abcde").substr(2, None)
gives "cde", but -
ibis.literal("abcde").substr(2, ibis.null())
givesNULL
The only way to get around this would be if we did alength = length.fillnull(arg.length())
during compilation, but then this would be unexpected for more experienced SQL users, and less performant. Maybe just better documentation? This same thing exists for many other ops (eg StringFind with the optionalstart
andend
), but probably these are less used.
I think there are other approaches to the problem. I'll state the problem so that we have it written somewhere.
Ibis uses None
for two things: 1) the default value for optional arguments and 2) a value that users can use for SQL's NULL
.
Users explicitly passing None
expect it to behave like they explicitly passed ibis.null()
or an equivalent such as ibis.NA
.
Since None
is being co-opted for use as the "no-argument" sentinel value, Ibis interprets it that way instead of as NULL
, leading to a divergence in behavior between the user-facing meaning of None
versus how we use it internally.
One approach that will break user code, but is probably the right approach IMO is to use some other sentinel value to mean "argument wasn't provided".
Bit of a hack, but fairly common is to create a dummy object like NO_ARGUMENT = object()
, set that to the default value wherever None
is used, and then use arg is NO_ARGUMENT
to check for whether an argument was priovided.
I find the NO_ARGUMENT solution fairly good, I think it might be worth going through a deprecation cycle to get this right.
wait, would this then mean that for ibis.literal("abcde").substr(2, None)
, the None gets cast to ibis.null()
and the result is therefore NULL? I think this actually might be an even worse footgun than what we have now.
I'm looking into tackling this one, and while experimenting on duckdb I noticed that:
- when no arg is provided for the length, the default is going to the end. Currently this matches Ibis.
select substring('Hello', 2);
┌────────────────────────────┐
│ main.substring('Hello', 2) │
│ varchar │
├────────────────────────────┤
│ ello │
└────────────────────────────┘
- If you provide
NULL
you get a blank result/NULL which also matches the way Ibis is behaving.
select substring('Hello', 2, NULL);
┌──────────────────────────────────┐
│ main.substring('Hello', 2, NULL) │
│ varchar │
├──────────────────────────────────┤
│ │
└──────────────────────────────────┘
- Clearly you can't pass 'None' as an argument in DuckDB.
To answer:
wait, would this then mean that for ibis.literal("abcde").substr(2, None), the None gets cast to ibis.null() and the result is therefore NULL? I think this actually might be an even worse footgun than what we have now.
I think, that would be the behavior. Since if someone purposely passes None
, to mean that would mean they want to pass NULL.
Unless I'm missing something here. But I agree that the way that the docs read here can be confusing. https://ibis-project.org/reference/expression-strings.html#ibis.expr.types.strings.StringValue.substr
It could be misunderstood as if None
is explicitly passed that means that they'll get the entire remainder of the string.
I think this is something we can fix with docs, or more examples in the docstrings.
Naty, your understanding of the above is correct.
I actually think the current length=None
behavior makes sense, and is certainly easier to document than adding a new sentinel object (which programmatic users would have to import, and we would also have to expose the type of to properly type annotate a method). There are also some APIs that have None
as an option that does mean something (e.g. keep=None
in distinct
).
If we do decide to go with a new sentinel value to represent no argument provided, we'd want to do it in a bit more elaborate way to properly support type annotations and mypy
. This unfortunately a bit verbose, but is the recommended pattern for now.
import enum
# Other good names might be UNDEFINED, NO_ARGUMENT, or NO_VALUE.
# I have a slight preference for a name lacking an underscore, but
# no strong thoughts.
class UnsetType(enum.Enum):
UNSET = "UNSET"
# These methods aren't strictly needed, but make for nicer reprs of when
# introspecting function signatures
def __str__(self):
return "UNSET"
__repr__ = __str__
UNSET = UnsetType.UNSET
def my_function(x: int, y: int | UnsetType = UNSET) -> int:
if y is not UNSET:
return x + y
return x + 1
Anyway, my preference would be to fix this by updating docstrings as needed, but I wouldn't be opposed to switching to a different sentinel for no argument. IF we do that, I'd vote we do it as a breaking change rather than trying to deprecate explicit None
usage in those cases, since managing all that code sounds a bit tricky and would only probably matter to users using ibis in a programmatic manner.
I was looking into this again on the Ibis side and I think now I'm more inclined to think about the "UNSET"
implementation, because from the user perspective, this behavior below is very confusing:
In [1]: import ibis
In [2]: ibis.options.interactive = True
In [3]: ibis.literal("abcde").substr(2)
Out[3]: 'cde'
In [4]: ibis.literal("abcde").substr(2, None)
Out[4]: 'cde'
In [5]: ibis.literal(1).nullif(1)
Out[5]: None
In [6]: ibis.literal("abcde").substr(2, ibis.literal(1).nullif(1))
Out[6]: None
I can't find a good way of explaining this behavior in the docs. I think switching to a new no argument
definition is the way to go here. ~~I'll open a PR with this~~
EDIT: TIL that Out[5]
is not a python None
. Having the repr making this clear would solve this problem in my opinion.
@NickCrews - from reading the above (and also talking to @ncclementi just now about her example above), I'm wondering if part of this stems from our interactive repr not distinguishing scalars from their evaluated results. Using Naty's example above:
This line reprs as None
, but it isn't None
(it's actually an ibis scalar value), we just display it that way.
In [5]: ibis.literal(1).nullif(1)
Out[5]: None
Then when it's passed to create a larger expression, you may expect it to behave the same as if None
actually was passed in (reasonably so, since it was repr'd that way). This leads to an unexpected result, since the length
arg isn't actually None
, we just displayed it that way.
In [6]: ibis.literal("abcde").substr(2, ibis.literal(1).nullif(1))
Out[6]: None
If instead we updated the repr so that we better distinguished ibis values from their evaluated in-memory counterparts, perhaps this confusion would be avoided. For example, say cell 5 repr'd this way instead:
In [5]: ibis.literal(1).nullif(1)
Out[5]:
┌──────┐
│ NULL │
└──────┘
In this case it's clear that Out[5]
isn't a None
value, so there's less reason to believe it would behave the same as passing in length=None
to the substr
call. See #9247 for more discussion.
I agree that the repr is a bit confusing, but I think that is orthogonal.
I want passing a python None and a runtime NULL to behave the same. This is a little different from SQL, but I think it is more ergonomic, for example for some rows selecting the entire rest of the string, but for others limiting the length. This is actually a larger issue, for example if we did this then we would also want to change the behavior of StringValue.find(), Value.between(), and maybe others to match
To be concrete, I would like these to be the semantics:
import ibis
ibis.options.interactive = True
t = ibis.memtable({"a": ["abc", "def"]})
t = t.mutate(
# I want ["bc", "ef"].
t.a.substr(1).name("implicit"),
# I want ["bc", "ef"]
# Has some compile time optimization, generates the SQL SUBSTR(a, 1)
t.a.substr(1, None).name("explicit_none"),
# I want ["bc", "ef"]
# Has some compile time optimization, generates the SQL SUBSTR(a, 1)
t.a.substr(1, ibis.null()).name("explicit_null"),
# I want ["b", "ef"]
# Since we can't know the NULLness of length at compile time, we have to do
# SUBSTR(a, 1, COALESCE(<length>, LENGTH(a))
t.a.substr(1, t.a.contains("a").ifelse(1, None)).name("runtime"),
)
t
Would this be possible?
Making null behave like a default value is not consistent with basically all of the rest of Ibis, so I don't think we can implement this behavior.
NULL and None can both behave like NULL, that's pretty much the only way I can see to make everything consistent.
t.a.substr(1, ibis.null()).name("explicit_null"),
# I want ["b", "ef"]
This currently, as is, doesn't work, it'll error with TypeError because substr
expects length: int | ir.IntegerValue | None = None,
. But if it were to be changed to match @NickCrews asked behavior, that would be incompatible to what duckdb
is doing when it gets a NULL
passed. I think having a different behavior than the backend can be the root of a lot of problems.
This PR https://github.com/ibis-project/ibis/pull/9265 was merged, and hopefully it brings more clarity to the original confusion
I'd be inclined to close this issue. Any reservations?