ibis
ibis copied to clipboard
bug: MSSQL syntax error
What happened?
Nested mutate() generates invalid SQL query. See reprex adapted from https://github.com/posit-dev/pointblank/blob/main/pointblank/_interrogation.py#L91-L152:
# %%
import ibis
import pandas as pd
ibis.options.interactive = True
# dataframe
data = {"col1": [1, 2, 3], "col2": ["A", "B", "C"]}
df = pd.DataFrame(data)
# mssql con
con = ibis.mssql.connect(
user=input("User: "),
password=input("Password: "),
host=input("Host: "),
driver="SQL Server",
database=input("db: ")
)
con.create_table("test_table", df)
t = con.table("test_table")
na_pass = False
column = "col1"
compare = 0
# %%
tbl = t.mutate(
pb_is_good_1=t[column].isnull() & ibis.literal(na_pass),
pb_is_good_2=t[column] > ibis.literal(compare),
)
tbl = tbl.mutate(pb_is_good_2=ibis.ifelse(tbl.pb_is_good_2.notnull(), tbl.pb_is_good_2, False))
result_tbl = tbl.mutate(pb_is_good_=tbl.pb_is_good_1 | tbl.pb_is_good_2).drop(
"pb_is_good_1", "pb_is_good_2"
)
sql_query = ibis.to_sql(result_tbl)
print(sql_query)
# %%
Is this a bug or misuse of mutate()?
What version of ibis are you using?
10.3.1
What backend(s) are you using, if any?
MSSQL
Relevant log output
Here's SQL generated:
SELECT
[t1].[col1],
[t1].[col2],
IIF([t1].[pb_is_good_1] <> 0 OR [t1].[pb_is_good_2] <> 0, 1, 0) AS [pb_is_good_]
FROM (
SELECT
[t0].[col1],
[t0].[col2],
(
[t0].[col1] IS NULL -------------------------------------- Invalid syntax
) AND (1 = 0) AS [pb_is_good_1],---------------------------- Invalid syntax
IIF((
[t0].[col1] > 0
) IS NOT NULL, [t0].[col1] > 0, (1 = 0)) AS [pb_is_good_2]
FROM [test_table] AS [t0]
) AS [t1]
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Hey @albersonmiranda -- can you confirm that Ibis version number? ~3.3.0 is quite old~ I don't believe there is a version 3.3.0
@gforsyth I'm sorry, user mistake (I've previously installed ibis). Version is 2.1.1. Error still occurs in a fresh venv.
@albersonmiranda -- that version is several years old. Current release is 10.3.1 -- is there something in your environment preventing an upgrade?
Took from wrong venv again... @gforsyth here's correct output:
(.venv) PS C:\Users\030075675\Documents\bees_projects\ghm_banestes> python -V
Python 3.12.7
(.venv) PS C:\Users\030075675\Documents\bees_projects\ghm_banestes> pip show ibis-framework
Name: ibis-framework
Version: 10.3.1
Summary: The portable Python dataframe library
Home-page: https://ibis-project.org
Author:
Author-email: Ibis Maintainers <[email protected]>
License: Apache-2.0
Location: C:\Users\030075675\Documents\bees_projects\ghm_banestes\.venv\Lib\site-packages
Requires: atpublic, parsy, python-dateutil, sqlglot, toolz, typing-extensions, tzdata
Required-by:
And query again:
print(sql_query)
SELECT
[t1].[col1],
[t1].[col2],
IIF([t1].[pb_is_good_1] <> 0 OR [t1].[pb_is_good_2] <> 0, 1, 0) AS [pb_is_good_]
FROM (
SELECT
[t0].[col1],
[t0].[col2],
(
[t0].[col1] IS NULL
) AND (1 = 0) AS [pb_is_good_1],
IIF((
[t0].[col1] > 0
) IS NOT NULL, [t0].[col1] > 0, (1 = 0)) AS [pb_is_good_2]
FROM [test_table] AS [t0]
) AS [t1]
Thanks @albersonmiranda ! This does look like a bug and a moderately gnarly one.
I don't think your usage is incorrect.
A simplified example shows that we're initially doing the right thing:
[ins] In [42]: t = ibis.table({'a': 'str', 'b': 'str'})
[ins] In [43]: expr = t.mutate(c=t.a.isnull() & ibis.literal(False))
[nav] In [44]: ibis.to_sql(expr, dialect="mssql")
Out[44]:
SELECT
[t0].[a],
[t0].[b],
IIF((
[t0].[a] IS NULL
) AND (1 = 0), 1, 0) AS [c]
FROM [unbound_table_3] AS [t0]
but when we then use an ifelse and generate a subquery, but the syntax inside the subquery is incorrect.
[ins] In [45]: expr2 = expr.mutate(d=ibis.ifelse(expr.c.notnull(), expr.c, False))
[ins] In [46]: ibis.to_sql(expr2, dialect="mssql")
Out[46]:
SELECT
[t1].[a],
[t1].[b],
IIF([t1].[c] <> 0, 1, 0) AS [c],
IIF(IIF([t1].[c] IS NOT NULL, [t1].[c], (1 = 0)), 1, 0) AS [d]
FROM (
SELECT
[t0].[a],
[t0].[b],
(
[t0].[a] IS NULL
) AND (1 = 0) AS [c]
FROM [unbound_table_3] AS [t0]
) AS [t1]
@gforsyth I see you've got a pretty big backlog. Any pointers on which file the bug might be in, so I can take a shot at fixing it?
@albersonmiranda -- I think this is likely a pretty deep rabbit hole. I took a poke at it, and in the MSSQL compiler, we call _to_sqlglot to convert the ibis expression to a sqlglot expression:
https://github.com/ibis-project/ibis/blob/main/ibis/backends/sql/compilers/mssql.py#L159-L178
as a part of that, we run this conversion code:
conversions = {
name: ibis.ifelse(table_expr[name], 1, 0).cast(dt.boolean)
for name, typ in table_expr.schema().items()
if typ.is_boolean()
}
to convert any boolean types to explicit ifelse because mssql has terrible handling for booleans.
I don't think the issue is actually here, but if we compare the table expression after those conversions are applied, there's an indication of the problem:
r0 := UnboundTable: unbound_table_0
a string
b string
r1 := Project[r0]
a: r0.a
b: r0.b
c: IsNull(r0.a) & False
r2 := Project[r1]
a: r1.a
b: r1.b
c: r1.c
d: IfElse(bool_expr=NotNull(r1.c), true_expr=r1.c, false_null_expr=False)
Project[r2]
a: r2.a
b: r2.b
c: Cast(IfElse(bool_expr=r2.c, true_expr=1, false_null_expr=0), to=boolean)
d: Cast(IfElse(bool_expr=r2.d, true_expr=1, false_null_expr=0), to=boolean)
Note that the final projection (r2) includes Cast(..., to=boolean) around the IfElse and that is being correctly transpiled -- however, the IfElse inside the subquery (r1) is not explicitly cast to a bool.
I'm fairly sure that's the problem -- I suspect we may need an explicit rule in the mssql compiler to handle the IfElse node and ensure that it is wrapped in an explicit cast.
@gforsyth Thanks for the detailed instructions! I'm getting close but not there yet:
def visit_IfElse(self, op, *, bool_expr, true_expr, false_null_expr):
if isinstance(bool_expr, sge.Boolean):
processed_bool_expr = bool_expr
else:
is_comparison = isinstance(bool_expr, (sge.Is, sge.EQ, sge.NEQ,
sge.GT, sge.LT, sge.GTE, sge.LTE))
processed_bool_expr = bool_expr if is_comparison else bool_expr.neq(0)
if op.true_expr.dtype.is_boolean():
processed_true_expr = true_expr if isinstance(true_expr, (sge.Literal, int)) else self.if_(true_expr, 1, 0)
else:
processed_true_expr = true_expr
if false_null_expr is not None and op.false_null_expr is not None and op.false_null_expr.dtype.is_boolean():
processed_false_expr = false_null_expr if isinstance(false_null_expr, (sge.Literal, int)) else self.if_(false_null_expr, 1, 0)
else:
processed_false_expr = false_null_expr
return sge.If(
this=processed_bool_expr,
true=processed_true_expr,
false=processed_false_expr,
)
def visit_And(self, op, *, left, right):
result = super().visit_And(op, left=left, right=right)
return self.if_(result, 1, 0)
def visit_Or(self, op, *, left, right):
result = super().visit_Or(op, left=left, right=right)
return self.if_(result, 1, 0)
def visit_IsNull(self, op, *, arg):
return self.if_(arg.is_(NULL), 1, 0)
def visit_NotNull(self, op, *, arg):
return self.if_(arg.is_(NULL).not_(), 1, 0)
For my example, this generates:
SELECT
[t1].[col1],
[t1].[col2],
IIF(IIF([t1].[pb_is_good_1] <> 0 OR [t1].[pb_is_good_2] <> 0, 1, 0) <> 0, 1, 0) AS [pb_is_good_]
FROM (
SELECT
[t0].[col1],
[t0].[col2],
IIF(IIF([t0].[col1] IS NULL, 1, 0) AND (1 = 0), 1, 0) AS [pb_is_good_1],
IIF(
IIF(NOT (
[t0].[col1] > 0
) IS NULL, 1, 0) <> 0,
IIF([t0].[col1] > 0, 1, 0),
IIF((1 = 0), 1, 0)
) AS [pb_is_good_2]
FROM [test_table] AS [t0]
) AS [t1]
@albersonmiranda -- another option that might work is to add a rewrite rule to the mssql compiler