ibis icon indicating copy to clipboard operation
ibis copied to clipboard

bug: MSSQL syntax error

Open albersonmiranda opened this issue 8 months ago • 9 comments

What happened?

Nested mutate() generates invalid SQL query. See reprex adapted from https://github.com/posit-dev/pointblank/blob/main/pointblank/_interrogation.py#L91-L152:

# %%
import ibis
import pandas as pd

ibis.options.interactive = True

# dataframe
data = {"col1": [1, 2, 3], "col2": ["A", "B", "C"]}
df = pd.DataFrame(data)

# mssql con
con = ibis.mssql.connect(
    user=input("User: "),
    password=input("Password: "),
    host=input("Host: "),
    driver="SQL Server",
    database=input("db: ")
)

con.create_table("test_table", df)
t = con.table("test_table")

na_pass = False
column = "col1"
compare = 0

# %%
tbl = t.mutate(
    pb_is_good_1=t[column].isnull() & ibis.literal(na_pass),
    pb_is_good_2=t[column] > ibis.literal(compare),
)

tbl = tbl.mutate(pb_is_good_2=ibis.ifelse(tbl.pb_is_good_2.notnull(), tbl.pb_is_good_2, False))

result_tbl = tbl.mutate(pb_is_good_=tbl.pb_is_good_1 | tbl.pb_is_good_2).drop(
    "pb_is_good_1", "pb_is_good_2"
)

sql_query = ibis.to_sql(result_tbl)
print(sql_query)
# %%

Is this a bug or misuse of mutate()?

What version of ibis are you using?

10.3.1

What backend(s) are you using, if any?

MSSQL

Relevant log output

Here's SQL generated:


SELECT
  [t1].[col1],
  [t1].[col2],
  IIF([t1].[pb_is_good_1] <> 0 OR [t1].[pb_is_good_2] <> 0, 1, 0) AS [pb_is_good_]
FROM (
  SELECT
    [t0].[col1],
    [t0].[col2],
    (
      [t0].[col1] IS NULL -------------------------------------- Invalid syntax
    ) AND (1 = 0) AS [pb_is_good_1],---------------------------- Invalid syntax
    IIF((
      [t0].[col1] > 0
    ) IS NOT NULL, [t0].[col1] > 0, (1 = 0)) AS [pb_is_good_2]
  FROM [test_table] AS [t0]
) AS [t1]

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

albersonmiranda avatar Mar 19 '25 19:03 albersonmiranda

Hey @albersonmiranda -- can you confirm that Ibis version number? ~3.3.0 is quite old~ I don't believe there is a version 3.3.0

gforsyth avatar Mar 19 '25 21:03 gforsyth

@gforsyth I'm sorry, user mistake (I've previously installed ibis). Version is 2.1.1. Error still occurs in a fresh venv.

albersonmiranda avatar Mar 19 '25 23:03 albersonmiranda

@albersonmiranda -- that version is several years old. Current release is 10.3.1 -- is there something in your environment preventing an upgrade?

gforsyth avatar Mar 20 '25 12:03 gforsyth

Took from wrong venv again... @gforsyth here's correct output:

(.venv) PS C:\Users\030075675\Documents\bees_projects\ghm_banestes> python -V
Python 3.12.7
(.venv) PS C:\Users\030075675\Documents\bees_projects\ghm_banestes> pip show ibis-framework
Name: ibis-framework
Version: 10.3.1
Summary: The portable Python dataframe library
Home-page: https://ibis-project.org
Author:
Author-email: Ibis Maintainers <[email protected]>
License: Apache-2.0
Location: C:\Users\030075675\Documents\bees_projects\ghm_banestes\.venv\Lib\site-packages
Requires: atpublic, parsy, python-dateutil, sqlglot, toolz, typing-extensions, tzdata
Required-by:

And query again:

print(sql_query)

SELECT
  [t1].[col1],
  [t1].[col2],
  IIF([t1].[pb_is_good_1] <> 0 OR [t1].[pb_is_good_2] <> 0, 1, 0) AS [pb_is_good_]
FROM (
  SELECT
    [t0].[col1],
    [t0].[col2],
    (
      [t0].[col1] IS NULL
    ) AND (1 = 0) AS [pb_is_good_1],
    IIF((
      [t0].[col1] > 0
    ) IS NOT NULL, [t0].[col1] > 0, (1 = 0)) AS [pb_is_good_2]
  FROM [test_table] AS [t0]
) AS [t1]

albersonmiranda avatar Mar 20 '25 12:03 albersonmiranda

Thanks @albersonmiranda ! This does look like a bug and a moderately gnarly one.

I don't think your usage is incorrect.

A simplified example shows that we're initially doing the right thing:

[ins] In [42]: t = ibis.table({'a': 'str', 'b': 'str'})

[ins] In [43]: expr = t.mutate(c=t.a.isnull() & ibis.literal(False))

[nav] In [44]: ibis.to_sql(expr, dialect="mssql")
Out[44]: 
SELECT
  [t0].[a],
  [t0].[b],
  IIF((
    [t0].[a] IS NULL
  ) AND (1 = 0), 1, 0) AS [c]
FROM [unbound_table_3] AS [t0]

but when we then use an ifelse and generate a subquery, but the syntax inside the subquery is incorrect.

[ins] In [45]: expr2 = expr.mutate(d=ibis.ifelse(expr.c.notnull(), expr.c, False))

[ins] In [46]: ibis.to_sql(expr2, dialect="mssql")
Out[46]: 
SELECT
  [t1].[a],
  [t1].[b],
  IIF([t1].[c] <> 0, 1, 0) AS [c],
  IIF(IIF([t1].[c] IS NOT NULL, [t1].[c], (1 = 0)), 1, 0) AS [d]
FROM (
  SELECT
    [t0].[a],
    [t0].[b],
    (
      [t0].[a] IS NULL
    ) AND (1 = 0) AS [c]
  FROM [unbound_table_3] AS [t0]
) AS [t1]

gforsyth avatar Mar 20 '25 13:03 gforsyth

@gforsyth I see you've got a pretty big backlog. Any pointers on which file the bug might be in, so I can take a shot at fixing it?

albersonmiranda avatar Mar 20 '25 15:03 albersonmiranda

@albersonmiranda -- I think this is likely a pretty deep rabbit hole. I took a poke at it, and in the MSSQL compiler, we call _to_sqlglot to convert the ibis expression to a sqlglot expression:

https://github.com/ibis-project/ibis/blob/main/ibis/backends/sql/compilers/mssql.py#L159-L178

as a part of that, we run this conversion code:

        conversions = {
            name: ibis.ifelse(table_expr[name], 1, 0).cast(dt.boolean)
            for name, typ in table_expr.schema().items()
            if typ.is_boolean()
        }

to convert any boolean types to explicit ifelse because mssql has terrible handling for booleans.

I don't think the issue is actually here, but if we compare the table expression after those conversions are applied, there's an indication of the problem:

r0 := UnboundTable: unbound_table_0
  a string
  b string

r1 := Project[r0]
  a: r0.a
  b: r0.b
  c: IsNull(r0.a) & False

r2 := Project[r1]
  a: r1.a
  b: r1.b
  c: r1.c
  d: IfElse(bool_expr=NotNull(r1.c), true_expr=r1.c, false_null_expr=False)

Project[r2]
  a: r2.a
  b: r2.b
  c: Cast(IfElse(bool_expr=r2.c, true_expr=1, false_null_expr=0), to=boolean)
  d: Cast(IfElse(bool_expr=r2.d, true_expr=1, false_null_expr=0), to=boolean)

Note that the final projection (r2) includes Cast(..., to=boolean) around the IfElse and that is being correctly transpiled -- however, the IfElse inside the subquery (r1) is not explicitly cast to a bool.

I'm fairly sure that's the problem -- I suspect we may need an explicit rule in the mssql compiler to handle the IfElse node and ensure that it is wrapped in an explicit cast.

gforsyth avatar Mar 20 '25 23:03 gforsyth

@gforsyth Thanks for the detailed instructions! I'm getting close but not there yet:

def visit_IfElse(self, op, *, bool_expr, true_expr, false_null_expr):
        if isinstance(bool_expr, sge.Boolean):
            processed_bool_expr = bool_expr
        else:
            is_comparison = isinstance(bool_expr, (sge.Is, sge.EQ, sge.NEQ, 
                                                sge.GT, sge.LT, sge.GTE, sge.LTE))
            processed_bool_expr = bool_expr if is_comparison else bool_expr.neq(0)

        if op.true_expr.dtype.is_boolean():
            processed_true_expr = true_expr if isinstance(true_expr, (sge.Literal, int)) else self.if_(true_expr, 1, 0)
        else:
            processed_true_expr = true_expr

        if false_null_expr is not None and op.false_null_expr is not None and op.false_null_expr.dtype.is_boolean():
            processed_false_expr = false_null_expr if isinstance(false_null_expr, (sge.Literal, int)) else self.if_(false_null_expr, 1, 0)
        else:
            processed_false_expr = false_null_expr

        return sge.If(
            this=processed_bool_expr,
            true=processed_true_expr,
            false=processed_false_expr,
        )
    
    def visit_And(self, op, *, left, right):
        result = super().visit_And(op, left=left, right=right)
        return self.if_(result, 1, 0)

    def visit_Or(self, op, *, left, right):
        result = super().visit_Or(op, left=left, right=right)
        return self.if_(result, 1, 0)

    def visit_IsNull(self, op, *, arg):
        return self.if_(arg.is_(NULL), 1, 0)
    
    def visit_NotNull(self, op, *, arg):
        return self.if_(arg.is_(NULL).not_(), 1, 0)

For my example, this generates:

SELECT
  [t1].[col1],
  [t1].[col2],
  IIF(IIF([t1].[pb_is_good_1] <> 0 OR [t1].[pb_is_good_2] <> 0, 1, 0) <> 0, 1, 0) AS [pb_is_good_]
FROM (
  SELECT
    [t0].[col1],
    [t0].[col2],
    IIF(IIF([t0].[col1] IS NULL, 1, 0) AND (1 = 0), 1, 0) AS [pb_is_good_1],
    IIF(
      IIF(NOT (
        [t0].[col1] > 0
      ) IS NULL, 1, 0) <> 0,
      IIF([t0].[col1] > 0, 1, 0),
      IIF((1 = 0), 1, 0)
    ) AS [pb_is_good_2]
  FROM [test_table] AS [t0]
) AS [t1]

albersonmiranda avatar Mar 24 '25 03:03 albersonmiranda

@albersonmiranda -- another option that might work is to add a rewrite rule to the mssql compiler

gforsyth avatar Mar 25 '25 12:03 gforsyth