ibis
ibis copied to clipboard
bug: `Table.filter()` returning an invalid Ibis expression for Snowflake
What happened?
Hi Ibis team! Our DVT team encountered an unexpected behavior for Ibis while investigating the issue https://github.com/GoogleCloudPlatform/professional-services-data-validator/issues/1115 regarding filtering not working for Snowflake, but working normally for other backends such as BigQuery and Oracle.
Given the same filter predicate for Snowflake and BigQuery, the filter method (used here in our code / Ibis code line) returning different outputs: SF splitting the predicate as a separate selection (r6) when it should be together with the fields (r5) as it happens for BQ (r5).
Could you please help us understand if this is a bug with Ibis v5.1.0 or if not, how can we better investigate it?
Snowflake output
(Pdb) filtered_table
r0 := AlchemyTable: ISSUE1115
UNIQUE_KEY string
COLUMN1 string
COLUMN2 string
COLUMN3 string
LOAD_DT timestamp
r1 := Selection[r0]
selections:
r0
cast__column1: r0.COLUMN1
cast__column2: r0.COLUMN2
cast__column3: r0.COLUMN3
r2 := Selection[r1]
selections:
r1
ifnull__cast__column1: IfNull(r1.cast__column1, ifnull_expr='DEFAULT_REPLACEMENT_STRING')
ifnull__cast__column2: IfNull(r1.cast__column2, ifnull_expr='DEFAULT_REPLACEMENT_STRING')
ifnull__cast__column3: IfNull(r1.cast__column3, ifnull_expr='DEFAULT_REPLACEMENT_STRING')
r3 := Selection[r2]
selections:
r2
rstrip__ifnull__cast__column1: RStrip(r2.ifnull__cast__column1)
rstrip__ifnull__cast__column2: RStrip(r2.ifnull__cast__column2)
rstrip__ifnull__cast__column3: RStrip(r2.ifnull__cast__column3)
r4 := Selection[r3]
selections:
r3
upper__rstrip__ifnull__cast__column1: Uppercase(r3.rstrip__ifnull__cast__column1)
upper__rstrip__ifnull__cast__column2: Uppercase(r3.rstrip__ifnull__cast__column2)
upper__rstrip__ifnull__cast__column3: Uppercase(r3.rstrip__ifnull__cast__column3)
r5 := Selection[r4]
selections:
r4
concat__all: StringJoin(sep='', [r4.upper__rstrip__ifnull__cast__column1, r4.upper__rstrip__ifnull__cast__column2, r4.upper__rstrip__ifnull__cast__column3])
r6 := Selection[r5]
selections:
concat__all: r5.concat__all
unique_key: r5.UNIQUE_KEY
Selection[r6]
predicates:
RawSQL(r6.concat__all, 'date(LOAD_DT)=current_date')
BigQuery output
(Pdb) filtered_table
r0 := BigQueryTable: pso-kokoro-resources.pso_data_validator.issue1115
unique_key string
column1 string
column2 string
column3 string
load_dt timestamp
r1 := Selection[r0]
selections:
r0
cast__column1: r0.column1
cast__column2: r0.column2
cast__column3: r0.column3
r2 := Selection[r1]
selections:
r1
ifnull__cast__column1: IfNull(r1.cast__column1, ifnull_expr='DEFAULT_REPLACEMENT_STRING')
ifnull__cast__column2: IfNull(r1.cast__column2, ifnull_expr='DEFAULT_REPLACEMENT_STRING')
ifnull__cast__column3: IfNull(r1.cast__column3, ifnull_expr='DEFAULT_REPLACEMENT_STRING')
r3 := Selection[r2]
selections:
r2
rstrip__ifnull__cast__column1: RStrip(r2.ifnull__cast__column1)
rstrip__ifnull__cast__column2: RStrip(r2.ifnull__cast__column2)
rstrip__ifnull__cast__column3: RStrip(r2.ifnull__cast__column3)
r4 := Selection[r3]
selections:
r3
upper__rstrip__ifnull__cast__column1: Uppercase(r3.rstrip__ifnull__cast__column1)
upper__rstrip__ifnull__cast__column2: Uppercase(r3.rstrip__ifnull__cast__column2)
upper__rstrip__ifnull__cast__column3: Uppercase(r3.rstrip__ifnull__cast__column3)
r5 := Selection[r4]
selections:
r4
concat__all: StringJoin(sep='', [r4.upper__rstrip__ifnull__cast__column1, r4.upper__rstrip__ifnull__cast__column2, r4.upper__rstrip__ifnull__cast__column3])
Selection[r5]
selections:
concat__all: r5.concat__all
unique_key: r5.unique_key
predicates:
RawSQL(r5.concat__all, 'date(load_dt)=current_date')
Thank you in advance!
What version of ibis are you using?
5.1.0
What backend(s) are you using, if any?
Snowflake - having the issue
BigQuery - generating as expected
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Snowflake error
03/19/2024 07:29:10 PM-ERROR: Error (snowflake.connector.errors.ProgrammingError) 000904 (42000): SQL compilation error: error line 19 at position 11
invalid identifier 'LOAD_DT'
[SQL: WITH t0 AS
(SELECT t6."UNIQUE_KEY" AS "UNIQUE_KEY", t6."COLUMN1" AS "COLUMN1", t6."COLUMN2" AS "COLUMN2", t6."COLUMN3" AS "COLUMN3", t6."LOAD_DT" AS "LOAD_DT", t6."COLUMN1" AS "cast__column1", t6."COLUMN2" AS "cast__column2", t6."COLUMN3" AS "cast__column3"
FROM pso_data_validator.public."ISSUE1115" AS t6),
t1 AS
(SELECT t0."UNIQUE_KEY" AS "UNIQUE_KEY", t0."COLUMN1" AS "COLUMN1", t0."COLUMN2" AS "COLUMN2", t0."COLUMN3" AS "COLUMN3", t0."LOAD_DT" AS "LOAD_DT", t0."cast__column1" AS "cast__column1", t0."cast__column2" AS "cast__column2", t0."cast__column3" AS "cast__column3", coalesce(t0."cast__column1", %(param_2)s) AS "ifnull__cast__column1", coalesce(t0."cast__column2", %(param_3)s) AS "ifnull__cast__column2", coalesce(t0."cast__column3", %(param_4)s) AS "ifnull__cast__column3"
FROM t0),
t2 AS
(SELECT t1."UNIQUE_KEY" AS "UNIQUE_KEY", t1."COLUMN1" AS "COLUMN1", t1."COLUMN2" AS "COLUMN2", t1."COLUMN3" AS "COLUMN3", t1."LOAD_DT" AS "LOAD_DT", t1."cast__column1" AS "cast__column1", t1."cast__column2" AS "cast__column2", t1."cast__column3" AS "cast__column3", t1."ifnull__cast__column1" AS "ifnull__cast__column1", t1."ifnull__cast__column2" AS "ifnull__cast__column2", t1."ifnull__cast__column3" AS "ifnull__cast__column3", rtrim(t1."ifnull__cast__column1", %(rtrim_1)s) AS "rstrip__ifnull__cast__column1", rtrim(t1."ifnull__cast__column2", %(rtrim_2)s) AS "rstrip__ifnull__cast__column2", rtrim(t1."ifnull__cast__column3", %(rtrim_3)s) AS "rstrip__ifnull__cast__column3"
FROM t1),
t3 AS
(SELECT t2."UNIQUE_KEY" AS "UNIQUE_KEY", t2."COLUMN1" AS "COLUMN1", t2."COLUMN2" AS "COLUMN2", t2."COLUMN3" AS "COLUMN3", t2."LOAD_DT" AS "LOAD_DT", t2."cast__column1" AS "cast__column1", t2."cast__column2" AS "cast__column2", t2."cast__column3" AS "cast__column3", t2."ifnull__cast__column1" AS "ifnull__cast__column1", t2."ifnull__cast__column2" AS "ifnull__cast__column2", t2."ifnull__cast__column3" AS "ifnull__cast__column3", t2."rstrip__ifnull__cast__column1" AS "rstrip__ifnull__cast__column1", t2."rstrip__ifnull__cast__column2" AS "rstrip__ifnull__cast__column2", t2."rstrip__ifnull__cast__column3" AS "rstrip__ifnull__cast__column3", upper(t2."rstrip__ifnull__cast__column1") AS "upper__rstrip__ifnull__cast__column1", upper(t2."rstrip__ifnull__cast__column2") AS "upper__rstrip__ifnull__cast__column2", upper(t2."rstrip__ifnull__cast__column3") AS "upper__rstrip__ifnull__cast__column3"
FROM t2),
t4 AS
(SELECT t3."UNIQUE_KEY" AS "UNIQUE_KEY", t3."COLUMN1" AS "COLUMN1", t3."COLUMN2" AS "COLUMN2", t3."COLUMN3" AS "COLUMN3", t3."LOAD_DT" AS "LOAD_DT", t3."cast__column1" AS "cast__column1", t3."cast__column2" AS "cast__column2", t3."cast__column3" AS "cast__column3", t3."ifnull__cast__column1" AS "ifnull__cast__column1", t3."ifnull__cast__column2" AS "ifnull__cast__column2", t3."ifnull__cast__column3" AS "ifnull__cast__column3", t3."rstrip__ifnull__cast__column1" AS "rstrip__ifnull__cast__column1", t3."rstrip__ifnull__cast__column2" AS "rstrip__ifnull__cast__column2", t3."rstrip__ifnull__cast__column3" AS "rstrip__ifnull__cast__column3", t3."upper__rstrip__ifnull__cast__column1" AS "upper__rstrip__ifnull__cast__column1", t3."upper__rstrip__ifnull__cast__column2" AS "upper__rstrip__ifnull__cast__column2", t3."upper__rstrip__ifnull__cast__column3" AS "upper__rstrip__ifnull__cast__column3", concat_ws(%(param_1)s, t3."upper__rstrip__ifnull__cast__column1", t3."upper__rstrip__ifnull__cast__column2", t3."upper__rstrip__ifnull__cast__column3") AS "concat__all"
FROM t3)
SELECT t5."concat__all", t5."unique_key"
FROM (SELECT t4."concat__all" AS "concat__all", t4."UNIQUE_KEY" AS "unique_key"
FROM t4) AS t5
WHERE date(LOAD_DT)=current_date]
[parameters: {'param_1': '', 'rtrim_1': ' \t\n\r\x0b\x0c', 'rtrim_2': ' \t\n\r\x0b\x0c', 'rtrim_3': ' \t\n\r\x0b\x0c', 'param_2': 'DEFAULT_REPLACEMENT_STRING', 'param_3': 'DEFAULT_REPLACEMENT_STRING', 'param_4': 'DEFAULT_REPLACEMENT_STRING'}]
Hmmmm, without access to the original ibis expression, it's hard to say whether this has been fixed since ibis 5.1. Given that it works on other backends it's likely that it's a bug in ibis's snowflake backend, but beyond that I can't say more. Ibis 5.0 is a fairly old release (we're about to release ibis 9.0), so it's likely (but not certain) that the issue has been fixed already.
Hi @jcrist! Makes sense and we totally understand that's difficult to identify an error in such a prior release, so we've been trying to reproduce the same query in an isolated new venv using just plain Ibis (pip install ibis-framework[snowflake]==5.1.0), using the following code and DDL:
import ibis
ibis.options.interactive = True
con = ibis.snowflake.connect(
user="XXX",
password="XXX",
account="XXX",
database="pso_data_validator/public"
)
table = con.table("ISSUE1115")
r1 = table.select('COLUMN1', 'COLUMN2', 'COLUMN3')
r6 = r1.filter(table['LOAD_DT'] == ibis.now())
print(ibis.Expr._repr(r6))
create table PSO_DATA_VALIDATOR.PUBLIC.ISSUE1115 (
unique_key STRING,
column1 STRING,
column2 STRING,
column3 STRING,
load_dt DATETIME
);
And this is the output
r0 := AlchemyTable: ISSUE1115
UNIQUE_KEY string
COLUMN1 string
COLUMN2 string
COLUMN3 string
LOAD_DT timestamp
Selection[r0]
predicates:
r0.LOAD_DT == TimestampNow()
But I would like to get a CONCAT transformation similar to the one below and I'm not being able to get the right Python code to generate it, could you please help me with that?
r5 := Selection[r4]
selections:
r4
concat__all: StringJoin(sep='', [r4.column1, r4.column2, r4.column3])
Hi, any follow-up about helping us with the example Ibis code? Thanks!
Apologies for missing this - a string join like that would be done using the following ibis code:
ibis.literal("").join([r4.column1, r4.column2, r4.column3])
was that what you were asking?
Yes, that's what I needed, thank you! Now we're almost there on our attempt to replicate the same representation, we might have more questions soon
Hi @jcrist, happy Friday! Could you please help us to figure out what is the Ibis Python code needed to get a RawSQL predicate like this one below?
Selection[r5]
selections:
concat__all: r5.concat__all
unique_key: r5.unique_key
predicates:
RawSQL(r5.concat__all, 'date(load_dt)=current_date')
The initial code r6 = r5pt2.filter(table.LOAD_DT == ibis.now()) I have is generating the following
Selection[r2]
selections:
concat__all: r2.concat__all
predicates:
r0.LOAD_DT == TimestampNow()
And when I try something like this r6 = r5pt2.sql("SELECT * FROM ISSUE1115 WHERE date(LOAD_DT)=current_date") using SQL I get this error ValueError: Decimal type precision cannot be zero
Any help on how to provide a SQL expression as a filter for Snowflake will be super helpful, thanks in advance!
And @cpcloud, if you could please take a look at this scenario and provide any insights we would appreciate, thanks a lot!
Closing the issue as we figured out a code change inside DVT, thanks for the help!