ibis icon indicating copy to clipboard operation
ibis copied to clipboard

bug: `Table.filter()` returning an invalid Ibis expression for Snowflake

Open helensilva14 opened this issue 1 year ago • 8 comments

What happened?

Hi Ibis team! Our DVT team encountered an unexpected behavior for Ibis while investigating the issue https://github.com/GoogleCloudPlatform/professional-services-data-validator/issues/1115 regarding filtering not working for Snowflake, but working normally for other backends such as BigQuery and Oracle.

Given the same filter predicate for Snowflake and BigQuery, the filter method (used here in our code / Ibis code line) returning different outputs: SF splitting the predicate as a separate selection (r6) when it should be together with the fields (r5) as it happens for BQ (r5).

Could you please help us understand if this is a bug with Ibis v5.1.0 or if not, how can we better investigate it?

Snowflake output

(Pdb) filtered_table
r0 := AlchemyTable: ISSUE1115
  UNIQUE_KEY string
  COLUMN1    string
  COLUMN2    string
  COLUMN3    string
  LOAD_DT    timestamp

r1 := Selection[r0]
  selections:
    r0
    cast__column1: r0.COLUMN1
    cast__column2: r0.COLUMN2
    cast__column3: r0.COLUMN3

r2 := Selection[r1]
  selections:
    r1
    ifnull__cast__column1: IfNull(r1.cast__column1, ifnull_expr='DEFAULT_REPLACEMENT_STRING')
    ifnull__cast__column2: IfNull(r1.cast__column2, ifnull_expr='DEFAULT_REPLACEMENT_STRING')
    ifnull__cast__column3: IfNull(r1.cast__column3, ifnull_expr='DEFAULT_REPLACEMENT_STRING')

r3 := Selection[r2]
  selections:
    r2
    rstrip__ifnull__cast__column1: RStrip(r2.ifnull__cast__column1)
    rstrip__ifnull__cast__column2: RStrip(r2.ifnull__cast__column2)
    rstrip__ifnull__cast__column3: RStrip(r2.ifnull__cast__column3)

r4 := Selection[r3]
  selections:
    r3
    upper__rstrip__ifnull__cast__column1: Uppercase(r3.rstrip__ifnull__cast__column1)
    upper__rstrip__ifnull__cast__column2: Uppercase(r3.rstrip__ifnull__cast__column2)
    upper__rstrip__ifnull__cast__column3: Uppercase(r3.rstrip__ifnull__cast__column3)

r5 := Selection[r4]
  selections:
    r4
    concat__all: StringJoin(sep='', [r4.upper__rstrip__ifnull__cast__column1, r4.upper__rstrip__ifnull__cast__column2, r4.upper__rstrip__ifnull__cast__column3])

r6 := Selection[r5]
  selections:
    concat__all: r5.concat__all
    unique_key:  r5.UNIQUE_KEY

Selection[r6]
  predicates:
    RawSQL(r6.concat__all, 'date(LOAD_DT)=current_date')

BigQuery output

(Pdb) filtered_table
r0 := BigQueryTable: pso-kokoro-resources.pso_data_validator.issue1115
  unique_key string
  column1    string
  column2    string
  column3    string
  load_dt    timestamp

r1 := Selection[r0]
  selections:
    r0
    cast__column1:                                     r0.column1
    cast__column2:                                     r0.column2
    cast__column3:                                     r0.column3

r2 := Selection[r1]
  selections:
    r1
    ifnull__cast__column1: IfNull(r1.cast__column1, ifnull_expr='DEFAULT_REPLACEMENT_STRING')
    ifnull__cast__column2: IfNull(r1.cast__column2, ifnull_expr='DEFAULT_REPLACEMENT_STRING')
    ifnull__cast__column3: IfNull(r1.cast__column3, ifnull_expr='DEFAULT_REPLACEMENT_STRING')

r3 := Selection[r2]
  selections:
    r2
    rstrip__ifnull__cast__column1: RStrip(r2.ifnull__cast__column1)
    rstrip__ifnull__cast__column2: RStrip(r2.ifnull__cast__column2)
    rstrip__ifnull__cast__column3: RStrip(r2.ifnull__cast__column3)

r4 := Selection[r3]
  selections:
    r3
    upper__rstrip__ifnull__cast__column1: Uppercase(r3.rstrip__ifnull__cast__column1)
    upper__rstrip__ifnull__cast__column2: Uppercase(r3.rstrip__ifnull__cast__column2)
    upper__rstrip__ifnull__cast__column3: Uppercase(r3.rstrip__ifnull__cast__column3)

r5 := Selection[r4]
  selections:
    r4
    concat__all: StringJoin(sep='', [r4.upper__rstrip__ifnull__cast__column1, r4.upper__rstrip__ifnull__cast__column2, r4.upper__rstrip__ifnull__cast__column3])

Selection[r5]
  selections:
    concat__all: r5.concat__all
    unique_key:  r5.unique_key
  predicates:
    RawSQL(r5.concat__all, 'date(load_dt)=current_date')

Thank you in advance!

What version of ibis are you using?

5.1.0

What backend(s) are you using, if any?

Snowflake - having the issue

BigQuery - generating as expected

Relevant log output

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

helensilva14 avatar Mar 19 '24 19:03 helensilva14

Snowflake error

03/19/2024 07:29:10 PM-ERROR: Error (snowflake.connector.errors.ProgrammingError) 000904 (42000): SQL compilation error: error line 19 at position 11
invalid identifier 'LOAD_DT'
[SQL: WITH t0 AS 
(SELECT t6."UNIQUE_KEY" AS "UNIQUE_KEY", t6."COLUMN1" AS "COLUMN1", t6."COLUMN2" AS "COLUMN2", t6."COLUMN3" AS "COLUMN3", t6."LOAD_DT" AS "LOAD_DT", t6."COLUMN1" AS "cast__column1", t6."COLUMN2" AS "cast__column2", t6."COLUMN3" AS "cast__column3" 
FROM pso_data_validator.public."ISSUE1115" AS t6), 
t1 AS 
(SELECT t0."UNIQUE_KEY" AS "UNIQUE_KEY", t0."COLUMN1" AS "COLUMN1", t0."COLUMN2" AS "COLUMN2", t0."COLUMN3" AS "COLUMN3", t0."LOAD_DT" AS "LOAD_DT", t0."cast__column1" AS "cast__column1", t0."cast__column2" AS "cast__column2", t0."cast__column3" AS "cast__column3", coalesce(t0."cast__column1", %(param_2)s) AS "ifnull__cast__column1", coalesce(t0."cast__column2", %(param_3)s) AS "ifnull__cast__column2", coalesce(t0."cast__column3", %(param_4)s) AS "ifnull__cast__column3" 
FROM t0), 
t2 AS 
(SELECT t1."UNIQUE_KEY" AS "UNIQUE_KEY", t1."COLUMN1" AS "COLUMN1", t1."COLUMN2" AS "COLUMN2", t1."COLUMN3" AS "COLUMN3", t1."LOAD_DT" AS "LOAD_DT", t1."cast__column1" AS "cast__column1", t1."cast__column2" AS "cast__column2", t1."cast__column3" AS "cast__column3", t1."ifnull__cast__column1" AS "ifnull__cast__column1", t1."ifnull__cast__column2" AS "ifnull__cast__column2", t1."ifnull__cast__column3" AS "ifnull__cast__column3", rtrim(t1."ifnull__cast__column1", %(rtrim_1)s) AS "rstrip__ifnull__cast__column1", rtrim(t1."ifnull__cast__column2", %(rtrim_2)s) AS "rstrip__ifnull__cast__column2", rtrim(t1."ifnull__cast__column3", %(rtrim_3)s) AS "rstrip__ifnull__cast__column3" 
FROM t1), 
t3 AS 
(SELECT t2."UNIQUE_KEY" AS "UNIQUE_KEY", t2."COLUMN1" AS "COLUMN1", t2."COLUMN2" AS "COLUMN2", t2."COLUMN3" AS "COLUMN3", t2."LOAD_DT" AS "LOAD_DT", t2."cast__column1" AS "cast__column1", t2."cast__column2" AS "cast__column2", t2."cast__column3" AS "cast__column3", t2."ifnull__cast__column1" AS "ifnull__cast__column1", t2."ifnull__cast__column2" AS "ifnull__cast__column2", t2."ifnull__cast__column3" AS "ifnull__cast__column3", t2."rstrip__ifnull__cast__column1" AS "rstrip__ifnull__cast__column1", t2."rstrip__ifnull__cast__column2" AS "rstrip__ifnull__cast__column2", t2."rstrip__ifnull__cast__column3" AS "rstrip__ifnull__cast__column3", upper(t2."rstrip__ifnull__cast__column1") AS "upper__rstrip__ifnull__cast__column1", upper(t2."rstrip__ifnull__cast__column2") AS "upper__rstrip__ifnull__cast__column2", upper(t2."rstrip__ifnull__cast__column3") AS "upper__rstrip__ifnull__cast__column3" 
FROM t2), 
t4 AS 
(SELECT t3."UNIQUE_KEY" AS "UNIQUE_KEY", t3."COLUMN1" AS "COLUMN1", t3."COLUMN2" AS "COLUMN2", t3."COLUMN3" AS "COLUMN3", t3."LOAD_DT" AS "LOAD_DT", t3."cast__column1" AS "cast__column1", t3."cast__column2" AS "cast__column2", t3."cast__column3" AS "cast__column3", t3."ifnull__cast__column1" AS "ifnull__cast__column1", t3."ifnull__cast__column2" AS "ifnull__cast__column2", t3."ifnull__cast__column3" AS "ifnull__cast__column3", t3."rstrip__ifnull__cast__column1" AS "rstrip__ifnull__cast__column1", t3."rstrip__ifnull__cast__column2" AS "rstrip__ifnull__cast__column2", t3."rstrip__ifnull__cast__column3" AS "rstrip__ifnull__cast__column3", t3."upper__rstrip__ifnull__cast__column1" AS "upper__rstrip__ifnull__cast__column1", t3."upper__rstrip__ifnull__cast__column2" AS "upper__rstrip__ifnull__cast__column2", t3."upper__rstrip__ifnull__cast__column3" AS "upper__rstrip__ifnull__cast__column3", concat_ws(%(param_1)s, t3."upper__rstrip__ifnull__cast__column1", t3."upper__rstrip__ifnull__cast__column2", t3."upper__rstrip__ifnull__cast__column3") AS "concat__all" 
FROM t3)
 SELECT t5."concat__all", t5."unique_key" 
FROM (SELECT t4."concat__all" AS "concat__all", t4."UNIQUE_KEY" AS "unique_key" 
FROM t4) AS t5 
WHERE date(LOAD_DT)=current_date]
[parameters: {'param_1': '', 'rtrim_1': ' \t\n\r\x0b\x0c', 'rtrim_2': ' \t\n\r\x0b\x0c', 'rtrim_3': ' \t\n\r\x0b\x0c', 'param_2': 'DEFAULT_REPLACEMENT_STRING', 'param_3': 'DEFAULT_REPLACEMENT_STRING', 'param_4': 'DEFAULT_REPLACEMENT_STRING'}]

helensilva14 avatar Mar 19 '24 19:03 helensilva14

Hmmmm, without access to the original ibis expression, it's hard to say whether this has been fixed since ibis 5.1. Given that it works on other backends it's likely that it's a bug in ibis's snowflake backend, but beyond that I can't say more. Ibis 5.0 is a fairly old release (we're about to release ibis 9.0), so it's likely (but not certain) that the issue has been fixed already.

jcrist avatar Mar 19 '24 19:03 jcrist

Hi @jcrist! Makes sense and we totally understand that's difficult to identify an error in such a prior release, so we've been trying to reproduce the same query in an isolated new venv using just plain Ibis (pip install ibis-framework[snowflake]==5.1.0), using the following code and DDL:

import ibis

ibis.options.interactive = True

con = ibis.snowflake.connect(
    user="XXX",
    password="XXX",
    account="XXX",
    database="pso_data_validator/public"
)

table = con.table("ISSUE1115")

r1 = table.select('COLUMN1', 'COLUMN2', 'COLUMN3')
r6 = r1.filter(table['LOAD_DT'] == ibis.now())
print(ibis.Expr._repr(r6))
create table PSO_DATA_VALIDATOR.PUBLIC.ISSUE1115 (
   unique_key STRING,
   column1 STRING,
   column2 STRING,
   column3 STRING,
   load_dt DATETIME
);

And this is the output

r0 := AlchemyTable: ISSUE1115
  UNIQUE_KEY string
  COLUMN1    string
  COLUMN2    string
  COLUMN3    string
  LOAD_DT    timestamp

Selection[r0]
  predicates:
    r0.LOAD_DT == TimestampNow()

But I would like to get a CONCAT transformation similar to the one below and I'm not being able to get the right Python code to generate it, could you please help me with that?

r5 := Selection[r4]
  selections:
    r4
    concat__all: StringJoin(sep='', [r4.column1, r4.column2, r4.column3])

helensilva14 avatar Mar 21 '24 20:03 helensilva14

Hi, any follow-up about helping us with the example Ibis code? Thanks!

helensilva14 avatar Mar 27 '24 15:03 helensilva14

Apologies for missing this - a string join like that would be done using the following ibis code:

ibis.literal("").join([r4.column1, r4.column2, r4.column3])

was that what you were asking?

jcrist avatar Mar 27 '24 15:03 jcrist

Yes, that's what I needed, thank you! Now we're almost there on our attempt to replicate the same representation, we might have more questions soon

helensilva14 avatar Mar 27 '24 22:03 helensilva14

Hi @jcrist, happy Friday! Could you please help us to figure out what is the Ibis Python code needed to get a RawSQL predicate like this one below?

Selection[r5]
  selections:
    concat__all: r5.concat__all
    unique_key:  r5.unique_key
  predicates:
    RawSQL(r5.concat__all, 'date(load_dt)=current_date')

The initial code r6 = r5pt2.filter(table.LOAD_DT == ibis.now()) I have is generating the following

Selection[r2]
  selections:
    concat__all: r2.concat__all
  predicates:
    r0.LOAD_DT == TimestampNow()

And when I try something like this r6 = r5pt2.sql("SELECT * FROM ISSUE1115 WHERE date(LOAD_DT)=current_date") using SQL I get this error ValueError: Decimal type precision cannot be zero

Any help on how to provide a SQL expression as a filter for Snowflake will be super helpful, thanks in advance!

helensilva14 avatar Apr 12 '24 16:04 helensilva14

And @cpcloud, if you could please take a look at this scenario and provide any insights we would appreciate, thanks a lot!

helensilva14 avatar Apr 12 '24 16:04 helensilva14

Closing the issue as we figured out a code change inside DVT, thanks for the help!

helensilva14 avatar Apr 29 '24 17:04 helensilva14