spark-snowflake
"IsIn" pushdown for empty sequence generates invalid SQL
Hello.
When using a filter like `col("x").isin(empty_set)`, where the `empty_set` variable is an empty sequence, invalid SQL is generated, something similar to `select * from table1 where x in ()`.
Snowflake will throw an exception trying to parse that statement. It seems the correct approach would be to generate `false` in place of such a predicate, as Spark does for its JDBC connector - https://github.com/apache/spark/pull/15977/files (also for consistency reasons).
Best regards, Evgenii Ignatev.
Created a PR to demonstrate my suggestion - https://github.com/snowflakedb/spark-snowflake/pull/252
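Roughly, the idea is to special-case an empty value list when translating an `In` filter into SQL, similar in spirit to the linked Spark JDBC change. A minimal, hypothetical sketch (the object and function names here are illustrative only, not the connector's actual internals or the PR's code):

```scala
// Hypothetical sketch of an IN-filter translator with an empty-list special case.
object InFilterSketch {
  def compileInFilter(column: String, values: Seq[Any]): String =
    if (values.isEmpty) {
      // An empty IN list can never match, so emit a constant-false predicate
      // instead of the invalid `column IN ()` that Snowflake rejects.
      "1 = 0"
    } else {
      val list = values.map {
        case str: String => s"'${str.replace("'", "''")}'"
        case other       => other.toString
      }.mkString(", ")
      s"$column IN ($list)"
    }

  def main(args: Array[String]): Unit = {
    println(compileInFilter("x", Seq.empty))     // 1 = 0
    println(compileInFilter("x", Seq(1, 2, 3)))  // x IN (1, 2, 3)
    println(compileInFilter("s", Seq("a", "b"))) // s IN ('a', 'b')
  }
}
```

If SQL NULL semantics for the column need to be preserved, a never-true expression that still yields NULL for NULL inputs could be emitted instead; that detail is left out of this sketch.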
@YevIgn Can you provide an example of this? I tried:

```scala
val emptyList = Seq[String]()
tmpDF.select("value").filter(tmpDF.col("name").isin(emptyList: _*))
```

The generated SQL looks like this: `select * from test_table_in_set where 1 = 0`
@sfc-gh-zli Hello, the actual code is quite large and interconnected, but it can be outlined roughly as:

```python
from pyspark.sql import functions as F

col_vals_set = set()  # the set can be computed empty depending on the preceding code
df.filter(F.col("some_str_col").isin(col_vals_set)).collect()
```

We use PySpark, Python 3.7.5, Spark 2.3.2, spark-snowflake 2.7.0 (haven't tested with a newer version yet). I haven't checked whether it works with Scala code.
Also, in my example the empty set is passed directly, not a list.
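For reference, in Scala the closest analogue to passing the collection directly (rather than splatting it as varargs) is `isInCollection`, available since Spark 2.4. A small standalone sketch, assuming both forms build the same empty `In` expression, which can be checked by printing `Column.expr`:

```scala
import org.apache.spark.sql.functions.col

object EmptyIsInForms {
  def main(args: Array[String]): Unit = {
    // Varargs form, as in the Scala reproduction above.
    val byVarargs = col("some_str_col").isin(Seq.empty[String]: _*)
    // Collection form (Spark 2.4+), closest to passing a Python set directly.
    val byCollection = col("some_str_col").isInCollection(Seq.empty[String])

    // Both are expected to print an In expression with an empty value list.
    println(byVarargs.expr)
    println(byCollection.expr)
  }
}
```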
@sfc-gh-ema @sfc-gh-bli Snowflake doesn't support the `col IN ()` syntax, but it looks like Spark already optimizes `col IN ()` away, so the case the PR addresses may not happen in practice.
test("Test in empty") {
SnowflakeConnectorUtils.disablePushdownSession(sparkSession)
val loadedDf = sparkSession.read
.format(SNOWFLAKE_SOURCE_NAME)
.options(connectorOptionsNoTable)
.option("dbtable", s"$test_table")
.option("autopushdown", "off")
.load()
// .filter(col("i").isInCollection(Seq(2, 3)))
.filter(col("s").isNotNull and col("i").isInCollection(Seq.empty[Int]))
.sort("i")
loadedDf.show()
}
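The optimization can also be checked without a Snowflake connection by looking at what Catalyst leaves in the optimized plan for an empty IN list. A standalone sketch (using a plain local session and an in-memory DataFrame; the exact optimized expression may vary between Spark versions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object EmptyInPlanCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-in-plan-check")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("i", "s")

    // isin() with no values: Catalyst is expected to fold this predicate to a
    // constant, so no `IN ()` ever reaches a data source.
    val filtered = df.filter(col("s").isNotNull and col("i").isin())

    // Print the optimized plan to see what a connector would actually receive.
    println(filtered.queryExecution.optimizedPlan)
    filtered.show()

    spark.stop()
  }
}
```

On recent Spark versions the empty-IN predicate should be folded to a constant (or dropped entirely), which matches the constant-false queries mentioned elsewhere in this thread.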
@emlyn Thanks for reporting this issue. We tried to reproduce it, but it looks like Spark has already optimized `col IN ()` into `where 0 = 1`, so the connector may not need to handle it. Could you please retry and see whether it works for you?