spark-snowflake
"IsIn" pushdown for empty sequence generates invalid SQL
Hello.
When using a filter like `col("x").isin(empty_set)`, where the `empty_set` variable is an empty sequence, invalid SQL is generated, something similar to `select * from table1 where x in ()`.
Snowflake will throw an exception trying to parse that statement. It seems the correct approach would be to generate `false` in place of such a predicate, as Spark does for its JDBC connector - https://github.com/apache/spark/pull/15977/files (also for consistency reasons).
Best regards, Evgenii Ignatev.
Created a PR to demonstrate my suggestion - https://github.com/snowflakedb/spark-snowflake/pull/252
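Roughly, the idea is to special-case an empty value list when translating an `In` filter into SQL, similar in spirit to the linked Spark JDBC change. A minimal, hypothetical sketch (the object and function names here are illustrative only, not the connector's actual internals or the PR's code):

```scala
// Hypothetical sketch of an IN-filter translator with an empty-list special case.
object InFilterSketch {
  def compileInFilter(column: String, values: Seq[Any]): String =
    if (values.isEmpty) {
      // An empty IN list can never match, so emit a constant-false predicate
      // instead of the invalid `column IN ()` that Snowflake rejects.
      "1 = 0"
    } else {
      val list = values.map {
        case str: String => s"'${str.replace("'", "''")}'"
        case other       => other.toString
      }.mkString(", ")
      s"$column IN ($list)"
    }

  def main(args: Array[String]): Unit = {
    println(compileInFilter("x", Seq.empty))     // 1 = 0
    println(compileInFilter("x", Seq(1, 2, 3)))  // x IN (1, 2, 3)
    println(compileInFilter("s", Seq("a", "b"))) // s IN ('a', 'b')
  }
}
```

If SQL NULL semantics for the column need to be preserved, a never-true expression that still yields NULL for NULL inputs could be emitted instead; that detail is left out of this sketch.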
@YevIgn Can you provide an example of this? I tried:

```scala
val emptyList = Seq[String]()
tmpDF.select("value").filter(tmpDF.col("name").isin(emptyList: _*))
```

The generated SQL looks like this: `select * from test_table_in_set where 1 = 0`
@sfc-gh-zli Hello, the actual code is quite large and interconnected, but it can be outlined roughly as:

```python
from pyspark.sql import functions as F

col_vals_set = set()  # the set can be computed empty depending on the preceding code
df.filter(F.col("some_str_col").isin(col_vals_set)).collect()
```

We use PySpark, Python 3.7.5, Spark 2.3.2, spark-snowflake 2.7.0 (haven't tested with a newer version yet). I haven't checked whether it works with Scala code.
Also, in my example the empty set is passed directly, not a list.
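For reference, in Scala the closest analogue to passing the collection directly (rather than splatting it as varargs) is `isInCollection`, available since Spark 2.4. A small standalone sketch, assuming both forms build the same empty `In` expression, which can be checked by printing `Column.expr`:

```scala
import org.apache.spark.sql.functions.col

object EmptyIsInForms {
  def main(args: Array[String]): Unit = {
    // Varargs form, as in the Scala reproduction above.
    val byVarargs = col("some_str_col").isin(Seq.empty[String]: _*)
    // Collection form (Spark 2.4+), closest to passing a Python set directly.
    val byCollection = col("some_str_col").isInCollection(Seq.empty[String])

    // Both are expected to print an In expression with an empty value list.
    println(byVarargs.expr)
    println(byCollection.expr)
  }
}
```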
@sfc-gh-ema @sfc-gh-bli Snowflake doesn't support the `col IN ()` syntax, but it looks like Spark already optimizes `col IN ()` away, so the case the PR addresses may not happen in practice.
test("Test in empty") {
SnowflakeConnectorUtils.disablePushdownSession(sparkSession)
val loadedDf = sparkSession.read
.format(SNOWFLAKE_SOURCE_NAME)
.options(connectorOptionsNoTable)
.option("dbtable", s"$test_table")
.option("autopushdown", "off")
.load()
// .filter(col("i").isInCollection(Seq(2, 3)))
.filter(col("s").isNotNull and col("i").isInCollection(Seq.empty[Int]))
.sort("i")
loadedDf.show()
}
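The optimization can also be checked without a Snowflake connection by looking at what Catalyst leaves in the optimized plan for an empty IN list. A standalone sketch (using a plain local session and an in-memory DataFrame; the exact optimized expression may vary between Spark versions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object EmptyInPlanCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-in-plan-check")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("i", "s")

    // isin() with no values: Catalyst is expected to fold this predicate to a
    // constant, so no `IN ()` ever reaches a data source.
    val filtered = df.filter(col("s").isNotNull and col("i").isin())

    // Print the optimized plan to see what a connector would actually receive.
    println(filtered.queryExecution.optimizedPlan)
    filtered.show()

    spark.stop()
  }
}
```

On recent Spark versions the empty-IN predicate should be folded to a constant (or dropped entirely), which matches the constant-false queries mentioned elsewhere in this thread.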
@emlyn Thanks for reporting this issue. We tried to reproduce it, but it looks like Spark has already optimized `col IN ()` into `where 0 = 1`, so the connector may not need to handle it. Could you please retry and see whether it works for you?