[SPARK-40508][SQL] Treat unknown partitioning as UnknownPartitioning
What changes were proposed in this pull request?
When running a Spark application against Spark 3.3, I see the following:
java.lang.IllegalArgumentException: Unsupported data source V2 partitioning type: CustomPartitioning
at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
The CustomPartitioning works fine with Spark 3.2.1.
This PR proposes to relax the check and treat all unknown partitioning types the same way as UnknownPartitioning.
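For illustration, here is a minimal sketch of the proposed relaxation. The match shape and the translateKeys helper are assumptions for the sketch, not the exact V2ScanPartitioning code:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning, UnknownPartitioning}

// Sketch of the relaxed translation of a DSv2 Partitioning; only the imported
// names come from Spark, the rest is illustrative.
object RelaxedPartitioningSketch {
  def toCatalystPartitioning(p: Partitioning): Option[Seq[Expression]] = p match {
    case kgp: KeyGroupedPartitioning =>
      translateKeys(kgp) // stand-in for Spark's actual key translation
    case _: UnknownPartitioning =>
      None // already tolerated: Spark simply learns nothing about the layout
    case _ =>
      None // proposed: fall through like UnknownPartitioning instead of throwing
  }

  // Hypothetical helper; the real key translation is not the focus of this PR.
  private def translateKeys(kgp: KeyGroupedPartitioning): Option[Seq[Expression]] = None
}
```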
Why are the changes needed?
The 3.3.0 release doesn't seem to warrant such a behavioral change from the 3.2.1 release.
Does this PR introduce any user-facing change?
This allows users' custom partitioning implementations to continue working with 3.3.x releases.
How was this patch tested?
Existing test suite.
cc @sunchao
@tedyu could you share some background information? If CustomPartitioning is handled the same way as UnknownPartitioning, why can't you just use the latter instead?
If I subclass UnknownPartitioning directly, I would get this compilation error:
[error] /nfusr/dev-server/zyu/spark-cassandra-connector/connector/src/main/scala/com/datastax/spark/connector/datasource/CassandraScanBuilder.scala:327:92: not enough arguments for constructor UnknownPartitioning: (x$1: Int)org.apache.spark.sql.connector.read.partitioning.UnknownPartitioning.
[error] Unspecified value parameter x$1.
[error] case class CassandraPartitioning(partitionKeys: Array[String], numPartitions: Int) extends UnknownPartitioning {
[error] ^
[error] one error found
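For reference, the error is only about the missing constructor argument; a hedged sketch of a variant that would compile:

```scala
import org.apache.spark.sql.connector.read.partitioning.UnknownPartitioning

// Sketch only: forwarding the partition count to UnknownPartitioning's constructor
// avoids the "not enough arguments" error above. The second parameter is renamed
// from numPartitions to avoid clashing with UnknownPartitioning.numPartitions().
case class CassandraPartitioning(partitionKeys: Array[String], parts: Int)
  extends UnknownPartitioning(parts)
```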
Can you directly report UnknownPartitioning to Spark?
If custom partitioning reports UnknownPartitioning to Spark and can keep the 3.2.1 behavior, that means the current check is not desired.
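For example, a connector could report it directly from its scan. This is a generic sketch, not the Cassandra connector's actual code:

```scala
import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
import org.apache.spark.sql.connector.read.partitioning.{Partitioning, UnknownPartitioning}
import org.apache.spark.sql.types.StructType

// Generic sketch: a scan that reports UnknownPartitioning directly instead of
// subclassing it. The class name, schema, and partition count are illustrative.
class ExampleScan(schema: StructType, numPartitions: Int)
    extends Scan with SupportsReportPartitioning {
  override def readSchema(): StructType = schema
  override def outputPartitioning(): Partitioning =
    new UnknownPartitioning(numPartitions)
}
```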
I have run the test using the Cassandra Spark connector and a modified Spark build (with this patch).
The test passes without modification to the Cassandra Spark connector or client code.
I guess this PR makes sense. @tedyu could you:
- create a Spark JIRA for this issue and update the PR title to reflect it?
- add a warning message too? Clients may expect Spark to use the partitioning they report and could be surprised that Spark internally ignores it, so a warning message would help them debug (a sketch follows below).
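A hedged sketch of such a warning in the fallback case; the message wording and logger setup are illustrative (inside Spark the rule's own logWarning would be used):

```scala
import org.slf4j.LoggerFactory
import org.apache.spark.sql.connector.read.partitioning.Partitioning

// Illustrative only: what the fallback case could log before returning None.
object FallbackWarningSketch {
  private val log = LoggerFactory.getLogger(getClass)

  def fallback(p: Partitioning): Option[Nothing] = {
    log.warn(s"Spark ignores reported partitioning ${p.getClass.getSimpleName}; " +
      "treating it as UnknownPartitioning")
    None
  }
}
```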
I think the best solution is for connectors such as Cassandra to adopt the new API, otherwise they could see severe performance penalties.
@sunchao Please take another look.
@sunchao https://github.com/tedyu/spark/runs/8459534296 shows that all tests have passed.
@sunchao Can this PR be merged?
Thanks! merged to master. @tedyu could you create a PR for branch-3.3 as well?
Yeah, good point. A unit test would be nice.