
[SPARK-40508][SQL] Treat unknown partitioning as UnknownPartitioning

tedyu opened this pull request 2 years ago · 10 comments

What changes were proposed in this pull request?

When running a Spark application against Spark 3.3, I see the following:

java.lang.IllegalArgumentException: Unsupported data source V2 partitioning type: CustomPartitioning
    at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
    at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)

CustomPartitioning works fine with Spark 3.2.1. This PR proposes to relax the check and treat any unrecognized partitioning implementation the same way as UnknownPartitioning.
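
As a rough illustration only (a minimal sketch of the idea, not the actual patch; the real rule in V2ScanPartitioning does additional work for key-grouped partitioning), the relaxation amounts to falling back to the UnknownPartitioning handling instead of throwing:

    import org.apache.spark.sql.connector.expressions.Expression
    import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning, UnknownPartitioning}

    object PartitioningFallbackSketch {
      // Sketch: only key-grouped partitioning carries information Spark can use here;
      // anything else is treated like UnknownPartitioning instead of raising an error.
      def reportedKeys(p: Partitioning): Option[Array[Expression]] = p match {
        case kgp: KeyGroupedPartitioning => Some(kgp.keys())
        case _: UnknownPartitioning => None
        case _ =>
          // Previously: throw new IllegalArgumentException("Unsupported data source V2 partitioning type: ...")
          // Relaxed: same handling as UnknownPartitioning (a warning could be logged here).
          None
      }
    }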

Why are the changes needed?

The 3.3.0 release doesn't seem to warrant such a behavioral change from 3.2.1.

Does this PR introduce any user-facing change?

This would allow users' custom partitioning implementations to continue working with 3.3.x releases.

How was this patch tested?

Existing test suite.

tedyu avatar Sep 20 '22 16:09 tedyu

cc @sunchao

tedyu avatar Sep 20 '22 16:09 tedyu

@tedyu could you share some background information? If CustomPartitioning is handled the same way as UnknownPartitioning, why can't you just use the latter instead?

sunchao avatar Sep 20 '22 16:09 sunchao

If I subclass UnknownPartitioning directly, I would get this compilation error:

[error] /nfusr/dev-server/zyu/spark-cassandra-connector/connector/src/main/scala/com/datastax/spark/connector/datasource/CassandraScanBuilder.scala:327:92: not enough arguments for constructor UnknownPartitioning: (x$1: Int)org.apache.spark.sql.connector.read.partitioning.UnknownPartitioning.
[error] Unspecified value parameter x$1.
[error] case class CassandraPartitioning(partitionKeys: Array[String], numPartitions: Int) extends UnknownPartitioning {
[error]                                                                                            ^
[error] one error found
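
For reference, the error only means the parent constructor needs the partition count; a subclass that satisfies it would look roughly like this (untested sketch, assuming UnknownPartitioning's numPartitions() accessor may be overridden):

    import org.apache.spark.sql.connector.read.partitioning.UnknownPartitioning

    // Untested sketch: pass the partition count to UnknownPartitioning's constructor and
    // let the case-class field override the inherited numPartitions() accessor.
    case class CassandraPartitioning(partitionKeys: Array[String], override val numPartitions: Int)
      extends UnknownPartitioning(numPartitions)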

tedyu avatar Sep 20 '22 16:09 tedyu

Can you directly report UnknownPartitioning to Spark?
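
For context, a minimal sketch of what reporting it directly could look like (class and field names here are illustrative, not taken from the connector):

    import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
    import org.apache.spark.sql.connector.read.partitioning.{Partitioning, UnknownPartitioning}
    import org.apache.spark.sql.types.StructType

    // Illustrative scan that reports plain UnknownPartitioning instead of a custom subclass.
    class ExampleScan(schema: StructType, numPartitions: Int) extends Scan with SupportsReportPartitioning {
      override def readSchema(): StructType = schema
      override def outputPartitioning(): Partitioning = new UnknownPartitioning(numPartitions)
      // toBatch() and the rest of the Scan implementation are omitted for brevity.
    }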

sunchao avatar Sep 20 '22 16:09 sunchao

If a custom partitioning implementation can report UnknownPartitioning to Spark and keep the 3.2.1 behavior, that means the current strict check is not desirable.

tedyu avatar Sep 20 '22 16:09 tedyu

I have run the test using the Cassandra Spark connector and a Spark build with this patch applied.

The test passes without any modification to the Cassandra Spark connector or client code.

tedyu avatar Sep 20 '22 18:09 tedyu

I guess this PR makes sense. @tedyu could you:

  • create a Spark JIRA for this issue and update the PR title to reflect it?
  • add a warning message too? Clients may expect Spark to use the partitioning they reported and could be surprised that Spark silently ignores it, so a warning message would help them debug.

I think the best solution is for connectors such as the Cassandra connector to adopt the new API; otherwise they could see severe performance penalties.
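
For context, "the new API" here means reporting a key-grouped layout through KeyGroupedPartitioning, which Spark 3.3 can actually exploit (for example to avoid shuffles in storage-partitioned joins). An illustrative, untested sketch:

    import org.apache.spark.sql.connector.expressions.{Expression, Expressions}
    import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
    import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning}
    import org.apache.spark.sql.types.StructType

    // Illustrative scan that declares its data as grouped by the partition key columns.
    class ExampleKeyGroupedScan(schema: StructType, partitionKeys: Array[String], numParts: Int)
      extends Scan with SupportsReportPartitioning {

      override def readSchema(): StructType = schema

      override def outputPartitioning(): Partitioning = {
        val keys = partitionKeys.map(k => Expressions.identity(k): Expression)
        new KeyGroupedPartitioning(keys, numParts)
      }
    }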

sunchao avatar Sep 20 '22 19:09 sunchao

@sunchao Please take another look.

tedyu avatar Sep 20 '22 20:09 tedyu

@sunchao https://github.com/tedyu/spark/runs/8459534296 shows that all tests have passed.

tedyu avatar Sep 20 '22 23:09 tedyu

@sunchao Can this PR be merged?

tedyu avatar Sep 21 '22 15:09 tedyu

Thanks! Merged to master. @tedyu could you create a PR for branch-3.3 as well?

sunchao avatar Sep 21 '22 16:09 sunchao

Yea, good point. A unit test would be nice.

sunchao avatar Sep 21 '22 16:09 sunchao