
[SPARK-40508][SQL] Treat unknown partitioning as UnknownPartitioning

tedyu opened this pull request 2 years ago · 10 comments

What changes were proposed in this pull request?

When running a Spark application against Spark 3.3, I see the following:

java.lang.IllegalArgumentException: Unsupported data source V2 partitioning type: CustomPartitioning
    at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
    at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)

CustomPartitioning works fine with Spark 3.2.1. This PR proposes to relax the check and treat any unrecognized partitioning implementation the same way as UnknownPartitioning.
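
As a rough illustration only (a minimal sketch of the idea, not the actual patch; the real rule in V2ScanPartitioning does additional work for key-grouped partitioning), the relaxation amounts to falling back to the UnknownPartitioning handling instead of throwing:

    import org.apache.spark.sql.connector.expressions.Expression
    import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning, UnknownPartitioning}

    object PartitioningFallbackSketch {
      // Sketch: only key-grouped partitioning carries information Spark can use here;
      // anything else is treated like UnknownPartitioning instead of raising an error.
      def reportedKeys(p: Partitioning): Option[Array[Expression]] = p match {
        case kgp: KeyGroupedPartitioning => Some(kgp.keys())
        case _: UnknownPartitioning => None
        case _ =>
          // Previously: throw new IllegalArgumentException("Unsupported data source V2 partitioning type: ...")
          // Relaxed: same handling as UnknownPartitioning (a warning could be logged here).
          None
      }
    }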

Why are the changes needed?

The 3.3.0 release doesn't seem to warrant such a behavioral change from 3.2.1.

Does this PR introduce any user-facing change?

This would allow users' custom partitioning implementations to continue working with 3.3.x releases.

How was this patch tested?

Existing test suite.

tedyu avatar Sep 20 '22 16:09 tedyu

cc @sunchao

tedyu avatar Sep 20 '22 16:09 tedyu

@tedyu could you share some background information? If CustomPartitioning is handled the same way as UnknownPartitioning, why can't you just use the latter instead?

sunchao avatar Sep 20 '22 16:09 sunchao

If I subclass UnknownPartitioning directly, I would get this compilation error:

[error] /nfusr/dev-server/zyu/spark-cassandra-connector/connector/src/main/scala/com/datastax/spark/connector/datasource/CassandraScanBuilder.scala:327:92: not enough arguments for constructor UnknownPartitioning: (x$1: Int)org.apache.spark.sql.connector.read.partitioning.UnknownPartitioning.
[error] Unspecified value parameter x$1.
[error] case class CassandraPartitioning(partitionKeys: Array[String], numPartitions: Int) extends UnknownPartitioning {
[error]                                                                                            ^
[error] one error found
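
For reference, the error only means the parent constructor needs the partition count; a subclass that satisfies it would look roughly like this (untested sketch, assuming UnknownPartitioning's numPartitions() accessor may be overridden):

    import org.apache.spark.sql.connector.read.partitioning.UnknownPartitioning

    // Untested sketch: pass the partition count to UnknownPartitioning's constructor and
    // let the case-class field override the inherited numPartitions() accessor.
    case class CassandraPartitioning(partitionKeys: Array[String], override val numPartitions: Int)
      extends UnknownPartitioning(numPartitions)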

tedyu avatar Sep 20 '22 16:09 tedyu

Can you directly report UnknownPartitioning to Spark?
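
For context, a minimal sketch of what reporting it directly could look like (class and field names here are illustrative, not taken from the connector):

    import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
    import org.apache.spark.sql.connector.read.partitioning.{Partitioning, UnknownPartitioning}
    import org.apache.spark.sql.types.StructType

    // Illustrative scan that reports plain UnknownPartitioning instead of a custom subclass.
    class ExampleScan(schema: StructType, numPartitions: Int) extends Scan with SupportsReportPartitioning {
      override def readSchema(): StructType = schema
      override def outputPartitioning(): Partitioning = new UnknownPartitioning(numPartitions)
      // toBatch() and the rest of the Scan implementation are omitted for brevity.
    }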

sunchao avatar Sep 20 '22 16:09 sunchao

If a custom partitioning implementation can report UnknownPartitioning to Spark and keep the 3.2.1 behavior, that means the current strict check is not desirable.

tedyu avatar Sep 20 '22 16:09 tedyu

I have run the test using the Cassandra Spark connector and a Spark build with this patch applied.

The test passes without any modification to the Cassandra Spark connector or client code.

tedyu avatar Sep 20 '22 18:09 tedyu

I guess this PR makes sense. @tedyu could you:

  • create a Spark JIRA for this issue and update the PR title to reflect it?
  • add a warning message too? Clients may expect Spark to use the partitioning they reported and could be surprised that Spark silently ignores it, so a warning message would help them debug.

I think the best solution is for connectors such as the Cassandra connector to adopt the new API; otherwise they could see severe performance penalties.
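
For context, "the new API" here means reporting a key-grouped layout through KeyGroupedPartitioning, which Spark 3.3 can actually exploit (for example to avoid shuffles in storage-partitioned joins). An illustrative, untested sketch:

    import org.apache.spark.sql.connector.expressions.{Expression, Expressions}
    import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
    import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning}
    import org.apache.spark.sql.types.StructType

    // Illustrative scan that declares its data as grouped by the partition key columns.
    class ExampleKeyGroupedScan(schema: StructType, partitionKeys: Array[String], numParts: Int)
      extends Scan with SupportsReportPartitioning {

      override def readSchema(): StructType = schema

      override def outputPartitioning(): Partitioning = {
        val keys = partitionKeys.map(k => Expressions.identity(k): Expression)
        new KeyGroupedPartitioning(keys, numParts)
      }
    }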

sunchao avatar Sep 20 '22 19:09 sunchao

@sunchao Please take another look.

tedyu avatar Sep 20 '22 20:09 tedyu

@sunchao https://github.com/tedyu/spark/runs/8459534296 shows that all tests have passed.

tedyu avatar Sep 20 '22 23:09 tedyu

@sunchao Can this PR be merged?

tedyu avatar Sep 21 '22 15:09 tedyu

Thanks! Merged to master. @tedyu could you create a PR for branch-3.3 as well?

sunchao avatar Sep 21 '22 16:09 sunchao

Yea, good point. A unit test would be nice.

sunchao avatar Sep 21 '22 16:09 sunchao