
[SPARK-40485][SQL] Extend the partitioning options of the JDBC data source

Open LucaCanali opened this issue 2 years ago • 1 comments

What changes were proposed in this pull request?

This PR proposes extending the partitioning options available for the JDBC data source.

Why are the changes needed?

Partitioning options make it possible to read data using multiple workers connected to the target RDBMS. Under the right circumstances, this can improve the performance of data extraction.

Currently, the only partitioning and parallelization option for reading from databases is to specify lowerBound and upperBound together with numPartitions and partitionColumn. The Spark JDBC data source then uses multiple partitions, and thus multiple workers, to read from the RDBMS.
This PR proposes a similar but complementary partitioning mechanism, in which a user-provided list of values is used to compute the target partitions.
This makes it possible to split the data extraction work among workers in a way that is aligned with the physical structure of the database (partitioned and/or indexed), as in the following example:

option("partitionColumn", "region").
option("numPartitions", 3).
option("partitionColValues", "'eastern', 'central', 'western'").  
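The mapping from a value list to partitions can be sketched as follows. This is only an illustration under stated assumptions, not the PR's implementation: the helper name `predicates_from_values`, the naive comma splitting, and the round-robin grouping used when there are more values than partitions are all hypothetical.

```python
def predicates_from_values(column, values, num_partitions):
    """Build one WHERE-clause predicate per partition from a string of SQL literals.

    Hypothetical helper: the parsing and grouping rules are assumptions,
    not the behavior of the proposed partitionColValues option.
    """
    # Naive split: does not handle commas inside quoted values.
    literals = [v.strip() for v in values.split(",") if v.strip()]
    if not literals:
        raise ValueError("no partition values given")
    # Never create more partitions than there are values.
    n = max(1, min(num_partitions, len(literals)))
    # Round-robin assignment of values to partitions (an assumption).
    groups = [literals[i::n] for i in range(n)]
    return [f"{column} IN ({', '.join(group)})" for group in groups]


predicates = predicates_from_values(
    "region", "'eastern', 'central', 'western'", 3
)
for p in predicates:
    print(p)
```

A list of predicates like this can already be passed to the existing `DataFrameReader.jdbc(url, table, predicates=..., properties=...)` overload, which creates one partition per predicate. Note that in such a scheme, rows whose column value is NULL or absent from the list would not be read.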

This feature is motivated by performance: it aims to scale and speed up data extraction from:

  • list-partitioned tables, available in Oracle and PostgreSQL
  • tables stored in B*Tree indexes, such as Oracle's IOTs (Index Organized Tables) and SQL Server's clustered indexes

Does this PR introduce any user-facing change?

Yes, this adds the option "partitionColValues" to the JDBC data source.

How was this patch tested?

Added tests to JDBCSuite and JDBCV2Suite. Also manually tested against list-partitioned tables in Oracle.

LucaCanali, Sep 19 '22 08:09

Can one of the admins verify this patch?

AmplabJenkins, Sep 19 '22 19:09

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot], Jul 27 '23 00:07