spark [SPARK-49321][SQL] Bind JDBC dialect to JDBCRDD at construction

Registered dialects may differ between the driver and executors. Bind the dialect to the RDD on creation so that the same dialect is used regardless of JVM state.

What changes were proposed in this pull request?

Bind the resolved JDBC dialect to the JDBCRDD instance during construction.

Why are the changes needed?

While working on a Spark application, I created a custom JDBC dialect and registered it before creating a SparkSession. Spark did not use the dialect's JdbcSQLQueryBuilder, however, so I investigated further. I discovered that JDBCRDD repeats the dialect resolution, and that this second resolution runs on the executor.

From what I can understand, additional dialects registered by the driver will not be available in the executor when running in cluster mode. Consequently, I propose binding the resolved dialect to the JDBCRDD instance to produce deterministic behavior.
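
To illustrate the failure mode, here is a minimal, Spark-free sketch. Every name in it (`Dialect`, `DialectRegistry`, `LazyRdd`, `BoundRdd`, `Demo`) is a hypothetical stand-in rather than a Spark class, and each `DialectRegistry` instance merely models the per-JVM registration state. It assumes, as described above, that the executor JVM never sees the driver's registration.

```scala
// Spark-free model of the problem; all names here are hypothetical stand-ins.
trait Dialect extends Serializable {
  def canHandle(url: String): Boolean
  def buildQuery(table: String): String
}

// Fallback used when no registered dialect claims the URL.
case object DefaultDialect extends Dialect {
  def canHandle(url: String): Boolean = true
  def buildQuery(table: String): String = s"SELECT * FROM $table"
}

// Stand-in for a user-defined dialect with its own query builder.
case object CustomDialect extends Dialect {
  def canHandle(url: String): Boolean = url.startsWith("jdbc:custom:")
  def buildQuery(table: String): String = s"SELECT /* custom */ * FROM $table"
}

// One instance models the registration state of one JVM.
final class DialectRegistry {
  private var dialects: List[Dialect] = Nil
  def register(d: Dialect): Unit = { dialects = d :: dialects }
  def get(url: String): Dialect =
    dialects.find(_.canHandle(url)).getOrElse(DefaultDialect)
}

// Before the change: the dialect is resolved again in whichever JVM computes.
final case class LazyRdd(url: String, table: String) {
  def compute(jvm: DialectRegistry): String = jvm.get(url).buildQuery(table)
}

// After the change: the dialect resolved at construction travels with the RDD,
// so the registry of the computing JVM is deliberately ignored.
final case class BoundRdd(dialect: Dialect, table: String) {
  def compute(jvm: DialectRegistry): String = dialect.buildQuery(table)
}

object Demo {
  val driver = new DialectRegistry
  val executor = new DialectRegistry // separate JVM: nothing registered here
  driver.register(CustomDialect)

  val url = "jdbc:custom://host/db"
  // Resolved on the executor, so the custom dialect is missed.
  val lazyQuery: String = LazyRdd(url, "t").compute(executor)
  // Resolved on the driver, so the custom dialect is used on the executor too.
  val boundQuery: String = BoundRdd(driver.get(url), "t").compute(executor)
}
```

In this model `Demo.lazyQuery` falls back to the default query text, while `Demo.boundQuery` carries the custom dialect's query text. This is the difference the proposed change is meant to produce.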

Does this PR introduce any user-facing change?

No, aside from resolving a bug that some users may have encountered.

How was this patch tested?

I have not written tests, but I have manually tested the fix on a local Spark 3.5.1 cluster. Applying this change produced the desired behavior (i.e., the custom dialect's JdbcSQLQueryBuilder was used to generate queries from executors).

Was this patch authored or co-authored using generative AI tooling?

No

Mar 06 '24 18:03 johnnywalker

@johnnywalker Let's file a JIRA, see also https://spark.apache.org/contributing.html

@urosstan-db I will leave this to you to approve or not. cc @cloud-fan too

Aug 18 '24 07:08 HyukjinKwon

JIRA created: https://issues.apache.org/jira/browse/SPARK-49321

Aug 20 '24 13:08 johnnywalker

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Nov 29 '24 00:11 github-actions[bot]