spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-36680][SQL] Supports Dynamic Table Options for Spark SQL

Open szehon-ho opened this issue 1 year ago • 7 comments

What changes were proposed in this pull request?

in Spark SQL, add 'WITH OPTIONS' syntax to support dynamic relation options.

This is a continuation of https://github.com/apache/spark/pull/41683 based on @cloud-fan's nice suggestion. That was itself a continuation of https://github.com/apache/spark/pull/34072.

Why are the changes needed?

This will allow Spark SQL to have equivalence to DataFrameReader API. For example, it is possible to specify options today to DataSources as follows via the API:

 spark.read.format("jdbc").option("fetchSize", 0).load()

This pr allows an equivalent Spark SQL syntax to specify options:

SELECT * FROM jdbcTable WITH OPTIONS(`fetchSize` = 0)

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test in DataSourceV2SQLSuite

Was this patch authored or co-authored using generative AI tooling?

No

szehon-ho avatar May 22 '24 20:05 szehon-ho

cc @huaxingao @cloud-fan could you guys take a look? Thanks @cloud-fan for the suggestion to use WITH syntax, it makes the logic much simpler.

szehon-ho avatar May 23 '24 00:05 szehon-ho

Fixed the review items, thanks @huaxingao , if you want to take another look

szehon-ho avatar May 24 '24 00:05 szehon-ho

What's the dynamic table option? It seems static.

beliefer avatar Jun 03 '24 07:06 beliefer

@szehon-ho Since propertyList is not optional any more, shall we remove the check if (clause.options == null) in createUnresolvedRelation, and also remove the changes in error-conditions.json and QueryCompilationErrors?

huaxingao avatar Jun 03 '24 17:06 huaxingao

What's the dynamic table option? It seems static.

I got you mean now.

beliefer avatar Jun 04 '24 11:06 beliefer

@beliefer i didnt make the initial JIRA title :) , let me know if something makes more sense.

szehon-ho avatar Jun 04 '24 16:06 szehon-ho

@cloud-fan i made some changes to update the PR description and change back to WITH OPTIONS as @huaxingao mentioned to me. We can look at write options subsequently.

Did we want to move forward with this, or are there some new suggestions ? Thanks

szehon-ho avatar Jun 10 '24 18:06 szehon-ho

Thank you all for your thoughtful comments and suggestions. I appreciate the discussion around leveraging TVF-like syntax, and I understand the potential benefits of using a more SQL-standard approach. However, I would like to highlight why I believe the WITH OPTIONS syntax remains the most beneficial for our current goals:

  1. Simplicity and Accessibility: The WITH OPTIONS syntax is straightforward and directly communicates the intent of applying specific options to the SQL query. It's an intuitive approach that can be easily understood by users without extensive SQL background.

  2. Separation of Concerns: Using WITH OPTIONS cleanly separates the functional part of the SQL query from the configuration parameters. This clarity improves code readability and maintenance, as opposed to embedding options within a TVF syntax that could obscure the primary purpose of the query.

  3. Avoiding Over-complexity: While TVF-like syntax offers elegance, it introduces a level of complexity that may not be necessary. For instance, distinguishing between a table and a TVF in queries, as mentioned, can confuse users and complicate the parser's job. Our proposed syntax avoids this by keeping the query structure familiar and straightforward.

  4. Focus on Practical Use Cases: The WITH OPTIONS syntax is specifically designed to handle practical scenarios where users might need to dynamically configure queries based on different execution contexts, such as adjusting fetchSize for performance tuning. This approach is particularly beneficial in environments where such configurations are frequently modified.

I propose we continue with the WITH OPTIONS syntax for its clarity, ease of use, and flexibility. I would like to move this PR forward. 🙂

huaxingao avatar Jul 04 '24 17:07 huaxingao

only one comment: https://github.com/apache/spark/pull/46707/files#r1669539861 , I'm also fine to just remove OPTIONS.

cloud-fan avatar Jul 09 '24 02:07 cloud-fan

Thanks, removed OPTIONS and other review comments. Can add OPTIONS as optional in subsequent pr.

szehon-ho avatar Jul 09 '24 22:07 szehon-ho

thanks, merging to master!

cloud-fan avatar Jul 10 '24 04:07 cloud-fan

Thanks @szehon-ho for this! This is really useful for SELECT statements. However we also have UPDATE and DELETE statements which can lead to a scans. Should we also add also add support for options in UPDATE and DELETE? Would be happy to raise a PR.

shardulm94 avatar Jul 11 '24 01:07 shardulm94

@shardulm94 feel free to raise PRs! I think it makes sense to support DML as the DataFrameWriter API can also specify options

cloud-fan avatar Jul 11 '24 02:07 cloud-fan

sounds good, thanks! i was planning to do follow up for write (INSERT) and can look at any other missing ones too.

szehon-ho avatar Jul 11 '24 06:07 szehon-ho

Thanks @szehon-ho! I think there may be an interesting discussion here. E.g. consider the DELETE statement with some options passed

DELETE FROM tbl WITH ('split-size' = 5)
WHERE id > 10

What do these options apply to? The scan from the table, or the write to the table?

  1. We can pass the options to both the read and the write, and let the datasource disambiguate. I don't think this is great because there will likely be a case in the future whether the same option with different values would need to be passed to read and write.
  2. We could create some syntax disambiguating reads v/s writes. E.g. WITH READ OPTIONS('a', 'b') WRITE OPTIONS ('c', 'd'). I prefer this approach, I think this is more clear for the user, but it goes against previous discussions in this PR.

Other ideas welcome. Same scenario applies to UPDATE which also has single table identifier to read from and write to. MERGE and INSERT have separate source and target identifiers and so you could specify options separately per identifier.

shardulm94 avatar Jul 11 '24 07:07 shardulm94

Yea its a good point, I would like WITH READ and WITH WRITE options as its most clear, but havent found SQL that looks like this, curious what everyone thinks? In the worst case, data source can distinguish, like you said.

That reminds me that maybe we need this in the merge/update apis as well.

szehon-ho avatar Jul 13 '24 21:07 szehon-ho

Other ideas welcome. Same scenario applies to UPDATE which also has single table identifier to read from and write to.

I noticed SQL Server uses with to specify hints: https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-table?view=sql-server-ver16. We may refer that and design new syntaxes as

with (options(k1=v1, k2=v2), read(k3=v3, k4=v4), write(k5=v5, k6=v6))

? However, I think WITH READ OPTIONS('a', 'b') WRITE OPTIONS ('c', 'd') is clear and sufficient for read/write options only.

advancedxy avatar Aug 06 '24 14:08 advancedxy