spark
spark copied to clipboard
[SPARK-36680][SQL] Supports Dynamic Table Options for Spark SQL
What changes were proposed in this pull request?
in Spark SQL, add 'WITH OPTIONS' syntax to support dynamic relation options.
This is a continuation of https://github.com/apache/spark/pull/41683 based on @cloud-fan's nice suggestion. That was itself a continuation of https://github.com/apache/spark/pull/34072.
Why are the changes needed?
This will allow Spark SQL to have equivalence to DataFrameReader API. For example, it is possible to specify options today to DataSources as follows via the API:
spark.read.format("jdbc").option("fetchSize", 0).load()
This pr allows an equivalent Spark SQL syntax to specify options:
SELECT * FROM jdbcTable WITH OPTIONS(`fetchSize` = 0)
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test in DataSourceV2SQLSuite
Was this patch authored or co-authored using generative AI tooling?
No
cc @huaxingao @cloud-fan could you guys take a look? Thanks @cloud-fan for the suggestion to use WITH syntax, it makes the logic much simpler.
Fixed the review items, thanks @huaxingao , if you want to take another look
What's the dynamic table option? It seems static.
@szehon-ho
Since propertyList is not optional any more, shall we remove the check if (clause.options == null) in createUnresolvedRelation, and also remove the changes in error-conditions.json and QueryCompilationErrors?
What's the dynamic table option? It seems static.
I got you mean now.
@beliefer i didnt make the initial JIRA title :) , let me know if something makes more sense.
@cloud-fan i made some changes to update the PR description and change back to WITH OPTIONS as @huaxingao mentioned to me. We can look at write options subsequently.
Did we want to move forward with this, or are there some new suggestions ? Thanks
Thank you all for your thoughtful comments and suggestions. I appreciate the discussion around leveraging TVF-like syntax, and I understand the potential benefits of using a more SQL-standard approach. However, I would like to highlight why I believe the WITH OPTIONS syntax remains the most beneficial for our current goals:
-
Simplicity and Accessibility: The
WITH OPTIONSsyntax is straightforward and directly communicates the intent of applying specific options to the SQL query. It's an intuitive approach that can be easily understood by users without extensive SQL background. -
Separation of Concerns: Using
WITH OPTIONScleanly separates the functional part of the SQL query from the configuration parameters. This clarity improves code readability and maintenance, as opposed to embedding options within a TVF syntax that could obscure the primary purpose of the query. -
Avoiding Over-complexity: While TVF-like syntax offers elegance, it introduces a level of complexity that may not be necessary. For instance, distinguishing between a table and a TVF in queries, as mentioned, can confuse users and complicate the parser's job. Our proposed syntax avoids this by keeping the query structure familiar and straightforward.
-
Focus on Practical Use Cases: The
WITH OPTIONSsyntax is specifically designed to handle practical scenarios where users might need to dynamically configure queries based on different execution contexts, such as adjustingfetchSizefor performance tuning. This approach is particularly beneficial in environments where such configurations are frequently modified.
I propose we continue with the WITH OPTIONS syntax for its clarity, ease of use, and flexibility. I would like to move this PR forward. 🙂
only one comment: https://github.com/apache/spark/pull/46707/files#r1669539861 , I'm also fine to just remove OPTIONS.
Thanks, removed OPTIONS and other review comments. Can add OPTIONS as optional in subsequent pr.
thanks, merging to master!
Thanks @szehon-ho for this! This is really useful for SELECT statements. However we also have UPDATE and DELETE statements which can lead to a scans. Should we also add also add support for options in UPDATE and DELETE? Would be happy to raise a PR.
@shardulm94 feel free to raise PRs! I think it makes sense to support DML as the DataFrameWriter API can also specify options
sounds good, thanks! i was planning to do follow up for write (INSERT) and can look at any other missing ones too.
Thanks @szehon-ho! I think there may be an interesting discussion here. E.g. consider the DELETE statement with some options passed
DELETE FROM tbl WITH ('split-size' = 5)
WHERE id > 10
What do these options apply to? The scan from the table, or the write to the table?
- We can pass the options to both the read and the write, and let the datasource disambiguate. I don't think this is great because there will likely be a case in the future whether the same option with different values would need to be passed to read and write.
- We could create some syntax disambiguating reads v/s writes. E.g.
WITH READ OPTIONS('a', 'b') WRITE OPTIONS ('c', 'd'). I prefer this approach, I think this is more clear for the user, but it goes against previous discussions in this PR.
Other ideas welcome. Same scenario applies to UPDATE which also has single table identifier to read from and write to. MERGE and INSERT have separate source and target identifiers and so you could specify options separately per identifier.
Yea its a good point, I would like WITH READ and WITH WRITE options as its most clear, but havent found SQL that looks like this, curious what everyone thinks? In the worst case, data source can distinguish, like you said.
That reminds me that maybe we need this in the merge/update apis as well.
Other ideas welcome. Same scenario applies to UPDATE which also has single table identifier to read from and write to.
I noticed SQL Server uses with to specify hints: https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-table?view=sql-server-ver16. We may refer that and design new syntaxes as
with (options(k1=v1, k2=v2), read(k3=v3, k4=v4), write(k5=v5, k6=v6))
? However, I think WITH READ OPTIONS('a', 'b') WRITE OPTIONS ('c', 'd') is clear and sufficient for read/write options only.