[WIP][SPARK-48344][SQL] SQL Scripting execution (including Spark Connect)
This PR is a recreation of https://github.com/apache/spark/pull/47403 and adds a few minor changes to support execution over Spark Connect as well.
What changes were proposed in this pull request?
This pull request introduces the logic for SQL script execution, with minor additional changes to support execution over Spark Connect as well. We decided to make these two changes together because they are tightly coupled, and SQL script execution in general has been designed with Spark Connect in mind.
The main design decision is that, from a correctness perspective, SQL scripts need to be executed eagerly. Examples (a sketch follows the list):
- Select from table `t` followed by a drop statement for table `t` - if not executed eagerly, the select would throw an exception, which is not correct.
- Exception handling - the SQL standard for scripting requires exception handlers. If statements are not executed eagerly, potential exceptions wouldn't be handled when/where they are supposed to be.
- Various other correctness examples - select from `t`, insert into `t`, select from `t` - if not executed eagerly, both selects would return the same results, which might not be correct, etc.
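To make the last example concrete, here is a minimal sketch using `SparkSession.sql()` as described in this PR; the table `t`, its format, and the counts are illustrative only:

```scala
// Assumes a running SparkSession `spark` and the flag introduced in this PR.
spark.conf.set("spark.sql.scripting.enabled", "true")

spark.sql("CREATE TABLE t (a INT) USING parquet")
spark.sql("INSERT INTO t VALUES (0)")

// With eager execution, the INSERT runs between the two SELECTs, so each
// SELECT observes a different state of t. Without eager execution, both
// SELECTs could be evaluated against the same state of t and return
// identical results.
val result = spark.sql(
  """BEGIN
    |  SELECT COUNT(*) FROM t;  -- runs first, sees 1 row
    |  INSERT INTO t VALUES (1);
    |  SELECT COUNT(*) FROM t;  -- runs last, sees 2 rows
    |END""".stripMargin)

// Per this PR, the result of the last executed statement is returned.
result.show()
```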
The following changes are proposed for SQL script execution:
- introduction of the `spark.sql.scripting.enabled` flag to enable SQL scripting execution
- execution node changes:
  - introduction of an `execute()` function on leaf execution nodes (`SingleStatementExec`) - for now, it implements a simple way of collecting results into an in-memory variable.
  - simplified hierarchy of non-leaf execution nodes - `CompoundNestedStatementIteratorExec` is removed; now we have only `CompoundBodyExec`.
  - while iterating through `CompoundBody` in `CompoundBodyExec`'s iterator, we call the `execute()` function of each leaf statement (eager execution).
- interpreter changes (see the sketch after this list):
  - introduction of an `execute()` function - instead of doing it externally, `execute` is now supposed to collect results from the last executed statement in the SQL script; in the future we will introduce an API that can return results from multiple statements, and we might change the naming then, but for now, for the sake of simplicity, it's called `execute`.
  - `execute()` calls `executeInternal()` to collect results from all the statements and takes only the last one. The aforementioned API for multiple results will reuse `executeInternal()` and add logic to collect results from all statements.
  - `shouldCollectResults = true` is passed only to standalone `SingleStatementExec` nodes - for example, if a `SingleStatementExec` is part of an If/Else condition, we don't want to collect its results; we want to collect results only from standalone SQL statements.
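Below is a minimal, self-contained sketch of the ideas above, using toy stand-ins for the execution-node classes (the real nodes operate on logical plans rather than strings, and the real interpreter is more involved):

```scala
// Toy model: leaf nodes get an execute() that buffers results in memory,
// and the compound body's iterator executes each leaf as it is traversed
// (eager execution).
sealed trait CompoundStatementExec

class SingleStatementExec(val sql: String) extends CompoundStatementExec {
  var result: Option[Seq[String]] = None  // simple in-memory result buffer
  def execute(): Unit = {
    // Stand-in for running the statement and collecting its rows.
    result = Some(Seq(s"rows of: $sql"))
  }
}

class CompoundBodyExec(statements: Seq[CompoundStatementExec])
    extends CompoundStatementExec {
  // Traversing the body triggers execution of each leaf statement.
  def iterator: Iterator[CompoundStatementExec] = statements.iterator.map {
    case leaf: SingleStatementExec => leaf.execute(); leaf
    case other => other
  }
}

// Interpreter-style execute(): run the whole script, keep only the result
// of the last executed statement.
def execute(body: CompoundBodyExec): Option[Seq[String]] =
  body.iterator.toList.reverse.collectFirst {
    case s: SingleStatementExec if s.result.isDefined => s.result.get
  }

val script = new CompoundBodyExec(Seq(
  new SingleStatementExec("SELECT 1"),
  new SingleStatementExec("SELECT 2")))
println(execute(script))  // Some(List(rows of: SELECT 2))
```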
To support execution over Spark Connect, a few simple changes were made:
- `SqlScriptingLogicalOperators` -> `SqlScriptingLogicalPlans`:
  - logical operators are now logical plans; more specifically, our root logical plan (`CompoundBody`) is a `Command`.
  - Spark Connect supports eager execution of `Command`s out of the box, which is why we went with this approach.
  - this means we don't need `parseScript()` in `ParserInterface` anymore; we can simply use the already existing `parsePlan()`.
- To support eager execution, results are returned wrapped into a `LocalRelation`:
  - to be able to differentiate SQL scripts here, we have added an `isSqlScript` flag to `LocalRelation`. Check `SparkSession.sql()` to see how the `LocalRelation` is constructed.
  - `SparkConnectPlanner.handleSqlCommand` - now that a SQL script is handled as a `Command`, all that remains is to make sure the eager execution path is selected for SQL scripts, which is done based on the `isSqlScript` flag from `LocalRelation` (see the sketch below).
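A tiny, self-contained sketch of that dispatch idea, with toy stand-ins rather than the actual `SparkConnectPlanner` types (the real `LocalRelation` carries a schema and rows, and the real handler builds Spark Connect responses):

```scala
// Toy plans: a LocalRelation flagged as a SQL script means the statements
// already ran eagerly and the rows it holds are final.
sealed trait LogicalPlan
case class LocalRelation(rows: Seq[Int], isSqlScript: Boolean = false)
    extends LogicalPlan
case class Query(sql: String) extends LogicalPlan

def handleSqlCommand(plan: LogicalPlan): String = plan match {
  case LocalRelation(rows, true) =>
    // SQL script: take the eager path and return the computed rows as-is.
    s"eager result: ${rows.mkString(", ")}"
  case other =>
    // Regular statement: keep the usual (lazy) command handling.
    s"lazy plan: $other"
}

println(handleSqlCommand(LocalRelation(Seq(2), isSqlScript = true)))
println(handleSqlCommand(Query("SELECT 1")))
```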
Side changes:
- Disallowing labels for a top-level compound - once we changed `CompoundBody` to become a `LogicalPlan` and altered `parsePlan` to support both compound statements and single SQL statements, we encountered an issue with parse exception reporting:
  - example: `SELCT 100` would return an exception message like ``... near `100` `` whereas it should return ``... near `SELCT` ``.
  - reason: `CompoundBody` has an optional label before `BEGIN`, which was specified in the grammar as `multipartIdentifier COLON`, meaning that spaces could exist between the identifier and the colon - so `SELCT` would get matched to `multipartIdentifier` and `100` would get matched against `COLON`, which fails.
  - solution: we tried to adapt the lexer and parser to change this behavior, but it turned out to be too complex. So, for now, we are removing this from the scope of this PR and will do a follow-up once we figure out how to properly introduce a new token for labels (one that wouldn't cause ambiguity with other tokens).
Why are the changes needed?
A series of previous pull requests introduced various SQL scripting concepts. This pull request, however, introduces the ability to actually execute SQL scripts using Spark.
For now, users will need to enable SQL scripting first using the `spark.sql.scripting.enabled` flag and then use the `SparkSession.sql()` function to execute SQL scripts. If a SQL script is provided instead of a standalone SQL statement, the result of the last executed statement will be returned, as shown below.
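A short usage sketch under the assumptions above (a running `SparkSession` named `spark`):

```scala
// Enable SQL scripting (off by default, behind the flag from this PR).
spark.conf.set("spark.sql.scripting.enabled", "true")

// Standalone SQL statements behave exactly as before:
spark.sql("SELECT 1").show()

// A script (BEGIN ... END) executes eagerly; the returned DataFrame holds
// the result of the last executed statement.
spark.sql(
  """BEGIN
    |  SELECT 1;
    |  SELECT 2;
    |END""".stripMargin).show()  // shows 2
```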
Does this PR introduce any user-facing change?
`SparkSession.sql()` is altered to support execution of SQL scripts as well. For standalone SQL statements nothing changes, but if a SQL script is provided (wrapped in `BEGIN ... END`), it will be processed and its results will be returned.
How was this patch tested?
There are already existing test suites for SQL scripting that have been improved to test the new functionality:
- `SqlScriptingParserSuite`
- `SqlScriptingExecutionNodeSuite`
- `SqlScriptingInterpreterSuite`
Additionally, `SqlScriptingE2eSuite` has been introduced. It does not focus on parser/interpreter functionality or correctness, but rather on various usage aspects - whether the API can handle SQL scripts properly, whether results are returned in the proper format, whether config flags are applied properly, etc.
Was this patch authored or co-authored using generative AI tooling?
No.