[WIP][SPARK-48344][SQL] SQL Scripting execution (including Spark Connect)
This PR is a recreation of https://github.com/apache/spark/pull/47403 and adds a few minor changes to support execution over Spark Connect as well.
What changes were proposed in this pull request?
This pull request introduces the logic for SQL script execution, with minor additional changes to support execution over Spark Connect as well. We decided to make these two changes together because they are tightly coupled, and SQL script execution in general has been designed with Spark Connect in mind.
The main design decision is that, from a correctness perspective, SQL scripts need to be executed eagerly. Examples (a sketch follows the list):
- Select from table `t` followed by a drop statement for table `t` - if not executed eagerly, the select would throw an exception, which is not correct.
- Exception handling - the SQL standard for scripting requires exception handlers. If statements are not executed eagerly, potential exceptions wouldn't be handled when/where they are supposed to be.
- Various other correctness examples - select from `t`, insert into `t`, select from `t` - if not executed eagerly, both selects would return the same results, which might not be correct, etc.
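To make the last example concrete, here is a minimal sketch using `SparkSession.sql()` as described in this PR; the table `t`, its format, and the counts are illustrative only:

```scala
// Assumes a running SparkSession `spark` and the flag introduced in this PR.
spark.conf.set("spark.sql.scripting.enabled", "true")

spark.sql("CREATE TABLE t (a INT) USING parquet")
spark.sql("INSERT INTO t VALUES (0)")

// With eager execution, the INSERT runs between the two SELECTs, so each
// SELECT observes a different state of t. Without eager execution, both
// SELECTs could be evaluated against the same state of t and return
// identical results.
val result = spark.sql(
  """BEGIN
    |  SELECT COUNT(*) FROM t;  -- runs first, sees 1 row
    |  INSERT INTO t VALUES (1);
    |  SELECT COUNT(*) FROM t;  -- runs last, sees 2 rows
    |END""".stripMargin)

// Per this PR, the result of the last executed statement is returned.
result.show()
```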
The following changes are proposed for SQL script execution:
- introduction of the `spark.sql.scripting.enabled` flag to enable SQL scripting execution
- execution node changes:
  - introduction of an `execute()` function on leaf execution nodes (`SingleStatementExec`) - for now, it implements a simple way of collecting results into an in-memory variable.
  - simplified hierarchy of non-leaf execution nodes - `CompoundNestedStatementIteratorExec` is removed; now we have only `CompoundBodyExec`.
  - while iterating through `CompoundBody` in `CompoundBodyExec`'s iterator, we call the `execute()` function of each leaf statement (eager execution).
- interpreter changes (see the sketch after this list):
  - introduction of an `execute()` function - instead of doing it externally, `execute` is now supposed to collect results from the last executed statement in the SQL script; in the future we will introduce an API that can return results from multiple statements, and we might change the naming then, but for now, for the sake of simplicity, it's called `execute`.
  - `execute()` calls `executeInternal()` to collect results from all the statements and takes only the last one. The aforementioned API for multiple results will reuse `executeInternal()` and add logic to collect results from all statements.
  - `shouldCollectResults = true` is passed only to standalone `SingleStatementExec` nodes - for example, if a `SingleStatementExec` is part of an If/Else condition, we don't want to collect its results; we want to collect results only from standalone SQL statements.
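Below is a minimal, self-contained sketch of the ideas above, using toy stand-ins for the execution-node classes (the real nodes operate on logical plans rather than strings, and the real interpreter is more involved):

```scala
// Toy model: leaf nodes get an execute() that buffers results in memory,
// and the compound body's iterator executes each leaf as it is traversed
// (eager execution).
sealed trait CompoundStatementExec

class SingleStatementExec(val sql: String) extends CompoundStatementExec {
  var result: Option[Seq[String]] = None  // simple in-memory result buffer
  def execute(): Unit = {
    // Stand-in for running the statement and collecting its rows.
    result = Some(Seq(s"rows of: $sql"))
  }
}

class CompoundBodyExec(statements: Seq[CompoundStatementExec])
    extends CompoundStatementExec {
  // Traversing the body triggers execution of each leaf statement.
  def iterator: Iterator[CompoundStatementExec] = statements.iterator.map {
    case leaf: SingleStatementExec => leaf.execute(); leaf
    case other => other
  }
}

// Interpreter-style execute(): run the whole script, keep only the result
// of the last executed statement.
def execute(body: CompoundBodyExec): Option[Seq[String]] =
  body.iterator.toList.reverse.collectFirst {
    case s: SingleStatementExec if s.result.isDefined => s.result.get
  }

val script = new CompoundBodyExec(Seq(
  new SingleStatementExec("SELECT 1"),
  new SingleStatementExec("SELECT 2")))
println(execute(script))  // Some(List(rows of: SELECT 2))
```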
To support execution over Spark Connect, a few simple changes were made:
- `SqlScriptingLogicalOperators` -> `SqlScriptingLogicalPlans`:
  - logical operators are now logical plans; more specifically, our root logical plan (`CompoundBody`) is a `Command`.
  - Spark Connect supports eager execution of `Command`s out of the box, which is why we went with this approach.
  - this means we don't need `parseScript()` in `ParserInterface` anymore; we can simply use the already existing `parsePlan()`.
- To support eager execution, results are returned wrapped into a `LocalRelation`:
  - to be able to differentiate SQL scripts here, we have added an `isSqlScript` flag to `LocalRelation`. Check `SparkSession.sql()` to see how the `LocalRelation` is constructed.
  - `SparkConnectPlanner.handleSqlCommand` - now that a SQL script is handled as a `Command`, all that remains is to make sure the eager execution path is selected for SQL scripts, which is done based on the `isSqlScript` flag from `LocalRelation` (see the sketch below).
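A tiny, self-contained sketch of that dispatch idea, with toy stand-ins rather than the actual `SparkConnectPlanner` types (the real `LocalRelation` carries a schema and rows, and the real handler builds Spark Connect responses):

```scala
// Toy plans: a LocalRelation flagged as a SQL script means the statements
// already ran eagerly and the rows it holds are final.
sealed trait LogicalPlan
case class LocalRelation(rows: Seq[Int], isSqlScript: Boolean = false)
    extends LogicalPlan
case class Query(sql: String) extends LogicalPlan

def handleSqlCommand(plan: LogicalPlan): String = plan match {
  case LocalRelation(rows, true) =>
    // SQL script: take the eager path and return the computed rows as-is.
    s"eager result: ${rows.mkString(", ")}"
  case other =>
    // Regular statement: keep the usual (lazy) command handling.
    s"lazy plan: $other"
}

println(handleSqlCommand(LocalRelation(Seq(2), isSqlScript = true)))
println(handleSqlCommand(Query("SELECT 1")))
```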
Side changes:
- Disallowing labels for a top-level compound - once we changed `CompoundBody` to become a `LogicalPlan` and altered `parsePlan` to support both compound statements and single SQL statements, we encountered an issue with parse exception reporting:
  - example: `SELCT 100` would return an exception message like ``... near `100` `` whereas it should return ``... near `SELCT` ``.
  - reason: `CompoundBody` has an optional label before `BEGIN`, which was specified in the grammar as `multipartIdentifier COLON`, meaning that spaces could exist between the identifier and the colon - so `SELCT` would get matched to `multipartIdentifier` and `100` would get matched against `COLON`, which fails.
  - solution: we tried to adapt the lexer and parser to change this behavior, but it turned out to be too complex. So, for now, we are removing this from the scope of this PR and will do a follow-up once we figure out how to properly introduce a new token for labels (one that wouldn't cause ambiguity with other tokens).
Why are the changes needed?
A series of previous pull requests introduced various SQL scripting concepts. This pull request, however, introduces the ability to actually execute SQL scripts using Spark.
For now, users will need to enable SQL scripting first using the `spark.sql.scripting.enabled` flag and then use the `SparkSession.sql()` function to execute SQL scripts. If a SQL script is provided instead of a standalone SQL statement, the result of the last executed statement will be returned, as shown below.
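A short usage sketch under the assumptions above (a running `SparkSession` named `spark`):

```scala
// Enable SQL scripting (off by default, behind the flag from this PR).
spark.conf.set("spark.sql.scripting.enabled", "true")

// Standalone SQL statements behave exactly as before:
spark.sql("SELECT 1").show()

// A script (BEGIN ... END) executes eagerly; the returned DataFrame holds
// the result of the last executed statement.
spark.sql(
  """BEGIN
    |  SELECT 1;
    |  SELECT 2;
    |END""".stripMargin).show()  // shows 2
```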
Does this PR introduce any user-facing change?
`SparkSession.sql()` is altered to support execution of SQL scripts as well. For standalone SQL statements nothing changes, but if a SQL script is provided (wrapped in `BEGIN ... END`), it will be processed and its results will be returned.
How was this patch tested?
There are already existing test suites for SQL scripting that have been improved to test the new functionality:
- `SqlScriptingParserSuite`
- `SqlScriptingExecutionNodeSuite`
- `SqlScriptingInterpreterSuite`
Additionally, `SqlScriptingE2eSuite` has been introduced. It does not focus on parser/interpreter functionality or correctness, but rather on various usage aspects - whether the API can handle SQL scripts properly, whether results are returned in the proper format, whether config flags are applied properly, etc.
Was this patch authored or co-authored using generative AI tooling?
No.