[SPARK-24497][SQL] Support recursive SQL
What changes were proposed in this pull request?
This PR adds recursive query feature to Spark SQL.
A recursive query is defined using the WITH RECURSIVE keywords and referencing the name of the common table expression within the query.
The implementation complies with SQL standard and follows similar rules to other relational databases:
- A query is made of an anchor term followed by a recursive term.
- The anchor term doesn't contain a self-reference and is used to initialize the query.
- The recursive term contains a self-reference and is used to expand the current set of rows with new ones.
- The anchor and recursive terms must be joined with each other by UNION or UNION ALL operators.
- New rows can only be derived from the newly added rows of the previous iteration (or from the initial set of rows of the anchor term). This limitation implies that recursive references can't be used with some joins, aggregations or subqueries.
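The rules above can be illustrated with a small recursive CTE. The sketch below runs the same shape of query against SQLite (which also implements standard WITH RECURSIVE semantics) rather than Spark, purely so it is self-contained; the anchor is the VALUES clause, the recursive term is the self-referencing SELECT, and they are joined by UNION ALL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Anchor: VALUES (1) initializes the row set.
# Recursive term: SELECT n + 1 FROM t expands it from the previous iteration's rows,
# stopping when the WHERE predicate filters everything out.
result = conn.execute("""
    WITH RECURSIVE t(n) AS (
        VALUES (1)
        UNION ALL
        SELECT n + 1 FROM t WHERE n < 100
    )
    SELECT sum(n) FROM t
""").fetchone()[0]
print(result)  # 5050, the sum of 1..100
```

The Spark behavior proposed in this PR follows the same evaluation model, modulo the limitations listed above.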
Please see cte-recursive.sql for some examples.
The implementation has the same limitation that SPARK-36447 / https://github.com/apache/spark/pull/33671 has:
With-CTEs mixed with SQL commands or DMLs will still go through the old inline code path because of our non-standard language specs and not-unified command/DML interfaces.
which means that recursive queries are not supported in SQL commands and DMLs.
With https://github.com/apache/spark/pull/42036 this restriction is lifted, and a recursive CTE doesn't work only when the CTE is force-inlined (spark.sql.legacy.inlineCTEInCommands=true or the command is a multi-insert statement).
Why are the changes needed?
Recursive query is an ANSI SQL feature that is useful to process hierarchical data.
Does this PR introduce any user-facing change?
Yes, adds recursive query feature.
How was this patch tested?
Added new UTs and tests in cte-recursion.sql.
This PR is WIP as it contains https://github.com/apache/spark/pull/40856. Once that PR is merged I will rebase and remove the WIP flag.
https://github.com/apache/spark/pull/40856 got merged and I've rebased this PR. I'm removing the WIP flag and the PR is ready for review.
cc @cloud-fan, @wangyum, @maryannxue, @sigmod
Thanks @peter-toth. I tested this patch locally, but it seems it throws a StackOverflowError.
How to reproduce:
./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver
tar -zxf spark-3.5.0-SNAPSHOT-bin-3.3.5.tgz
cd spark-3.5.0-SNAPSHOT-bin-3.3.5
bin/spark-sql
spark-sql (default)> WITH RECURSIVE t(n) AS (
> VALUES (1)
> UNION ALL
> SELECT n+1 FROM t WHERE n < 100
> )
> SELECT sum(n) FROM t;
23/05/30 13:21:21 ERROR Executor: Exception in task 0.0 in stage 265.0 (TID 199)
java.lang.StackOverflowError
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Thanks for testing this PR @wangyum. Interestingly, I didn't encounter a stack overflow when the recursion level is <100. The error starts to appear at level ~170 in my tests. I guess this depends on your default stack size. Since recursion works in a way that each iteration depends on the previous iteration, the RDD lineage of the tasks keeps growing, and the deserialization of those tasks can throw a stack overflow error at some point. Let me amend this PR to add optional checkpointing so as to truncate the RDD lineage and be able to deal with deeper recursion...
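The lineage problem can be modeled without Spark at all. In the sketch below (plain Python, not Spark code; Node, depth and checkpoint_every are hypothetical names for illustration), each iteration's result wraps the previous one, so walking the chain, like deserializing a task with a long RDD lineage, recurses to the full depth. A "checkpoint" materializes the current state and cuts the chain, which is what the optional checkpointing mentioned above is meant to do:

```python
class Node:
    """Stand-in for an RDD: each iteration's result points at its parent."""
    def __init__(self, parent):
        self.parent = parent

    def depth(self):
        # Recursive walk over the chain, analogous to deserializing
        # a task whose RDD lineage spans every iteration so far.
        return 0 if self.parent is None else 1 + self.parent.depth()

def iterate(levels, checkpoint_every=None):
    node = Node(None)
    for i in range(levels):
        node = Node(node)  # each iteration depends on the previous one
        if checkpoint_every and (i + 1) % checkpoint_every == 0:
            # "Checkpoint": pretend the data was materialized to stable
            # storage, so the chain no longer reaches back to the start.
            node = Node(None)
    return node

print(iterate(520).depth())                      # 520: depth grows with every iteration
print(iterate(520, checkpoint_every=50).depth()) # 20: only since the last checkpoint
```

With a deep enough chain the recursive walk itself overflows the stack, which mirrors the StackOverflowError seen during task deserialization; periodic checkpointing bounds the depth regardless of how many iterations the recursion runs.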
Hey folks, so glad to see this feature is being worked on. Do you have any estimate of when this could be released?
This feature very likely won't make it into the next release (Spark 3.5) as the branch cut is in 2 weeks. But I will try to add it to the one after next (Spark 4.0).
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
@peter-toth thank you so much for sticking with this over three major versions and three separate pull requests. Recursive queries would be really nice to have in Spark SQL.
@peter-toth Hi, we are very much expecting a recursive sql. We hope you will be able to complete this pull request :)
@milastdbx do you think you can take over this PR?
cc @cloud-fan, @mitkedb, @MaxGekk
Yes, thank you.
Milan
Can this PR be merged? I also encountered this scenario
If you want to run hierarchical queries, you could try the following while this PR is not available:
https://pypi.org/project/pyspark-connectby/
Any update on this PR?
@milastdbx are you still planning to take this up?
@wangyum I see that you started the review last year and the issues you raised were addressed by Peter.
Then @milastdbx was tagged to take over the PR, but I don't see the issue being assigned to you yet.
How do we get this PR reviewed?
cc @cloud-fan, @mitkedb, @MaxGekk
@peter-toth I have not looked closely at the implementation but I do have a question about this: has the logic been implemented in some way similar to tail call optimization such that there is no recursion depth limit?
Any update ? Thanks !
Let me close this PR as its open state seemingly causes some confusion. Feel free to reuse the code if anyone wants to tackle this issue.
@peter-toth can you also update the status on the Jira ticket? https://issues.apache.org/jira/browse/SPARK-24497