[SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies
What changes were proposed in this pull request?
This is a draft change of the current state of the Spark Connect prototype, implemented as a driver plugin to separate the classpaths and shade the dependent libraries. The benefit of a driver plugin is that the plugin has access to everything inside Spark but can itself be a leaf build. This allows us to shade its own minimal set of dependencies without bringing new dependencies into Spark.
How is Spark Connect Integrated?
Spark Connect is implemented as a driver plugin for Apache Spark. This means that it can be optionally loaded by specifying the following option during startup.
--conf spark.plugins=org.apache.spark.sql.sparkconnect.service.SparkConnectPlugin
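For context, the driver-plugin API this hooks into looks roughly like the sketch below. The method bodies are placeholders, not the actual Spark Connect implementation in this PR; only the `SparkPlugin`/`DriverPlugin` interfaces themselves are Spark's public plugin API.

```scala
import java.util.{Collections, Map => JMap}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Rough sketch of the plugin shape: the driver side would start the gRPC
// service on init and stop it on shutdown; no executor-side plugin is needed.
class SparkConnectPlugin extends SparkPlugin {

  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      // Placeholder: start the Spark Connect gRPC endpoint here.
      Collections.emptyMap()
    }

    override def shutdown(): Unit = {
      // Placeholder: stop the gRPC endpoint here.
    }
  }

  // Spark Connect only needs to run on the driver.
  override def executorPlugin(): ExecutorPlugin = null
}
```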
The way that the plugin is integrated into the build system does not require any additional changes and the properly configured Jar files are produced as part of the build.
How are GRPC and Guava dependencies handled?
Spark Connect relies on both GRPC and Guava. Since the required Guava version is incompatible with the version used in Spark, the plugin shades this dependency directly as part of the build. In the case of Maven this is very easy; in the case of SBT it's a bit trickier. For SBT we hook into the copyDeps stage of the build and, instead of copying the plain artifact dependency, we copy the shaded dependency instead.
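To make the SBT side concrete, here is a simplified build.sbt-style sketch of the idea rather than the actual SparkBuild.scala change; the `copyDeps` and `dependencyJarDir` keys and the shaded package prefixes are placeholders for whatever the real build defines.

```scala
// Placeholder keys standing in for the real build's dependency-copy step.
val copyDeps = taskKey[File]("Copy the (shaded) dependency jar into place")
val dependencyJarDir = settingKey[File]("Directory the build copies dependency jars into")

lazy val connect = (project in file("connector/connect"))
  .settings(
    // Relocate Guava and gRPC classes (via sbt-assembly) so they cannot
    // clash with the versions used elsewhere in Spark.
    assembly / assemblyShadeRules := Seq(
      ShadeRule.rename("com.google.common.**" -> "org.sparkproject.connect.guava.@1").inAll,
      ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.grpc.@1").inAll
    ),
    // Instead of copying the plain artifact, copy the shaded assembly jar.
    copyDeps := {
      val shadedJar = assembly.value
      val target = dependencyJarDir.value / shadedJar.getName
      IO.copyFile(shadedJar, target)
      target
    }
  )
```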
Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-39375
Does this PR introduce any user-facing change?
Experimental API for Spark Connect
How was this patch tested?
This patch was tested by adding two new test surface areas. On the driver-plugin side, it adds rudimentary test infrastructure that makes sure Proto plans can be translated into Spark logical plans. Secondly, there are Python integration tests that cover the DataFrame-to-Proto conversion and the end-to-end execution of Spark Connect queries.
To avoid issues with other parts of the build, only tests that require Spark Connect will be started with the driver plugin.
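To give an idea of what the driver-plugin-side tests look like, here is a rough sketch; the proto builder calls and the SparkConnectPlanner constructor and transform() signature are assumptions based on the description above, not necessarily the exact API in this PR.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical sketch: proto.Relation / proto.Read stand in for the generated
// protobuf classes, and the planner constructor and transform() signature are
// assumptions rather than the exact API added here.
class SparkConnectPlannerSuite extends AnyFunSuite {

  private lazy val spark = SparkSession.builder().master("local[1]").getOrCreate()

  test("a proto read relation translates into a Spark logical plan") {
    // Build a minimal proto plan that reads a named table.
    val protoPlan = proto.Relation.newBuilder()
      .setRead(proto.Read.newBuilder()
        .setNamedTable(proto.Read.NamedTable.newBuilder().addParts("my_table")))
      .build()

    // Translate it and sanity-check that a logical plan comes back.
    val logicalPlan = new SparkConnectPlanner(protoPlan, spark).transform()
    assert(logicalPlan != null)
  }
}
```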
Note:
- The PySpark tests are only supported for regular CPython and are disabled for PyPy.
- mypy was initially disabled for this patch to reduce friction. It will be enabled again.
Caveats
- mypy is disabled on the Python prototype until follow-up patches fix the warnings.
- The unit-test coverage in Scala is not good. Follow-up tests will increase coverage, but to keep this patch reasonable in size I did not add more code.
New Dependencies
This patch adds GRPC, Protobuf, and an updated Guava to the leaf build of Spark Connect. These dependencies are not available to the top-level Spark build and are not available to other Spark projects. To avoid dependency issues, the Maven and SBT builds shade GRPC, Protobuf, and Guava explicitly.
Follow Ups
Due to the size of the PR, smaller clean-up changes will be submitted as follow up PRs.
Can one of the admins verify this patch?
You can retrigger the CI; the issue has already been fixed by GitHub. https://www.githubstatus.com/incidents/d181frs643d4
I retriggered the builds; let's see if it helps.
To speed up the base image build, you could move the newly added Python dependencies to the end of dev/infra/Dockerfile while your PR is still WIP.
@Yikun the Docker build works fine again, but I'm still facing the issue with the doc build that causes my build to fail. It happens in a plain SBT build: sbt clean package -Phive
https://github.com/grundprinzip/spark/runs/8226589025?check_suite_focus=true#step:21:6379
Any ideas?
Not having taken a deep look: not only did sbt clean package -Phive fail [2], but build/sbt -Pkinesis-asl clean compile unidoc failed as well [1].
But you didn't change any code involved in [1][2], so this might relate to some change in pom.xml or SparkBuild.scala (failing when some profile is not enabled).
[1] https://github.com/grundprinzip/spark/runs/8226589025?check_suite_focus=true#step:21:483 [2] https://github.com/grundprinzip/spark/runs/8226589025?check_suite_focus=true#step:21:6379
Interestingly the build says:
[info] Main Scala API documentation successful.
[success] Total time: 159 s (02:39), completed Sep 7, 2022 11:38:25 AM
Moving back into docs dir.
Removing old docs
From what @HyukjinKwon told me, the unidoc build can have false positives due to the way we generate the Javadoc from the Scala code, and according to the Ruby build script for the documentation, the system command finishes with exit code 0; otherwise it would have thrown before.
I did more investigation on the build failure. It seems that the doc build for Python (and R) is the only place where we actually call the build/sbt package target. Most of the other targets execute Test/package.
This is ready for a look now.
Since the whole feature and its code will be very large, we (explicitly I, @grundprinzip, @amaliujia, and @cloud-fan) discussed offline and decided to propose splitting this up. This PR is basically the minimal working version; note that most of the code lines here were generated from the protobuf definitions.
SPARK-39375 is the parent JIRA, and we have described the current action items there. More JIRAs will be filed according to the plan below:
High-level plan and design:
Low-level plan:
Short-term
- Extend test coverage for SparkConnectPlanner (right now at 76% line coverage)
- Extend test coverage for Spark Connect Python client
- Type annotations for Spark Connect Python client to re-enable mypy
- Clean-up documentation in PySpark code for Spark Connect
- Documentation for PySpark in README and doctests
- Proto validation in server and/or client
- Validation:
- Syntactic -> Parsing
- Semantic -> Analysis
- Alternatively only return error class to clients upon failures.
- Initial DSL framework for protobuf testing
- Restructure the build structure to match with other components
- Maven
- SBT
Long-term
- Testing with custom DSL
- LocalRelation
- Better error handling for semantic failures
- Spark and Session configurations
- Scala Client
- SBT incremental build and testing environment
- DataSources
- UDFs
- Packaging / Releasing
I am thinking about merging it without making major changes in this PR if there aren't major issues found, and I myself will take a look for important/urgent items very soon according to the plan described above.
I would like to be transparent here. The frank reasons I am thinking as above are as follows:
- Multiple people will work intensively together on this component, but each individual works in a different timezone, which makes it difficult to work within @grundprinzip's branch.
- It is difficult to manage credit properly. The whole body of work will be very large, and I would like to avoid sharing the same credit across all coauthors; a different person will sign off and be the author of each individual change.
- I would like to speed things up by fully leveraging each individual fork's GitHub Actions resources. Currently, @grundprinzip's GitHub Actions resources here are a bottleneck.
Hope this plan (1. merging it as the minimal working version without major changes if there aren't major concerns, and 2. sticking to the plan described in https://github.com/apache/spark/pull/37710#issuecomment-1249261232) makes sense to other committers too. Are you guys okay with this? @dongjoon-hyun @viirya @mridulm @srowen @wangyum @sunchao @huaxingao @Yikun @rxin @hvanhovell @tgravescs (derived from SPIP voting).
If people think the general direction is fine, I will start to review it within a couple of days.
It wasn't me! :-) You meant Martin Grund (@grundprinzip)
I will start reviewing closely from late tonight or early tomorrow, assuming that we're fine with the plan shared above. I created a new component, CONNECT; see also https://github.com/apache/spark/pull/37925.
I would be OK with merging a minimal working version as long as it doesn't impact many other components or destabilize the builds and other developers' activities. If it doesn't fit this, I think it should either be changed to be decoupled or wait until we think it is. Based on how it's a plugin I would think it would be fairly decoupled, but I haven't looked at the code yet.
I think we should ideally give people a few more days to give input, to make sure they agree. Unfortunately I'm just seeing this, so I will take a high-level look in the next day or so.
Thanks for your feedback. Yes, it's pretty much decoupled, and I believe it doesn't affect other components. Sure, I will leave it open for a few more days.
There is a testing plan (Spark Connect API Testing Plan) that I, @amaliujia, and @cloud-fan will work on right away after this PR. I promise that I will keep an eye on the testing coverage - it's a bit difficult to work together in one branch because of several issues (https://github.com/apache/spark/pull/37710#issuecomment-1249269885).
FWIW, there is a more detailed design doc at High-Level Design Doc for Spark Connect.
@tgravescs I will follow up on the testing plan doc to address your comments. Please feel free to bring up anything in the doc or here.
@pan3793 thanks for the thorough review. I will address the comments shortly.
Merged to master.
I will follow up and actively work on the clean-up and follow-up tasks from tomorrow.
Ack, I will regenerate the protos and update.
There's an outstanding comment: https://github.com/apache/spark/pull/37710#discussion_r978291019. I am working on this.
FYI, I made a PR https://github.com/apache/spark/pull/38109 to address https://github.com/apache/spark/pull/37710#discussion_r978291019