[SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies
What changes were proposed in this pull request?
This is a draft change of the current state of the Spark Connect prototype, implemented as a driver plugin to separate the classpaths and shade the dependent libraries. The benefit of a driver plugin is that the plugin has access to everything inside Spark but can itself be a leaf build. This allows us to shade its own minimal set of dependencies without bringing new dependencies into Spark.
How is Spark Connect Integrated?
Spark Connect is implemented as a driver plugin for Apache Spark. This means that it can be optionally loaded by specifying the following option during startup.
--conf spark.plugins=org.apache.spark.sql.sparkconnect.service.SparkConnectPlugin
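For context, the driver-plugin API this hooks into looks roughly like the sketch below. The method bodies are placeholders, not the actual Spark Connect implementation in this PR; only the `SparkPlugin`/`DriverPlugin` interfaces themselves are Spark's public plugin API.

```scala
import java.util.{Collections, Map => JMap}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Rough sketch of the plugin shape: the driver side would start the gRPC
// service on init and stop it on shutdown; no executor-side plugin is needed.
class SparkConnectPlugin extends SparkPlugin {

  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      // Placeholder: start the Spark Connect gRPC endpoint here.
      Collections.emptyMap()
    }

    override def shutdown(): Unit = {
      // Placeholder: stop the gRPC endpoint here.
    }
  }

  // Spark Connect only needs to run on the driver.
  override def executorPlugin(): ExecutorPlugin = null
}
```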
The way that the plugin is integrated into the build system does not require any additional changes and the properly configured Jar files are produced as part of the build.
How are GRPC and Guava dependencies handled?
Spark Connect relies on both GRPC and Guava. Since the required Guava version is incompatible with the version used in Spark, the plugin shades this dependency directly as part of the build. In the case of Maven this is very easy; in the case of SBT it's a bit trickier. For SBT we hook into the copyDeps stage of the build and, instead of copying the plain artifact dependency, we copy the shaded dependency instead.
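To make the SBT side concrete, here is a simplified build.sbt-style sketch of the idea rather than the actual SparkBuild.scala change; the `copyDeps` and `dependencyJarDir` keys and the shaded package prefixes are placeholders for whatever the real build defines.

```scala
// Placeholder keys standing in for the real build's dependency-copy step.
val copyDeps = taskKey[File]("Copy the (shaded) dependency jar into place")
val dependencyJarDir = settingKey[File]("Directory the build copies dependency jars into")

lazy val connect = (project in file("connector/connect"))
  .settings(
    // Relocate Guava and gRPC classes (via sbt-assembly) so they cannot
    // clash with the versions used elsewhere in Spark.
    assembly / assemblyShadeRules := Seq(
      ShadeRule.rename("com.google.common.**" -> "org.sparkproject.connect.guava.@1").inAll,
      ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.grpc.@1").inAll
    ),
    // Instead of copying the plain artifact, copy the shaded assembly jar.
    copyDeps := {
      val shadedJar = assembly.value
      val target = dependencyJarDir.value / shadedJar.getName
      IO.copyFile(shadedJar, target)
      target
    }
  )
```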
Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-39375
Does this PR introduce any user-facing change?
Experimental API for Spark Connect
How was this patch tested?
This patch was tested by adding two new test surface areas. On the driver-plugin side, it adds rudimentary test infrastructure that makes sure Proto plans can be translated into Spark logical plans. Secondly, there are Python integration tests that cover the DataFrame-to-Proto conversion and the end-to-end execution of Spark Connect queries.
To avoid issues with other parts of the build, only tests that require Spark Connect will be started with the driver plugin.
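To give an idea of what the driver-plugin-side tests look like, here is a rough sketch; the proto builder calls and the SparkConnectPlanner constructor and transform() signature are assumptions based on the description above, not necessarily the exact API in this PR.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical sketch: proto.Relation / proto.Read stand in for the generated
// protobuf classes, and the planner constructor and transform() signature are
// assumptions rather than the exact API added here.
class SparkConnectPlannerSuite extends AnyFunSuite {

  private lazy val spark = SparkSession.builder().master("local[1]").getOrCreate()

  test("a proto read relation translates into a Spark logical plan") {
    // Build a minimal proto plan that reads a named table.
    val protoPlan = proto.Relation.newBuilder()
      .setRead(proto.Read.newBuilder()
        .setNamedTable(proto.Read.NamedTable.newBuilder().addParts("my_table")))
      .build()

    // Translate it and sanity-check that a logical plan comes back.
    val logicalPlan = new SparkConnectPlanner(protoPlan, spark).transform()
    assert(logicalPlan != null)
  }
}
```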
Note:
- The PySpark tests are only supported for regular CPython and are disabled for PyPy.
- mypy was initially disabled for this patch to reduce friction. It will be enabled again.
Caveats
- mypy is disabled on the Python prototype until follow-up patches fix the warnings.
- The unit-test coverage in Scala is not good. Follow-up tests will increase coverage, but to keep this patch reasonable in size I did not add more code.
New Dependencies
This patch adds GRPC, Protobuf, and an updated Guava to the leaf build of Spark Connect. These dependencies are not available to the top-level Spark build and are not available to other Spark projects. To avoid dependency issues, the Maven and SBT builds shade GRPC, Protobuf, and Guava explicitly.
Follow Ups
Due to the size of the PR, smaller clean-up changes will be submitted as follow up PRs.
Can one of the admins verify this patch?
You can retrigger the CI; the issue has already been fixed by GitHub. https://www.githubstatus.com/incidents/d181frs643d4
I retriggered the builds; let's see if it helps.
To speed up the base image build, you could move the newly added Python dependencies to the end of dev/infra/Dockerfile while your PR is still WIP.
@Yikun the Docker build works fine again, but I'm still facing the issue with the doc build that causes my build to fail. It happens in a plain SBT build: sbt clean package -Phive
https://github.com/grundprinzip/spark/runs/8226589025?check_suite_focus=true#step:21:6379
Any ideas?
Not having taken a deep look: not only did sbt clean package -Phive fail [2], but build/sbt -Pkinesis-asl clean compile unidoc failed as well [1].
But you didn't change any code involved in [1][2], so this might relate to some change in pom.xml or SparkBuild.scala (failing when some profile is not enabled).
[1] https://github.com/grundprinzip/spark/runs/8226589025?check_suite_focus=true#step:21:483 [2] https://github.com/grundprinzip/spark/runs/8226589025?check_suite_focus=true#step:21:6379
Interestingly the build says:
[info] Main Scala API documentation successful.
[success] Total time: 159 s (02:39), completed Sep 7, 2022 11:38:25 AM
Moving back into docs dir.
Removing old docs
From what @HyukjinKwon told me, the unidoc build can have false positives due to the way we generate the Javadoc from the Scala code, and according to the Ruby build script for the documentation, the system command finishes with exit code 0; otherwise it would have thrown before.
I did more investigation on the build failure. It seems that the doc build for Python (and R) is the only place where we actually call the build/sbt package target. Most of the other targets execute Test/package.
This is ready for a look now.
Since the whole feature and its code will be very large, we (explicitly I, @grundprinzip, @amaliujia, and @cloud-fan) discussed offline and decided to propose splitting this up. This PR is basically the minimal working version; note that most of the code lines here were generated from the protobuf definitions.
SPARK-39375 is the parent JIRA, and we have described the current action items there. More JIRAs will be filed according to the plan below:
High-level plan and design:
Low-level plan:
Short-term
- Extend test coverage for SparkConnectPlanner (right now at 76% line coverage)
- Extend test coverage for Spark Connect Python client
- Type annotations for Spark Connect Python client to re-enable mypy
- Clean-up documentation in PySpark code for Spark Connect
- Documentation for PySpark in README and doctests
- Proto validation in server and/or client
- Validation:
- Syntactic -> Parsing
- Semantic -> Analysis
- Alternatively only return error class to clients upon failures.
- Initial DSL framework for protobuf testing
- Restructure the build structure to match with other components
- Maven
- SBT
Long-term
- Testing with custom DSL
- LocalRelation
- Better error handling for semantic failures
- Spark and Session configurations
- Scala Client
- SBT incremental build and testing environment
- DataSources
- UDFs
- Packaging / Releasing
I am thinking about merging it without making major changes in this PR if there aren't major issues found, and I myself will take a look for important/urgent items very soon according to the plan described above.
I would like to be transparent here. The frank reasons I am thinking as above are as follows:
- Multiple people will work intensively together on this component, but each individual works in a different timezone, which makes it difficult to work within @grundprinzip's branch.
- It is difficult to manage credit properly. The whole body of work will be very large, and I would like to avoid sharing the same credit across all coauthors; a different person will sign off and be the author of each individual change.
- I would like to speed things up by fully leveraging each individual fork's GitHub Actions resources. Currently, @grundprinzip's GitHub Actions resources here are a bottleneck.
Hope this plan (1. merging it as the minimal working version without major changes if there aren't major concerns, and 2. sticking to the plan described in https://github.com/apache/spark/pull/37710#issuecomment-1249261232) makes sense to other committers too. Are you guys okay with this? @dongjoon-hyun @viirya @mridulm @srowen @wangyum @sunchao @huaxingao @Yikun @rxin @hvanhovell @tgravescs (derived from SPIP voting).
If people think the general direction is fine, I will start to review it within a couple of days.
It wasn't me! :-) You meant Martin Grund (@grundprinzip)
I will start reviewing closely from late tonight or early tomorrow, assuming that we're fine with the plan shared above. I created a new component, CONNECT; see also https://github.com/apache/spark/pull/37925.
I would be OK with merging a minimal working version as long as it doesn't impact many other components or destabilize the builds and other developers' activities. If it doesn't fit this, I think it should either be changed to be decoupled or wait until we think it is. Based on how it's a plugin I would think it would be fairly decoupled, but I haven't looked at the code yet.
I think we should ideally give people a few more days to give input, to make sure they agree. Unfortunately I'm just seeing this, so I will take a high-level look in the next day or so.
Thanks for your feedback. Yes, it's pretty much decoupled, and I believe it doesn't affect other components. Sure, I will leave it open for a few more days.
There is a testing plan (Spark Connect API Testing Plan) that I, @amaliujia, and @cloud-fan will work on right away after this PR. I promise that I will keep an eye on the testing coverage - it's a bit difficult to work together in one branch because of several issues (https://github.com/apache/spark/pull/37710#issuecomment-1249269885).
FWIW, there is a more detailed design doc at High-Level Design Doc for Spark Connect.
@tgravescs I will follow up on the testing plan doc to address your comments. Please feel free to bring up anything in the doc or here.
@pan3793 thanks for the thorough review. I will address the comments shortly.
Merged to master.
I will follow up and actively work on the clean-up and follow-up tasks from tomorrow.
Ack, I will regenerate the protos and update.
There's an outstanding comment: https://github.com/apache/spark/pull/37710#discussion_r978291019. I am working on this.
FYI, I made a PR https://github.com/apache/spark/pull/38109 to address https://github.com/apache/spark/pull/37710#discussion_r978291019