kyuubi
[TASK][CHALLENGE] Support Spark Connect Frontend/Backend
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Search before creating
- [X] I have searched in the task list and found no similar tasks.
Mentor
- [X] I have sufficient knowledge and experience of this task, and I volunteer to be the mentor of this task to guide contributors to complete the task.
Skill requirements
- Knowledge about Spark Connect
- Knowledge about Kyuubi architecture
- Knowledge about protobuf
- Knowledge about grpc
- Knowledge about thrift
Background and Goals
Make Kyuubi server compatible with Spark Connect protocol, so that people can use Spark Connect client to connect to Kyuubi Server.
Implementation steps
- Add a new Spark Connect frontend
  - 1.1 Add a basic gRPC server as the frontend
  - 1.2 Be compatible with the Spark Connect protocol, see https://github.com/apache/spark/blob/master/connector/connect/common/src/main/protobuf/spark/connect/base.proto
  - 1.3 Support ExecutePlan
  - 1.4 Support AnalyzePlan
  - 1.5 Support Config
  - 1.6 Support AddArtifacts
  - 1.7 Support ArtifactsStatus
  - 1.8 Support Interrupt
  - 1.9 Support ReattachExecute
  - 1.10 Support ReleaseExecute
  - 1.11 Serialize the protobuf based request
- Add a new Spark Connect backend
  - 2.1 Import spark-connect-server and rewrite SparkConnectService, see https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectServer.scala
  - 2.2 Deserialize the protobuf based response
- Add IT
- Add docs
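As an illustration of steps 1.3 to 1.10, here is a minimal sketch of a thin frontend that routes each Spark Connect RPC to a handler and reports which ones are still missing. It is not Kyuubi code: `FrontendSketch`, its methods, and the byte-string payloads are hypothetical, and the RPC names are copied from the list above, so they should be checked against `base.proto`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# RPC names copied from implementation steps 1.3 to 1.10 (verify against base.proto).
SPARK_CONNECT_RPCS = [
    "ExecutePlan", "AnalyzePlan", "Config", "AddArtifacts",
    "ArtifactsStatus", "Interrupt", "ReattachExecute", "ReleaseExecute",
]


@dataclass
class FrontendSketch:
    """Hypothetical thin frontend: maps an RPC name to a handler function."""
    handlers: Dict[str, Callable[[bytes], bytes]] = field(default_factory=dict)

    def register(self, rpc: str, handler: Callable[[bytes], bytes]) -> None:
        # Reject names that are not part of the planned RPC list.
        if rpc not in SPARK_CONNECT_RPCS:
            raise ValueError(f"unknown RPC: {rpc}")
        self.handlers[rpc] = handler

    def dispatch(self, rpc: str, payload: bytes) -> bytes:
        # Forward the serialized request to the registered handler, if any.
        handler = self.handlers.get(rpc)
        if handler is None:
            raise NotImplementedError(f"{rpc} is not supported yet")
        return handler(payload)

    def unsupported(self) -> List[str]:
        # RPCs from the plan that have no handler yet.
        return [rpc for rpc in SPARK_CONNECT_RPCS if rpc not in self.handlers]


frontend = FrontendSketch()
# Placeholder handler; a real one would forward the request to the engine.
frontend.register("ExecutePlan", lambda payload: b"result:" + payload)
```

Calling `frontend.unsupported()` gives a quick checklist of the RPCs that still need sub-issues.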
Additional context
Introduction of https://github.com/apache/kyuubi/issues/6232
I think this is very challenging, but I want to give it a try. Could you assign it to me and help me with it?
Sure, thank you @yehere! This is a kind of umbrella issue; we can create sub-issues one by one later.
This huge task could be divided into several different level tasks, feel free to go ahead ~ all your contributions will be counted eventually :)
cc @cfmcgrady
I'm also interested in it, hope to work together.
Thank you @zhaomin1423, glad to see you are interested.
How about a co-located mode with Kyuubi's Spark SQL engine? A separate service is a good baseline, but it also needs more resources because it runs more Spark instances.
@minyk they run in different processes, just like the Spark Thrift server and the Connect server. We are going to add a new module and a new server for Kyuubi Connect. We can do it together if you are interested.
@ulysses-you hello, I'm interested in this component and hope to work with you.
@yaooqinn @pan3793 @ulysses-you I found that Spark already publishes the connect module to the Maven repository: https://mvnrepository.com/artifact/org.apache.spark/spark-connect-client-jvm_2.13/3.4.0. Can we use those packages to simplify the code? Maybe we could pass a connection string to SparkSession, as described in https://github.com/apache/spark/blob/master/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala. Building on the spark-connect package, we could cut down on the gRPC server and proto code we write ourselves. What do you think?
I haven't had a deep look at it; my current thought is:
- for the server part, we only need a thin gRPC layer; copying the proto files and regenerating the gRPC files is fine.
- for the engine part, we can reuse the connect-server module to simplify the code.
@davidyuan1223 sure, please go ahead. +1 for @pan3793's thought.
@ulysses-you @pan3793 Understood. I'd like to try this challenging issue. It could take a long time, as I need to go through the whole architecture of kyuubi-server and figure out the differences between it and spark-connect, and I might have some discussions with you along the way. Could you assign this issue to me?
Just to clarify here, the intention is to support spark connect client as another connection type to the engine - so you could still use jdbc or notebook (via rest) to the same Spark engine and have all those clients to the same application?
Yes, my initial assumption is to create a Spark 3.4-based SparkSession by providing a remote connection string as a configuration item, and then merge it with the Thrift service to provide the corresponding engine. This configuration must therefore enforce a check that the Spark version is 3.4 or later. Since spark-connect-client already implements SparkSession, this reduces our development work. What do you think?
@tgravescs that's a good question, and we did have an offline discussion about it.
TL;DR, your assumption will be the ultimate version, but not at the beginning.
As you know, the current main flow of Kyuubi is:
       ===[http]
client ===[thrift]====> Server ===[thrift]===> Engine
       ===[etc.]               ---[thrift]---> STS/HS2/Impala (we know someone implemented such a feature internally)
The engine itself is kind of a regular Spark app that basically only consumes Spark's public API, making it easily compatible with multiple Spark versions. As connect is a new feature and connect-server is not supposed to be exposed to the user directly (I suppose only the gRPC API is public API in this case), pulling connect-server into the current Spark engine module directly would break the current assumption. So in the experimental phase we are going to create a dedicated engine module for the connect engine; we may call it SPARK_CONNECT (the current one is SPARK_SQL).
Another important case is Server ===[thrift]===> Engine. Currently, we use Thrift (more specifically, the HiveServer2 Thrift protocol) as the internal RPC protocol, but for connect, gRPC should obviously be used. Keeping two internal RPC protocols is quite complex and redundant, so we tend to create a dedicated experimental server that keeps a similar architecture but rewrites the RPC implementation.
Once the PoC is completed, we can consider merging servers and engines to achieve the final vision as you said.
       ===[http]
       ===[grpc]
client ===[thrift]====> Server ===[grpc]===> Engine
       ===[etc.]               --[thrift]--> STS/HS2/Impala (we know someone implemented such a feature internally)
Maybe @yaooqinn can share more information
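One way to picture the merged server is a protocol-agnostic core shared by several frontends. The sketch below is only an illustration under that assumption: `SessionManagerSketch`, `ThriftFrontendSketch`, and `GrpcFrontendSketch` are hypothetical names, requests are modelled as plain dicts, and no real Thrift or gRPC code is involved.

```python
import uuid
from typing import Dict


class SessionManagerSketch:
    """Hypothetical protocol-agnostic core: knows nothing about Thrift or gRPC."""

    def __init__(self) -> None:
        self._sessions: Dict[str, str] = {}

    def open_session(self, user: str) -> str:
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = user
        return session_id

    def execute(self, session_id: str, statement: str) -> str:
        if session_id not in self._sessions:
            raise KeyError(f"unknown session: {session_id}")
        # A real server would forward the statement to an engine here.
        return f"[{self._sessions[session_id]}] {statement}"


class ThriftFrontendSketch:
    """Adapts a Thrift-style request (modelled as a dict) to the shared core."""

    def __init__(self, manager: SessionManagerSketch) -> None:
        self.manager = manager

    def handle(self, request: dict) -> dict:
        result = self.manager.execute(request["sessionHandle"], request["statement"])
        return {"status": "SUCCESS", "result": result}


class GrpcFrontendSketch:
    """Adapts a gRPC-style request (modelled as a dict) to the same core."""

    def __init__(self, manager: SessionManagerSketch) -> None:
        self.manager = manager

    def handle(self, request: dict) -> dict:
        result = self.manager.execute(request["session_id"], request["plan"])
        return {"result": result}


manager = SessionManagerSketch()
session = manager.open_session("alice")
thrift_reply = ThriftFrontendSketch(manager).handle(
    {"sessionHandle": session, "statement": "select 1"})
grpc_reply = GrpcFrontendSketch(manager).handle(
    {"session_id": session, "plan": "select 1"})
```

Both frontends reach the same session, which mirrors the idea of several client types sharing one application.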
@pan3793 @yaooqinn @ulysses-you @tgravescs
Hello, I have analyzed the processing flow of spark-connect, as shown in the following figure.
- SparkSession.builder.remote(host:port).getOrCreate() creates a SparkConnectClient (RPCClient)
- spark.sql(xxx) actually builds an rpcRequest and then uses the RPCClient to send it to Spark-Connect-Server
- Spark-Connect-Server receives the request, processes it with its local SparkSession, and finally returns the rpcResponse
- The client SparkSession receives the rpcResponse, resolves it, and returns the result
As mentioned above, I believe that in the RPC request process of kyuubi based on SparkConnect, we no longer need the involvement of SparkSession, so I have designed the following process:
- We will implement a KyuubiSparkConnectClient (an RPCClient, based on SparkConnectClient). It will be created when we use EngineRef.getOrCreate to create a KyuubiSparkConnectEngine
- For example, when we use beeline to execute SQL, it will send a Thrift request to the KyuubiSparkConnectFrontendService
- The frontendService will not do anything special, just like in other engines; it will pass the request on to KyuubiSparkConnectService(client: KyuubiSparkConnectClient)
- The backendService, also like in other engines, will use the corresponding operation to handle the request
- The operation will proceed as follows:
  - 5.1 Process the Thrift request and transform it into an RPC request
  - 5.2 Call the client method to process the request
  - 5.3 Receive the RPC response from the Spark-Connect-Server
  - 5.4 Transform the RPC response into a Thrift response
Based on the RPC client, we don't need to create a SparkSession.
What do you think?
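Steps 5.1 to 5.4 of the proposal can be sketched as below. Every type here is a hypothetical stand-in: `FakeConnectClient` plays the role of the proposed KyuubiSparkConnectClient and just echoes the statement back, and the request and response classes are simplified placeholders, not the real Thrift or protobuf messages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ThriftRequestSketch:
    statement: str


@dataclass
class RpcRequestSketch:
    sql: str


@dataclass
class RpcResponseSketch:
    rows: List[list]


@dataclass
class ThriftResponseSketch:
    rows: List[list]


class FakeConnectClient:
    """Stand-in for the proposed KyuubiSparkConnectClient; no network involved."""

    def execute(self, request: RpcRequestSketch) -> RpcResponseSketch:
        # A real client would send the request to Spark-Connect-Server.
        return RpcResponseSketch(rows=[[request.sql]])


def run_operation(client: FakeConnectClient,
                  thrift_request: ThriftRequestSketch) -> ThriftResponseSketch:
    # 5.1 Transform the Thrift request into an RPC request.
    rpc_request = RpcRequestSketch(sql=thrift_request.statement)
    # 5.2 and 5.3 Call the client and receive the RPC response.
    rpc_response = client.execute(rpc_request)
    # 5.4 Transform the RPC response into a Thrift response.
    return ThriftResponseSketch(rows=rpc_response.rows)


response = run_operation(FakeConnectClient(), ThriftRequestSketch("select 1"))
```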
@pan3793 @yaooqinn Hi! Just to clarify: do I understand correctly that for the first iteration, we need to somehow allow gRPC-based engines to coexist with Thrift ones (all current engines) in order to add the SPARK_CONNECT engine? That also means that we will need to rewrite a significant part of the internal communication logic, including KyuubiSession, SessionManager, etc., or decouple it from Thrift.
Or is it expected to start directly from rewriting the current internal RPC mechanism from Thrift (HS2) to gRPC and changing the internal API (kyuubi frontend server <--> engine), so that it will include logical methods from both the old API and the Spark Connect API?
currently, we use Thrift(more specifically, the HiveServer2 Thrift protocol) as the internal RPC protocol, but for connect, obviously gRPC should be used
for the first iteration, we need to somehow allow gRPC-based engines to coexist with Thrift ones (all current engines) in order to add the SPARK_CONNECT engine? That also means that we will need to rewrite a significant part of the internal communication logic, including KyuubiSession, SessionManager, etc., or decouple it from Thrift.
@tigrulya-exe Exactly! I'm doing some experiments in this way, and it does involve lots of refactoring work to support both Thrift and gRPC and reuse code as much as possible. I cannot promise an ETA since I'm not sure how much time I can spend on this task in the next few months. But I will open a draft PR once I make the pipeline work (for example, successfully executing select 1 using a spark-connect client). Meanwhile, I will separate the refactoring changes and push them to the master branch gradually.
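A smoke test like the `select 1` check mentioned above might look as follows from the PySpark Spark Connect client. This is a sketch under assumptions: the host and port are placeholders (the endpoint a Kyuubi Connect server would expose is not stated in this thread), `build_remote_url` and `run_select_one` are hypothetical helpers, and the function needs PySpark 3.4 or later with the connect extras plus a reachable server.

```python
def build_remote_url(host: str, port: int) -> str:
    """Build a Spark Connect connection string of the form sc://host:port."""
    return f"sc://{host}:{port}"


def run_select_one(url: str):
    """Run `select 1` through a Spark Connect client and return the rows."""
    # Imported lazily so build_remote_url works without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(url).getOrCreate()
    try:
        return spark.sql("select 1").collect()
    finally:
        spark.stop()


# Placeholder endpoint; replace it with the address of the server under test.
url = build_remote_url("localhost", 15002)
```

Calling `run_select_one(url)` against a running server is then the whole test; nothing is executed until that call is made.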
@pan3793 great! I would like to participate in the development process, if it's possible :) Do you already have a list of kyuubi parts/classes/modules to refactor, so we can break this big task down into smaller parts to be able to work simultaneously?
Btw, I also noticed that there is #6412 PR, related to this issue. @davidyuan1223 Hi! Is it still active?
@tigrulya-exe I will share with you more details in the next one or two weeks.
Yeah, it's active; you can see it in PR #6412. We first need to verify the feasibility of this solution, but spark-connect 3.5.1, the latest version at the time, has some problems, so I was waiting for the 3.5.2 release (it has now been released). I will verify spark-connect 3.5.2 this week.
A quick and dirty version of Kyuubi Connect is available at https://github.com/apache/kyuubi/pull/6642
@pan3793 Hi! I checked your PoC and built it locally. I tried to run some queries using pyspark and they finished successfully, nice work! Now, I suggest creating a list of tasks that are required to complete this solution. These tasks include supporting all gRPC Spark Connect API methods and refactoring the current code to seamlessly integrate the PoC. This will allow us to work simultaneously and add functionality to the master branch more quickly.
Could you please share any changes that break the current thrift-based logic and any things that need to be refactored that you noticed during the implementation of this solution, so we can use this information as a starting point?