
[TASK][CHALLENGE] Support Spark Connect Frontend/Backend

Open ulysses-you opened this issue 1 year ago • 24 comments

Code of Conduct

Search before creating

  • [X] I have searched in the task list and found no similar tasks.

Mentor

  • [X] I have sufficient knowledge and experience of this task, and I volunteer to be the mentor of this task to guide contributors to complete the task.

Skill requirements

  • Knowledge about Spark Connect
  • Knowledge about Kyuubi architecture
  • Knowledge about protobuf
  • Knowledge about grpc
  • Knowledge about thrift

Background and Goals

Make Kyuubi server compatible with Spark Connect protocol, so that people can use Spark Connect client to connect to Kyuubi Server.

image

Implementation steps

  1. Add a new Spark Connect frontend
     1.1 Add a basic gRPC server as the frontend
     1.2 Be compatible with the Spark Connect protocol, see https://github.com/apache/spark/blob/master/connector/connect/common/src/main/protobuf/spark/connect/base.proto
     1.3 Support ExecutePlan
     1.4 Support AnalyzePlan
     1.5 Support Config
     1.6 Support AddArtifacts
     1.7 Support ArtifactsStatus
     1.8 Support Interrupt
     1.9 Support ReattachExecute
     1.10 Support ReleaseExecute
     1.11 Serialize the protobuf-based request

  2. Add a new Spark Connect backend
     2.1 Import Spark-Connect-Server and rewrite SparkConnectService https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectServer.scala
     2.2 Deserialize the response into a protobuf-based message

  3. Add IT

  4. Add docs
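A rough way to track steps 1.3 to 1.10 is a frontend that routes each RPC name to a handler and reports which ones are still missing. This is only a sketch in plain Python, not Kyuubi or Spark code: `ConnectFrontend`, `Unimplemented`, and the handler signature are hypothetical names, and the RPC names are copied from the list above.

```python
from enum import Enum


class Rpc(Enum):
    """RPC names copied from steps 1.3 to 1.10 of the implementation list."""
    EXECUTE_PLAN = "ExecutePlan"
    ANALYZE_PLAN = "AnalyzePlan"
    CONFIG = "Config"
    ADD_ARTIFACTS = "AddArtifacts"
    ARTIFACTS_STATUS = "ArtifactsStatus"
    INTERRUPT = "Interrupt"
    REATTACH_EXECUTE = "ReattachExecute"
    RELEASE_EXECUTE = "ReleaseExecute"


class Unimplemented(Exception):
    """Hypothetical stand-in for a gRPC UNIMPLEMENTED status."""


class ConnectFrontend:
    """Hypothetical frontend that routes an RPC name to a registered handler."""

    def __init__(self):
        self._handlers = {}

    def register(self, rpc, handler):
        self._handlers[rpc] = handler

    def call(self, rpc_name, request):
        rpc = Rpc(rpc_name)  # raises ValueError for names outside the list
        handler = self._handlers.get(rpc)
        if handler is None:
            raise Unimplemented(rpc.value)
        return handler(request)

    def missing(self):
        """Return the RPC names that have no handler yet."""
        return [rpc.value for rpc in Rpc if rpc not in self._handlers]
```

Registering one handler at a time and checking `missing()` gives a simple checklist for the sub-issues.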

Additional context

Introduction of https://github.com/apache/kyuubi/issues/6232

ulysses-you avatar Oct 09 '23 02:10 ulysses-you

I think this is very challenging, but I want to give it a try. Can you assign it to me and help me?

yehere avatar Oct 10 '23 11:10 yehere

Sure, thank you @yehere! This is a kind of umbrella issue; we can create sub-issues one by one later.

ulysses-you avatar Oct 10 '23 12:10 ulysses-you

This huge task could be divided into several different level tasks, feel free to go ahead ~ all your contributions will be counted eventually :)

pan3793 avatar Oct 10 '23 12:10 pan3793

cc @cfmcgrady

cfmcgrady avatar Oct 11 '23 02:10 cfmcgrady

sure, thank you @yehere ! This is a kind of umbrella, we can create sub issue one by one later.

I'm also interested in it, hope to work together.

zhaomin1423 avatar Oct 12 '23 01:10 zhaomin1423

Thank you @zhaomin1423, glad to see you are interested.

ulysses-you avatar Oct 12 '23 10:10 ulysses-you

How about a co-located mode with Kyuubi's Spark SQL engine? A separate service is good and basic, but it also needs more resources for more Spark instances.

minyk avatar Oct 19 '23 02:10 minyk

@minyk they are in different processes, just like the Spark Thrift server and the Connect server. We are going to add a new module and a new server for Kyuubi Connect. We can do it together if you are interested.

ulysses-you avatar Oct 19 '23 02:10 ulysses-you

@ulysses-you Hello, I'm interested in this component and hope to work with you.

davidyuan1223 avatar Apr 04 '24 11:04 davidyuan1223

@yaooqinn @pan3793 @ulysses-you I found that Spark has published the connect module to the Maven repository: https://mvnrepository.com/artifact/org.apache.spark/spark-connect-client-jvm_2.13/3.4.0 Can we use those packages to simplify the code? Maybe we could provide a connection string to SparkSession, as described in the code at https://github.com/apache/spark/blob/master/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala Based on the spark-connect package, we can reduce the gRPC server and proto work. What do you think?

davidyuan1223 avatar Apr 16 '24 14:04 davidyuan1223

I haven't had a deep look at it; my current thought is:

  1. for the server part, we only need a thin gRPC layer; copying the proto files and regenerating the gRPC files is fine.
  2. for the engine part, we can reuse the connect-server module to simplify the code.

pan3793 avatar Apr 16 '24 14:04 pan3793

@davidyuan1223 sure, please go ahead. +1 for @pan3793's thought.

ulysses-you avatar Apr 17 '24 04:04 ulysses-you

@ulysses-you @pan3793 Understood. I'd like to try this challenging issue, which could go on for a long time, as I need to go through the whole architecture of kyuubi-server and figure out the differences between it and spark-connect, and in the process I might have some discussions with you. Could you assign this issue to me?

davidyuan1223 avatar Apr 17 '24 11:04 davidyuan1223

Just to clarify here, the intention is to support spark connect client as another connection type to the engine - so you could still use jdbc or notebook (via rest) to the same Spark engine and have all those clients to the same application?

tgravescs avatar Apr 17 '24 13:04 tgravescs

Just to clarify here, the intention is to support spark connect client as another connection type to the engine - so you could still use jdbc or notebook (via rest) to the same Spark engine and have all those clients to the same application?

Yes, my initial assumption is to create a 3.4-based SparkSession by providing the remote connection string as a configuration item, and then merge it with the Thrift service to provide the corresponding engine (so this configuration must force a check that the Spark version is at least 3.4; spark-connect-client already implements SparkSession, which reduces our development work). What do you think?
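The version check mentioned here could look roughly like the following. It is a minimal sketch, assuming the check is meant to include 3.4 itself (the release where Spark Connect first appeared) and that the version arrives as a dotted string; `supports_spark_connect` is a hypothetical helper name, not an existing Kyuubi function.

```python
def supports_spark_connect(version: str) -> bool:
    """Return True if a dotted Spark version string is 3.4 or newer.

    Only the major and minor components are compared, so suffixes such as
    "-SNAPSHOT" on the patch component do not matter.
    """
    parts = version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    # Tuple comparison handles a future major version (for example 4.0) correctly.
    return (major, minor) >= (3, 4)
```

Comparing `(major, minor)` as a tuple avoids the usual pitfall of comparing version strings lexically.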

davidyuan1223 avatar Apr 17 '24 15:04 davidyuan1223

@tgravescs that's a good question, and we did have an offline discussion about it.

TL;DR, your assumption will be the ultimate version, but not at the beginning.

As you know the current main flow of Kyuubi is:

       ===[http]
client ===[thrift]====> Server ===[thrift]===> Engine
       ===[etc.]               ---[thrift]---> STS/HS2/Impala (we know someone implemented such a feature internally)

The engine itself is kind of a regular Spark app that basically only consumes Spark's public API, making it easily compatible with multiple Spark versions. As connect is a new feature and connect-server is not supposed to be exposed to the user directly (I suppose only the gRPC API is public API in this case), pulling connect-server into the current Spark engine module directly would break the current assumption. So in the experimental phase we are going to create a dedicated engine module for the connect engine; we may call it SPARK_CONNECT (the current one is SPARK_SQL).

Another important case is Server ===[thrift]===> Engine. Currently, we use Thrift (more specifically, the HiveServer2 Thrift protocol) as the internal RPC protocol, but for connect, obviously gRPC should be used. Keeping two internal RPC protocols is quite complex and redundant, so we tend to create a dedicated experimental server that keeps a similar architecture but rewrites the RPC implementation.

Once the PoC is completed, we can consider merging servers and engines to achieve the final vision as you said.

       ===[http]
       ===[grpc]
client ===[thrift]====> Server ===[grpc]===> Engine
       ===[etc.]               --[thrift]--> STS/HS2/Impala (we know someone implemented such a feature internally)

Maybe @yaooqinn can share more information

pan3793 avatar Apr 18 '24 02:04 pan3793

@pan3793 @yaooqinn @ulysses-you @tgravescs Hello, I have analyzed the processing flow of spark-connect, as shown in the following figure.

image

  1. SparkSession.builder.remote(host:port).getOrCreate() creates a SparkConnectClient (RPCClient).
  2. spark.sql(xxx) actually builds an RPC request and then uses the RPCClient to communicate with Spark-Connect-Server.
  3. Spark-Connect-Server then receives the request, processes it with its local SparkSession, and finally returns the RPC response.
  4. The client SparkSession receives the RPC response, resolves it, and returns the result.
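The four steps above can be mimicked with a toy round trip. This is a sketch only: JSON stands in for protobuf, a direct method call stands in for gRPC, and `LocalSession`, `ConnectServer`, and `RemoteSession` are hypothetical names rather than Spark classes.

```python
import json


class LocalSession:
    """Stand-in for the server-side SparkSession; only understands 'SELECT <int>'."""

    def sql(self, query):
        literal = query.strip().split()[-1]
        return [[int(literal)]]


class ConnectServer:
    """Stand-in for Spark-Connect-Server (step 3)."""

    def __init__(self):
        self._session = LocalSession()

    def execute_plan(self, request_bytes):
        request = json.loads(request_bytes)
        rows = self._session.sql(request["sql"])
        response = {"session_id": request["session_id"], "rows": rows}
        return json.dumps(response).encode()


class RemoteSession:
    """Stand-in for the client-side session created in step 1."""

    def __init__(self, server, session_id):
        self._server = server
        self._session_id = session_id

    def sql(self, query):
        # Step 2: build the request and send it through the RPC client.
        request = json.dumps({"session_id": self._session_id, "sql": query}).encode()
        # Step 4: decode the response and hand the rows back to the caller.
        response = json.loads(self._server.execute_plan(request))
        return response["rows"]
```

The point of the sketch is that the client side never runs the query itself; it only serializes a request and decodes the response.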

As mentioned above, I believe that in the RPC request process of kyuubi based on SparkConnect, we no longer need the involvement of SparkSession, so I have designed the following process:

image

  1. We will implement a KyuubiSparkConnectClient (RPCClient, based on SparkConnectClient). It will be created when we use EngineRef.getOrCreate to create a KyuubiSparkConnectEngine.
  2. For example, when we use beeline to execute SQL, it will send a Thrift request to the KyuubiSparkConnectFrontendService.
  3. The frontendService will not do anything special, just like in other engines; it will pass the request to KyuubiSparkConnectService(client: KyuubiSparkConnectClient).
  4. The backendService, also like in other engines, will use the corresponding operation to handle the request.
  5. The operation will proceed as follows:
     5.1 Process the Thrift request and transform it into an RPC request
     5.2 Call the client method to process the request
     5.3 Receive the RPC response from the Spark-Connect-Server
     5.4 Transform the RPC response into a Thrift response
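Step 5 can be sketched as a small translation layer. All types here are hypothetical stand-ins written as plain dataclasses; real code would use the HiveServer2 Thrift structs and the Spark Connect protobuf messages, and the `client` callable stands in for the KyuubiSparkConnectClient proposed above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ThriftRequest:
    """Hypothetical stand-in for a Thrift execute-statement request."""
    session_handle: str
    statement: str


@dataclass
class RpcRequest:
    """Hypothetical stand-in for a Spark Connect request."""
    session_id: str
    sql: str


@dataclass
class RpcResponse:
    """Hypothetical stand-in for a Spark Connect response."""
    rows: List[list]


@dataclass
class ThriftResponse:
    """Hypothetical stand-in for a Thrift response."""
    status: str
    rows: List[list]


class ExecuteStatementOperation:
    """Translates Thrift to RPC and back, following steps 5.1 to 5.4."""

    def __init__(self, client: Callable[[RpcRequest], RpcResponse]):
        self._client = client

    def run(self, request: ThriftRequest) -> ThriftResponse:
        # 5.1: transform the Thrift request into an RPC request.
        rpc_request = RpcRequest(session_id=request.session_handle, sql=request.statement)
        # 5.2 and 5.3: call the client and receive the RPC response.
        rpc_response = self._client(rpc_request)
        # 5.4: transform the RPC response into a Thrift response.
        return ThriftResponse(status="SUCCESS", rows=rpc_response.rows)
```

Because the client is injected as a callable, the operation can be tested with a fake client before any real Spark-Connect-Server is available.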

Based on the RPC client, we don't need to create a SparkSession.

What do you think?

davidyuan1223 avatar Apr 18 '24 17:04 davidyuan1223

@pan3793 @yaooqinn Hi! Just to clarify - do I understand correctly that, for the first iteration, we need to somehow allow gRPC-based engines to coexist with Thrift ones (all current engines) in order to add the SPARK_CONNECT engine? That also means that we will need to rewrite a significant part of the internal communication logic, including KyuubiSession, SessionManager, etc., or decouple it from Thrift.

Or is it expected to start directly by rewriting the current internal RPC mechanism from Thrift (HS2) to gRPC and changing the internal API (kyuubi frontend server <--> engine), so that it will include logical methods from both the old API and the Spark Connect API?

currently, we use Thrift(more specifically, the HiveServer2 Thrift protocol) as the internal RPC protocol, but for connect, obviously gRPC should be used

tigrulya-exe avatar Aug 15 '24 13:08 tigrulya-exe

for the first iteration, we need to somehow allow gRPC-based engines to coexist with Thrift ones (all current engines) in order to add the SPARK_CONNECT engine? That also means, that we will need to rewrite a significant part of the internal communication logic, including KyuubiSession, SessionManager, etc. or decouple it from Thrift.

@tigrulya-exe Exactly! I'm doing some experiments in this way, and it does involve lots of refactoring work to support both Thrift and gRPC and reuse code as much as possible. I can not promise an ETA since I'm not sure how much time I can spend on this task in the next few months. But I will open a draft PR once I make the pipeline work (for example, successfully executing select 1 using a spark-connect client), meanwhile, I will separate the refactoring changes and push them to the master branch gradually.
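One way to picture "support both Thrift and gRPC and reuse code" is a transport-neutral core with a thin adapter per protocol. This is a sketch under that assumption, not the structure of the actual refactoring: the class names and the response shapes are hypothetical, and the `SessionManager` here is unrelated to Kyuubi's real class of the same name.

```python
import uuid


class SessionManager:
    """Hypothetical transport-neutral core; knows nothing about Thrift or gRPC."""

    def __init__(self):
        self._sessions = {}

    def open_session(self, user, protocol):
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = {"user": user, "protocol": protocol}
        return session_id

    def close_session(self, session_id):
        self._sessions.pop(session_id, None)

    def active(self):
        return len(self._sessions)


class ThriftFrontend:
    """Hypothetical adapter: maps a Thrift-style call onto the shared core."""

    def __init__(self, manager):
        self._manager = manager

    def open_session(self, username):
        return {"sessionHandle": self._manager.open_session(username, "thrift")}


class GrpcFrontend:
    """Hypothetical adapter: maps a gRPC-style call onto the same core."""

    def __init__(self, manager):
        self._manager = manager

    def open_session(self, user_id):
        return {"session_id": self._manager.open_session(user_id, "grpc")}
```

Both adapters share one manager, so session bookkeeping is written once and only the request and response shapes differ per protocol.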

pan3793 avatar Aug 15 '24 13:08 pan3793

@pan3793 great! I would like to participate in the development process, if it's possible :) Do you already have a list of kyuubi parts/classes/modules to refactor, so we can break this big task down into smaller parts to be able to work simultaneously?

Btw, I also noticed that there is #6412 PR, related to this issue. @davidyuan1223 Hi! Is it still active?

tigrulya-exe avatar Aug 15 '24 14:08 tigrulya-exe

@tigrulya-exe I will share with you more details in the next one or two weeks.

pan3793 avatar Aug 15 '24 14:08 pan3793

@pan3793 great! I would like to participate in the development process, if it's possible :) Do you already have a list of kyuubi parts/classes/modules to refactor, so we can break this big task down into smaller parts to be able to work simultaneously?

Btw, I also noticed that there is #6412 PR, related to this issue. @davidyuan1223 Hi! Is it still active?

Yeah, it's active; you can see PR #6412. We first need to verify the feasibility of this solution, but the latest spark-connect version, 3.5.1, has some issues, so I was waiting for the new 3.5.2 release (which is now out). I will verify spark-connect 3.5.2 this week.

davidyuan1223 avatar Aug 16 '24 01:08 davidyuan1223

A quick and dirty version of Kyuubi Connect is available at https://github.com/apache/kyuubi/pull/6642

pan3793 avatar Aug 23 '24 16:08 pan3793

@pan3793 Hi! I checked your PoC and built it locally. I tried to run some queries using pyspark and they finished successfully, nice work! Now, I suggest creating a list of tasks that are required to complete this solution. These tasks include supporting all gRPC Spark Connect API methods and refactoring the current code to seamlessly integrate the PoC. This will allow us to work simultaneously and add functionality to the master branch more quickly.

Could you please share any changes that break the current thrift-based logic and any things that need to be refactored that you noticed during the implementation of this solution, so we can use this information as a starting point?

tigrulya-exe avatar Sep 02 '24 13:09 tigrulya-exe