
[Improvement] Replace gRPC with Netty for data transfer

Open jerqi opened this issue 3 years ago • 6 comments

When we use gRPC, we find that gRPC itself is our bottleneck: it brings the cost of data copy and data serialization, and we inevitably run into GC problems. We should replace gRPC with Netty for data transfer and use off-heap memory to reduce GC time.

jerqi avatar Aug 06 '22 02:08 jerqi
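
As a rough illustration of the direction proposed above, here is a minimal sketch, assuming a Netty-based transport (the handler and method names are hypothetical, not the actual Uniffle implementation), of how incoming shuffle data could be kept in pooled direct (off-heap) buffers instead of on-heap byte arrays:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.util.ReferenceCountUtil;

// Hypothetical handler: buffer incoming shuffle data in pooled direct (off-heap)
// memory so the payload never becomes a long-lived on-heap object that the
// garbage collector has to trace.
public class ShuffleDataHandler extends ChannelInboundHandlerAdapter {

  @Override
  public void channelRead(ChannelHandlerContext ctx, Object msg) {
    ByteBuf in = (ByteBuf) msg;
    try {
      // Move the bytes into a pooled direct buffer owned by the server.
      ByteBuf block = PooledByteBufAllocator.DEFAULT.directBuffer(in.readableBytes());
      block.writeBytes(in);
      store(block);
    } finally {
      ReferenceCountUtil.release(in);
    }
  }

  private void store(ByteBuf block) {
    // Placeholder for handing the block to the in-memory shuffle buffer;
    // the real code must call block.release() once the data has been flushed.
    block.release();
  }
}
```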

Could we use gRPC zero-copy to solve this problem?

By the way, Alluxio also uses gRPC to transfer data, so I think the problem we encountered also exists in Alluxio. We could refer to https://dzone.com/articles/moving-from-apache-thrift-to-grpc-a-perspective-fr?utm_medium=tumblr&utm_source=dlvr.it&utm_campaign=Feed%3A%20dzone%2Fintegration

zuston avatar Sep 24 '22 07:09 zuston
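
For a concrete feel of what a gRPC-side fix could look like, below is a minimal sketch, assuming a pass-through bytes marshaller (hypothetical class name, not Uniffle's or Alluxio's actual code), that ships the shuffle payload as raw bytes and skips protobuf on the data path; true zero-copy would additionally require buffer-level access to gRPC's internal streams:

```java
import io.grpc.MethodDescriptor;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical pass-through marshaller: the payload is handed to gRPC as raw
// bytes, so no protobuf encoding/decoding happens for the data itself.
public final class RawBytesMarshaller implements MethodDescriptor.Marshaller<byte[]> {

  @Override
  public InputStream stream(byte[] value) {
    // No protobuf encoding: wrap the payload directly.
    return new ByteArrayInputStream(value);
  }

  @Override
  public byte[] parse(InputStream stream) {
    try {
      // Still one copy on read; buffer-level access would be needed for true zero-copy.
      return stream.readAllBytes();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}
```

Such a marshaller would only be used for the data-transfer method, while control RPCs could keep their protobuf messages.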

Could we use gRPC zero-copy to solve this problem?

By the way, Alluxio also uses gRPC to transfer data, so I think the problem we encountered also exists in Alluxio. We could refer to https://dzone.com/articles/moving-from-apache-thrift-to-grpc-a-perspective-fr?utm_medium=tumblr&utm_source=dlvr.it&utm_campaign=Feed%3A%20dzone%2Fintegration

It seems that it uses protobuf to serialize the data, which is not efficient enough.

jerqi avatar Sep 26 '22 11:09 jerqi

I haven't looked through the full design of gRPC zero-copy or Alluxio's implementation. Maybe we could do a POC to test the performance compared with the Netty implementation. Overall, it looks easy to extend gRPC with zero-copy, like this: https://github.com/GoogleCloudPlatform/grpc-gcp-java/pull/77

From my perspective, gRPC is general and cross-language, which makes it possible to implement the shuffle server in other languages like Rust in the future, to avoid GC and obtain better memory management.

zuston avatar Sep 27 '22 02:09 zuston

We already have some POC code for Netty. It gives Uniffle a 10% performance improvement.

jerqi avatar Sep 27 '22 02:09 jerqi

We already have some POC code for Netty. It gives Uniffle a 10% performance improvement.

Sounds great! If I have time, I will do a zero-copy POC to compare with Netty.

zuston avatar Sep 27 '22 02:09 zuston

After applying the Netty POC, will the problem in #230 still exist? @jerqi

zuston avatar Sep 27 '22 10:09 zuston

What is the progress of this proposal? Maybe I can join the development.

leixm avatar Oct 21 '22 07:10 leixm

What is the progress of this proposal? Maybe I can join the development.

We only have some POC code. I don't have much time; if you like, I can assign this issue to you and provide our POC code.

jerqi avatar Oct 21 '22 08:10 jerqi

What is the progress of this proposal? Maybe I can join the development.

We only have some POC code. I don't have much time; if you like, I can assign this issue to you and provide our POC code.

+1. Could you push this POC code to another branch?

zuston avatar Oct 21 '22 08:10 zuston

Commit ID: 4463b80deacbcd47ddc151f8d7750d1e5f035077

git apply patch1: netty1.txt
git apply patch2: netty2.txt

After applying the patches, you can get the code at https://github.com/jerqi/incubator-uniffle/tree/netty_poc

According to the POC test, we can get a 10% performance improvement.

jerqi avatar Oct 21 '22 13:10 jerqi

This POC is based on the Uber RSS Netty implementation. It would be better if we could compare Alibaba's RSS Netty implementation, Uber's RSS Netty implementation, and other RSS Netty implementations.

jerqi avatar Oct 21 '22 14:10 jerqi

Commit ID: 4463b80

git apply patch1: netty1.txt
git apply patch2: netty2.txt

After applying the patches, you can get the code at https://github.com/jerqi/incubator-uniffle/tree/netty_poc

According to the POC test, we can get a 10% performance improvement.

Thank you.

leixm avatar Oct 24 '22 03:10 leixm

@leixm @zuston Do you want to discuss this issue in a meeting? I will start a meeting to discuss issue #80, and I want to discuss this issue too. There are some other issues we need to discuss, so I will send an email to our dev mailing list and pick a proper date for the meeting. You can tell me by email what time you are free.

jerqi avatar Oct 25 '22 06:10 jerqi

@leixm @zuston Do you want to discuss this issue in a meeting? I will start a meeting to discuss issue #80, and I want to discuss this issue too. There are some other issues we need to discuss, so I will send an email to our dev mailing list and pick a proper date for the meeting. You can tell me by email what time you are free.

Let's discuss it.

leixm avatar Oct 25 '22 07:10 leixm

@leixm @zuston Do you want to discuss this issue in a meeting? I will start a meeting to discuss issue #80, and I want to discuss this issue too. There are some other issues we need to discuss, so I will send an email to our dev mailing list and pick a proper date for the meeting. You can tell me by email what time you are free.

Let's discuss it.

+1. Could we have a discussion this week?

zuston avatar Oct 25 '22 09:10 zuston

@leixm I have already sent an email: https://lists.apache.org/thread/2jlm3fswmsxy619ldyo4px700p3ybnvc. Do you have time at 11 am (UTC+8) this Thursday?

jerqi avatar Oct 26 '22 03:10 jerqi

Yes, I'm free at this time.

leixm avatar Oct 26 '22 03:10 leixm

Yes, I'm free at this time.

The meeting link is https://meeting.tencent.com/dm/oR95wASCNe91

jerqi avatar Oct 26 '22 06:10 jerqi

Offline Discussion Result: We will compare the RPC frameworks of other RSS implementations, and @leixm will provide a design doc. As the next step, we will make the shuffle server use off-heap memory as much as possible.

jerqi avatar Oct 27 '22 04:10 jerqi
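
To illustrate the "off-heap as much as possible" direction agreed above, here is a minimal sketch, assuming a Netty-backed buffer (the class is hypothetical, not the actual Uniffle code), of a per-partition shuffle buffer that accumulates blocks in direct memory and only returns that memory to the pool after flushing:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical off-heap accumulation buffer for one shuffle partition. Data
// appended here lives in direct memory, so it adds no GC pressure.
public class OffHeapShuffleBuffer {
  private final ByteBuf buffer;

  public OffHeapShuffleBuffer(int capacityBytes) {
    this.buffer = PooledByteBufAllocator.DEFAULT.directBuffer(capacityBytes);
  }

  public void append(ByteBuf block) {
    buffer.writeBytes(block);          // copy block bytes into off-heap memory
  }

  public int sizeBytes() {
    return buffer.readableBytes();
  }

  public void flushTo(OutputStream out) throws IOException {
    buffer.readBytes(out, buffer.readableBytes());  // write out, e.g. to local disk
    buffer.release();                                // return memory to the pool
  }
}
```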

https://docs.google.com/document/d/1srlBlLpJ3hbzd8ru5QaY5aCSLR9M2ttHXmlvcFC-SLg/edit?usp=sharing @jerqi @zuston @smallzhongfeng Can you help review this design?

leixm avatar Dec 04 '22 09:12 leixm

Currently, we use Uniffle version 0.6.0 internally. When tasks are running, the GC of the shuffle server takes up to 40% of the CPU time, and full GC occurs occasionally.

leixm avatar Dec 04 '22 10:12 leixm

@leixm Could you give us comment permission on your design doc?

jerqi avatar Dec 04 '22 11:12 jerqi

Thanks @leixm. There are some questions.

  1. Do you have a POC test? Could you attach the test results?
  2. Did you compare with other plans, such as Uber RSS, Alibaba RSS, and ByteDance RSS? Could we reuse their plans?
  3. Does our client need off-heap memory to read the data from the server?

jerqi avatar Dec 04 '22 13:12 jerqi

1. I did some POC tests, which can significantly reduce GC time. I will sort out and upload some test results later.
2. I made some comparisons. The design we propose is similar to Alibaba RSS. Do you have a better suggestion? I tried to run some tests on each RSS, but I found it difficult to do a precise performance comparison.
3. The read buffer refers to the memory usage of the ShuffleServer, not the client.

leixm avatar Dec 05 '22 04:12 leixm

These RSS implementations have one thing in common: they use off-heap memory, and their GC time is significantly smaller than Uniffle's.

leixm avatar Dec 05 '22 04:12 leixm

1. I did some POC tests, which can significantly reduce GC time. I will sort out and upload some test results later.
2. I made some comparisons. The design we propose is similar to Alibaba RSS. Do you have a better suggestion? I tried to run some tests on each RSS, but I found it difficult to do a precise performance comparison.
3. The read buffer refers to the memory usage of the ShuffleServer, not the client.

Could we have an analysis of every RSS's network transfer? How does the POC code I provided compare? Do they have the same performance?

jerqi avatar Dec 05 '22 04:12 jerqi

@leixm Could you give us comment permission on your design doc?

Done. Can you comment now?

leixm avatar Dec 05 '22 04:12 leixm

1. I did some POC tests, which can significantly reduce GC time. I will sort out and upload some test results later.
2. I made some comparisons. The design we propose is similar to Alibaba RSS. Do you have a better suggestion? I tried to run some tests on each RSS, but I found it difficult to do a precise performance comparison.
3. The read buffer refers to the memory usage of the ShuffleServer, not the client.

Could we have an analysis of every RSS's network transfer? How does the POC code I provided compare? Do they have the same performance?

Any suggestions to measure the performance precisely?

leixm avatar Dec 05 '22 04:12 leixm

I think we can evaluate whether the performance has improved in two parts. The first part is GC time, which is relatively simple: you can evaluate it through the app's running time and by watching the process with jstat. The second part is whether serialization/deserialization has improved, which is more complicated. At present, what I can think of is to evaluate it through transportTime, which includes the client's serialization/deserialization time, the network transmission time, and the ShuffleServer's serialization/deserialization time.

leixm avatar Dec 05 '22 04:12 leixm
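
To make the transportTime breakdown above concrete, here is a small, hypothetical timing sketch (the serializer and sender are stand-ins, not the Uniffle API) that separates client-side serialization time from the round trip:

```java
import java.util.function.Function;

// Hypothetical measurement helper: splits "transportTime" into client-side
// serialization time and round-trip time (network + server-side handling).
public final class TransportTimer {

  public interface Sender {
    void send(byte[] payload) throws Exception;
  }

  public static long[] measure(byte[] block, Function<byte[], byte[]> serializer, Sender sender)
      throws Exception {
    long t0 = System.nanoTime();
    byte[] payload = serializer.apply(block);   // client-side serialization
    long serializeNanos = System.nanoTime() - t0;

    long t1 = System.nanoTime();
    sender.send(payload);                       // network + server-side deserialization
    long roundTripNanos = System.nanoTime() - t1;

    return new long[] {serializeNanos, roundTripNanos};
  }
}
```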

1. I did some POC tests, which can significantly reduce GC time. I will sort out and upload some test results later.
2. I made some comparisons. The design we propose is similar to Alibaba RSS. Do you have a better suggestion? I tried to run some tests on each RSS, but I found it difficult to do a precise performance comparison.
3. The read buffer refers to the memory usage of the ShuffleServer, not the client.

Could we have an analysis of every RSS's network transfer? How does the POC code I provided compare? Do they have the same performance?

Any suggestions to measure the performance precisely?

I don't know whether a micro benchmark can solve this problem. GC time can only be evaluated when we use several jobs to verify it.

jerqi avatar Dec 05 '22 04:12 jerqi