incubator-uniffle

[FEATURE] Add rpc queued time and rpc process time.

Open zhengchenyu opened this issue 10 months ago • 5 comments

Code of Conduct

Search before asking

  • [X] I have searched in the issues and found no similar issues.

Describe the feature

When I perform a stress test on the cluster, tasks occasionally encounter errors like this:

Caused by: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: ClientCall was cancelled at or after deadline. [closed=[CANCELLED], committed=[remote_addr=xxx/xxx:xxx]]
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)

Increasing rss.rpc.executor.size reduces the probability of this error.

At such times, the grpc_server_executor_blocking_queue_size metric looks like this: [screenshot, 2024-04-15 10:40:12 AM]

This may mean that RPC requests are not being processed in a timely manner, but no existing metric directly confirms it. We should therefore add metrics for RPC queued time and RPC process time.
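To make the two proposed metrics concrete, here is a minimal sketch of how they could be separated. It is not Uniffle's implementation: the class name `TimedRpcTask` and its counters are hypothetical, and a real server would probably feed a histogram or summary metric rather than plain accumulators. The idea is only that queued time is the gap between submission and the start of execution, and process time is the duration of the handler itself.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch: wraps an RPC handler so that the time spent waiting in
 * the executor queue and the time spent running are recorded separately.
 */
final class TimedRpcTask implements Runnable {
  // Hypothetical accumulators (assumption); stand-ins for real metrics.
  static final AtomicLong totalQueuedNanos = new AtomicLong();
  static final AtomicLong totalProcessNanos = new AtomicLong();
  static final AtomicLong completed = new AtomicLong();

  private final Runnable delegate;
  // Captured at construction, i.e. just before the task is handed to the executor.
  private final long enqueuedAtNanos = System.nanoTime();

  TimedRpcTask(Runnable delegate) {
    this.delegate = delegate;
  }

  @Override
  public void run() {
    long startedAtNanos = System.nanoTime();
    // Queued time: from submission until a worker thread picks the task up.
    totalQueuedNanos.addAndGet(startedAtNanos - enqueuedAtNanos);
    try {
      delegate.run();
    } finally {
      // Process time: how long the handler itself ran, even if it threw.
      totalProcessNanos.addAndGet(System.nanoTime() - startedAtNanos);
      completed.incrementAndGet();
    }
  }
}
```

A caller would submit work as `executor.execute(new TimedRpcTask(handler))`. If queued time grows while process time stays flat, the executor is undersized for the load. If process time grows, the handlers themselves are slow.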

Motivation

No response

Describe the solution

No response

Additional context

No response

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

zhengchenyu, Apr 15 '24 02:04

Please assign this issue to me.

qijiale76, Apr 15 '24 03:04

This looks interesting.

maobaolong, Jul 02 '24 12:07

@maobaolong You can go ahead if you want. I think @qijiale76 might not have time for this. We now mostly use Netty for sending and fetching shuffle data, and use gRPC only for other requests. So we may want to add these metrics for both Netty and gRPC.

rickyma, Jul 02 '24 12:07

@rickyma Great! It's important to know how much pressure the coordinator and shuffle servers are under.

maobaolong, Jul 03 '24 02:07

@maobaolong Feel free to continue working on this issue if you are interested and have time.

qijiale76, Jul 04 '24 02:07