incubator-uniffle [FEATURE] Add rpc queued time and rpc process time.

Open zhengchenyu opened this issue 10 months ago • 5 comments

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Search before asking

[X] I have searched in the issues and found no similar issues.

Describe the feature

When I perform a stress test on the cluster, tasks occasionally encounter errors like this:

Caused by: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: ClientCall was cancelled at or after deadline. [closed=[CANCELLED], committed=[remote_addr=xxx/xxx:xxx]]
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)

We can increase rss.rpc.executor.size to reduce the probability of problems.

At those times, the grpc_server_executor_blocking_queue_size metric looks like this: [screenshot, 2024-04-15 10:40:12 AM]

This suggests that RPC requests are waiting in the executor queue instead of being processed promptly, but no existing metric directly confirms it. So we should add metrics for RPC queued time and RPC process time.
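To make the two proposed measurements concrete, here is a minimal sketch of how queued time and process time could be separated for tasks submitted to an executor. It is not Uniffle's implementation: the class `RpcTimingSketch`, the holder `RpcTimings`, and the helper `timed` are hypothetical names, and a real server would presumably report to its metrics system rather than to in-memory counters.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class RpcTimingSketch {

    /** Hypothetical holder for the timings; stands in for real metrics. */
    static final class RpcTimings {
        final LongAdder queuedNanos = new LongAdder();
        final LongAdder processNanos = new LongAdder();
        final LongAdder calls = new LongAdder();
    }

    /**
     * Wraps a task at submission time. The enqueue timestamp is captured
     * when the wrapper is created; the start timestamp is captured when a
     * worker thread actually begins running it.
     */
    static Runnable timed(Runnable task, RpcTimings timings) {
        final long enqueuedAt = System.nanoTime();
        return () -> {
            final long startedAt = System.nanoTime();
            // Queued time: how long the task waited before a worker picked it up.
            timings.queuedNanos.add(startedAt - enqueuedAt);
            try {
                task.run();
            } finally {
                // Process time: how long the handler itself ran.
                timings.processNanos.add(System.nanoTime() - startedAt);
                timings.calls.increment();
            }
        };
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        RpcTimings timings = new RpcTimings();
        // A single worker forces later tasks to wait in the queue,
        // imitating an undersized RPC executor.
        ExecutorService pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        for (int i = 0; i < 3; i++) {
            pool.execute(timed(() -> sleepQuietly(50), timings));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);

        long calls = timings.calls.sum();
        System.out.printf("calls=%d avgQueuedMs=%.1f avgProcessMs=%.1f%n",
                calls,
                timings.queuedNanos.sum() / 1e6 / calls,
                timings.processNanos.sum() / 1e6 / calls);
    }
}
```

With one worker thread, the average queued time grows with the backlog while the average process time stays near the handler's own duration. That distinction is what the blocking queue size alone cannot show.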

Motivation

No response

Describe the solution

No response

Additional context

No response

Are you willing to submit PR?

[X] Yes I am willing to submit a PR!

Apr 15 '24 02:04 zhengchenyu

Please assign this issue to me.

Apr 15 '24 03:04 qijiale76

This looks interesting.

Jul 02 '24 12:07 maobaolong

@maobaolong You can go ahead if you want. I think @qijiale76 might not have time on this. We now mostly use Netty for sending/getting shuffle data. We only use gRPC to send other requests rather than shuffle data. So we may want to add metrics both for Netty and gRPC.

Jul 02 '24 12:07 rickyma

@rickyma Great! It's important to know the pressure of coordinator and shuffle server.

Jul 03 '24 02:07 maobaolong

@maobaolong Feel free to continue working on this issue if you are interested and have time.

Jul 04 '24 02:07 qijiale76
