incubator-uniffle
[FEATURE] Add rpc queued time and rpc process time.
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Search before asking
- [X] I have searched in the issues and found no similar issues.
Describe the feature
When I perform a stress test on the cluster, tasks occasionally encounter errors like this:
Caused by: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: ClientCall was cancelled at or after deadline. [closed=[CANCELLED], committed=[remote_addr=xxx/xxx:xxx]]
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
We can increase rss.rpc.executor.size to reduce the probability of problems.
When this happens, the grpc_server_executor_blocking_queue_size metric looks like this:
[screenshot of the metric, not preserved]
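This may mean that RPC requests are not being processed in a timely manner, but no existing metric directly proves it. So we should add metrics for RPC queued time (how long a request waits in the executor queue) and RPC process time (how long the handler takes once it starts running).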
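To make the two proposed measurements concrete, here is a minimal sketch of how they could be separated at the executor level. It is not Uniffle's or gRPC's actual implementation: the class `RpcTimingSketch`, the `timed` wrapper, and the `LongAdder` totals are all hypothetical names chosen for illustration, and a real server would more likely report these values through its metrics system as a histogram or summary.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper for illustration only; not part of Uniffle or gRPC.
class RpcTimingSketch {
  // Running totals in nanoseconds (assumption: a real server would export histograms instead).
  static final LongAdder queuedNanos = new LongAdder();
  static final LongAdder processNanos = new LongAdder();
  static final LongAdder completed = new LongAdder();

  // Wraps a task so that queue wait time and run time are recorded separately.
  static Runnable timed(Runnable task) {
    final long enqueuedAt = System.nanoTime(); // taken when the task is submitted
    return () -> {
      final long startedAt = System.nanoTime(); // a worker thread picked the task up
      queuedNanos.add(startedAt - enqueuedAt);
      try {
        task.run();
      } finally {
        processNanos.add(System.nanoTime() - startedAt);
        completed.increment();
      }
    };
  }

  // Simulates `requests` RPCs of `workMillis` each on a single worker thread,
  // so later requests must wait in the queue behind earlier ones.
  static void simulate(int requests, long workMillis) {
    ThreadPoolExecutor executor =
        new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    for (int i = 0; i < requests; i++) {
      executor.execute(timed(() -> sleepQuietly(workMillis)));
    }
    executor.shutdown();
    try {
      executor.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for tasks", e);
    }
  }

  static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    simulate(5, 20);
    long n = completed.sum();
    System.out.printf("requests: %d%n", n);
    System.out.printf("avg queued ms: %.1f%n", queuedNanos.sum() / 1e6 / n);
    System.out.printf("avg process ms: %.1f%n", processNanos.sum() / 1e6 / n);
  }
}
```

With a single worker thread, the average queued time grows with the backlog while the process time stays near the per-request cost. That is the distinction the blocking queue size metric alone cannot show.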
Motivation
No response
Describe the solution
No response
Additional context
No response
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
Please assign this issue to me.
This looks interesting.
@maobaolong You can go ahead if you want. I think @qijiale76 might not have time on this. We now mostly use Netty for sending/getting shuffle data. We only use gRPC to send other requests rather than shuffle data. So we may want to add metrics both for Netty and gRPC.
@rickyma Great! It's important to know the pressure of coordinator and shuffle server.
@maobaolong Feel free to continue working on this issue if you are interested and have time.