incubator-uniffle
[Bug] The same heartbeat is missed while the heartbeat RPC takes a long time.
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Search before asking
- [X] I have searched in the issues and found no similar issues.
Describe the bug
We've added some new nodes to our production environment, and these nodes sometimes experience RPC timeouts (that's a separate issue). This can cause many tasks to fail: some servers do not receive heartbeats for a long time, so the app data is cleared from those servers.
We found an issue with the `timeoutMs` in the `ShuffleWriteClientImpl#sendAppHeartbeat` method. The `timeoutMs` applied to each RPC is similar to the `timeoutMs` that `ThreadUtils#executeTasks` applies to the whole batch of RPCs to all servers, and the logic here is flawed: a single slow RPC can consume the whole batch budget, so heartbeats to other servers are never sent.
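To illustrate the problem, here is a minimal sketch, not the actual Uniffle code. It assumes that `ThreadUtils#executeTasks` behaves roughly like `ExecutorService#invokeAll` with an overall time limit, and that the heartbeat pool has fewer threads than there are servers. The class `HeartbeatTimeoutSketch` and the methods `fakeHeartbeatRpc` and `sendHeartbeats` are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class HeartbeatTimeoutSketch {

  // Hypothetical stand-in for one heartbeat RPC: it blocks for the server's
  // latency, capped at the per-RPC timeout, and reports whether it succeeded.
  static boolean fakeHeartbeatRpc(long serverLatencyMs, long rpcTimeoutMs)
      throws InterruptedException {
    Thread.sleep(Math.min(serverLatencyMs, rpcTimeoutMs));
    return serverLatencyMs <= rpcTimeoutMs;
  }

  // Sends one simulated heartbeat per server and returns how many succeeded.
  // Assumption: the whole batch shares a single time budget (batchTimeoutMs).
  static int sendHeartbeats(
      long[] serverLatenciesMs, int poolSize, long rpcTimeoutMs, long batchTimeoutMs) {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    try {
      List<Callable<Boolean>> tasks = new ArrayList<>();
      for (long latency : serverLatenciesMs) {
        tasks.add(() -> fakeHeartbeatRpc(latency, rpcTimeoutMs));
      }
      // Tasks that have not finished when the batch budget expires are cancelled,
      // including tasks still waiting in the queue behind a slow RPC.
      List<Future<Boolean>> futures =
          pool.invokeAll(tasks, batchTimeoutMs, TimeUnit.MILLISECONDS);
      int delivered = 0;
      for (Future<Boolean> future : futures) {
        if (future.isCancelled()) {
          continue; // this heartbeat was never delivered
        }
        try {
          if (future.get()) {
            delivered++;
          }
        } catch (ExecutionException e) {
          // a failed RPC counts as not delivered
        }
      }
      return delivered;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while sending heartbeats", e);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    long[] latenciesMs = {100, 2000, 100}; // the second server is slow
    // Per-RPC timeout equal to the batch timeout: the slow RPC uses up the budget.
    System.out.println("same budget:   " + sendHeartbeats(latenciesMs, 1, 300, 300));
    // Batch budget larger than the per-RPC timeout: the last server is still reached.
    System.out.println("larger budget: " + sendHeartbeats(latenciesMs, 1, 300, 3000));
  }
}
```

In this sketch, when the per-RPC timeout equals the batch timeout, the healthy third server never gets its heartbeat because the slow second RPC exhausts the shared budget. Giving the batch a larger budget than a single RPC (or sizing the pool to the number of servers) lets the remaining heartbeats go out; which remedy fits Uniffle is for the PR to decide.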
Affects Version(s)
master
Uniffle Server Log Output
No response
Uniffle Engine Log Output
No response
Uniffle Server Configurations
No response
Uniffle Engine Configurations
No response
Additional context
No response
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!