[Bug] The same heartbeat is missed while the heartbeat RPC takes a long time.

Open xumanbu opened this issue 4 months ago • 0 comments

Code of Conduct

Search before asking

  • [X] I have searched in the issues and found no similar issues.

Describe the bug

We've added some new nodes to our production environment, and these nodes sometimes experience RPC timeouts (that's a separate issue). The timeouts can cause many tasks to fail: some nodes go without heartbeats for a long time, so the app data is cleared from the server.

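We found a problem with how timeoutMs is used in the ShuffleWriteClientImpl#sendAppHeartbeat method. The timeoutMs applied to each individual RPC is effectively the same as the timeoutMs that ThreadUtils#executeTasks applies to the whole batch of RPCs to all servers. That logic is flawed: a single RPC that runs until its own timeout can use up the entire batch budget, so heartbeats to the remaining servers can be missed.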
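To illustrate the timeout interaction described above, here is a minimal, self-contained sketch. It is not the Uniffle code: `HeartbeatTimeoutSketch`, `fakeHeartbeat`, and `sendHeartbeats` are hypothetical names, the RPCs are simulated with `Thread.sleep`, and the batch helper is assumed to behave like `ExecutorService#invokeAll` with a timeout (tasks not finished by the deadline are cancelled). The pool is deliberately limited to one thread so that a slow RPC blocks the one queued behind it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HeartbeatTimeoutSketch {

  // Simulates one heartbeat RPC (hypothetical stand-in for a real client call):
  // it succeeds after rpcDurationMs, or fails once perRpcTimeoutMs has elapsed.
  public static Callable<Boolean> fakeHeartbeat(long rpcDurationMs, long perRpcTimeoutMs) {
    return () -> {
      if (rpcDurationMs > perRpcTimeoutMs) {
        Thread.sleep(perRpcTimeoutMs);
        throw new TimeoutException("heartbeat RPC timed out");
      }
      Thread.sleep(rpcDurationMs);
      return true;
    };
  }

  // Returns how many simulated servers received a heartbeat before the batch deadline.
  public static int sendHeartbeats(
      long[] rpcDurationsMs, int threads, long perRpcTimeoutMs, long batchTimeoutMs) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<Boolean>> tasks = new ArrayList<>();
      for (long duration : rpcDurationsMs) {
        tasks.add(fakeHeartbeat(duration, perRpcTimeoutMs));
      }
      // invokeAll cancels every task that has not completed when the timeout expires.
      List<Future<Boolean>> futures =
          pool.invokeAll(tasks, batchTimeoutMs, TimeUnit.MILLISECONDS);
      int delivered = 0;
      for (Future<Boolean> future : futures) {
        if (future.isCancelled()) {
          continue; // never ran, or was cut off by the batch deadline: heartbeat missed
        }
        try {
          if (future.get()) {
            delivered++;
          }
        } catch (ExecutionException e) {
          // the simulated RPC itself timed out
        }
      }
      return delivered;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    long[] durations = {10_000, 100}; // first server hangs, second one is healthy
    // Batch timeout equal to the per-RPC timeout: the healthy server is starved.
    System.out.println(sendHeartbeats(durations, 1, 300, 300));
    // Batch timeout larger than the per-RPC timeout: the healthy server is reached.
    System.out.println(sendHeartbeats(durations, 1, 300, 1_000));
  }
}
```

With equal timeouts, the hanging RPC consumes the whole batch budget and the healthy server's heartbeat is cancelled before it runs. Giving the batch a larger budget than a single RPC (or using enough threads that no RPC waits behind another) lets the healthy server be reached. Which of these fits the real client is a design decision for the PR, not something this sketch settles.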

Affects Version(s)

master

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

xumanbu, Oct 16 '24 03:10