flink-remote-shuffle

Explicitly exit the ShuffleWorker process when the termination future completes

Open Aitozi opened this issue 2 years ago • 1 comments

In our usage, we encountered a case where the shuffle worker's registration timed out and triggered a fatal error, but the shuffle worker process did not exit, so no new worker was spawned to replace it.

The reason is that, on a fatal error, the shuffle worker executes closeAsync and shuts down all of its component services. The process is then expected to exit once all non-daemon threads have exited. However, our metric client started an extra thread that was not closed properly, and that thread kept the process alive. This should be fixed by closing those threads in the reporter#close method.
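To illustrate the kind of fix meant here, below is a minimal sketch of a reporter that stops its own background thread in close. `PeriodicReporter`, `open`, `report` and `isStopped` are hypothetical names for illustration only; this is not our actual metric client nor any interface from flink-remote-shuffle.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical reporter used only to illustrate the thread leak. */
final class PeriodicReporter implements AutoCloseable {

    // The default thread factory creates NON-daemon threads, so a scheduler
    // that is never shut down keeps the JVM alive after all services close.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void open() {
        // Assumed reporting interval of 10 seconds, chosen arbitrarily.
        scheduler.scheduleAtFixedRate(this::report, 0, 10, TimeUnit.SECONDS);
    }

    private void report() {
        // Placeholder: a real reporter would send metrics here.
    }

    @Override
    public void close() {
        // Cancel pending reports and interrupt the worker thread.
        scheduler.shutdownNow();
        try {
            // Give the thread a bounded amount of time to finish.
            scheduler.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
        }
    }

    boolean isStopped() {
        return scheduler.isTerminated();
    }
}
```

Without the `shutdownNow` call in `close`, the scheduler thread would outlive the reporter and block a normal JVM exit.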

However, I still think we should improve the shutdown logic a bit. We could explicitly exit the shuffle worker process when the termination future completes, so that shutdown is safe even when some threads cannot be released in time.
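A rough sketch of the proposal, assuming the worker exposes its termination as a `CompletableFuture`. The class `ExitOnTermination`, the method `install`, and the exit codes are hypothetical and only for illustration; the actual entry point code may look different.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.IntConsumer;

/** Hypothetical helper; names and exit codes are assumptions. */
final class ExitOnTermination {

    static final int SUCCESS_EXIT_CODE = 0;
    static final int FAILURE_EXIT_CODE = 1;

    private ExitOnTermination() {}

    /**
     * Calls {@code exit} once the termination future completes, whether
     * normally or exceptionally. In production {@code exit} would be
     * {@code System::exit}; it is a parameter so the logic can be tested.
     */
    static <T> void install(CompletableFuture<T> terminationFuture, IntConsumer exit) {
        terminationFuture.whenComplete(
                (ignored, error) ->
                        exit.accept(error == null ? SUCCESS_EXIT_CODE : FAILURE_EXIT_CODE));
    }
}
```

In the worker's main method this would be wired as `ExitOnTermination.install(terminationFuture, System::exit)`. Since `System.exit` terminates the JVM regardless of any remaining non-daemon threads, a leaked thread can no longer keep the process alive. Note that `System.exit` still runs shutdown hooks; if a hook itself can block, `Runtime.getRuntime().halt(...)` would be the stricter option.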

Aitozi, Jul 24 '22 10:07

cc @wsry

Aitozi, Jul 24 '22 10:07