incubator-uniffle
[Bug] ShuffleTaskManager.commitShuffle will get stuck forever if an exception occurs during the flush process
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Search before asking
- [X] I have searched in the issues and found no similar issues.
Describe the bug
Affects Version(s)
master
Uniffle Server Log Output
jstack:
```
"Grpc-1788" #2073 daemon prio=5 os_prio=0 cpu=1723.11ms elapsed=88729.16s tid=0x00007f3d3c0f1000 nid=0x968 waiting for monitor entry [0x00007f3cf97fe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.uniffle.server.ShuffleTaskManager.commitShuffle(ShuffleTaskManager.java:338)
	- waiting to lock <0x00007f4fbf708e00> (a java.lang.Object)
	at org.apache.uniffle.server.ShuffleServerGrpcService.finishShuffle(ShuffleServerGrpcService.java:468)
	at org.apache.uniffle.proto.ShuffleServerGrpc$MethodHandlers.invoke(ShuffleServerGrpc.java:1060)
	at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
	at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
	at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:356)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

"Grpc-1359" #1629 daemon prio=5 os_prio=0 cpu=5536.44ms elapsed=88733.96s tid=0x00007f4380185800 nid=0x7ac waiting on condition [0x00007f41156fe000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.uniffle.server.ShuffleTaskManager.commitShuffle(ShuffleTaskManager.java:360)
	- locked <0x00007f4fbf708e00> (a java.lang.Object)
	at org.apache.uniffle.server.ShuffleServerGrpcService.finishShuffle(ShuffleServerGrpcService.java:468)
	at org.apache.uniffle.proto.ShuffleServerGrpc$MethodHandlers.invoke(ShuffleServerGrpc.java:1060)
	at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
	at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
	at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:356)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```
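The jstack above shows one Grpc thread sleeping inside `commitShuffle` while holding the per-shuffle commit lock (`0x00007f4fbf708e00`), and another blocked waiting for that same lock. The following is a minimal, hypothetical sketch of the hang pattern (the names, timing, and timeout handling are assumptions for illustration, not the actual Uniffle code): if a flush event fails permanently, the committed-block count never reaches the expected count, and a sleep loop without a deadline spins forever while holding the lock.

```java
import java.util.function.IntSupplier;

// Hypothetical simplification of the commitShuffle wait loop.
// A deadline turns the "stuck forever" case into a detectable failure.
public class CommitWaitSketch {
    public static boolean waitForCommit(IntSupplier committedBlocks,
                                        int expectedBlocks,
                                        long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (committedBlocks.getAsInt() < expectedBlocks) {
            // If a flush event failed, committedBlocks never reaches
            // expectedBlocks; without this deadline the loop would spin
            // forever while holding the per-shuffle commit lock, blocking
            // every other finishShuffle call for the same shuffle.
            if (System.currentTimeMillis() > deadline) {
                return false;
            }
            Thread.sleep(10);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a flush failure: only 2 of 3 expected blocks ever commit.
        boolean ok = waitForCommit(() -> 2, 3, 100);
        System.out.println(ok ? "committed" : "timed out");
    }
}
```

With a bounded wait, the caller can fail the commit and release the lock instead of blocking other Grpc threads indefinitely.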
exception log:
```
[2024-07-03 08:54:32.973] [HadoopFlushEventThreadPool-1] [WARN] SingleStorageManager.write - Exception happened when write data for ShuffleDataFlushEvent: eventId=252896, appId=application_1716779728283_6825960_1719966578466, shuffleId=0, startPartition=315, endPartition=315, retryTimes=0, underStorage=HadoopStorage, isPended=false, ownedByHugePartition=false, try again
org.apache.uniffle.common.exception.RssException: java.io.IOException: All datanodes [DatanodeInfoWithStorage[127.0.0.1:9003,DS-3ad04d12-7d78-405f-ba33-d2bb706f073d,DISK]] are bad. Aborting...
	at org.apache.uniffle.storage.handler.impl.HadoopShuffleWriteHandler.write(HadoopShuffleWriteHandler.java:157)
	at org.apache.uniffle.storage.handler.impl.PooledHadoopShuffleWriteHandler.write(PooledHadoopShuffleWriteHandler.java:122)
	at org.apache.uniffle.server.storage.SingleStorageManager.write(SingleStorageManager.java:59)
	at org.apache.uniffle.server.storage.HybridStorageManager.write(HybridStorageManager.java:130)
	at org.apache.uniffle.server.ShuffleFlushManager.processFlushEvent(ShuffleFlushManager.java:165)
	at org.apache.uniffle.server.DefaultFlushEventHandler.handleEventAndUpdateMetrics(DefaultFlushEventHandler.java:97)
	at org.apache.uniffle.server.DefaultFlushEventHandler.lambda$dispatchEvent$0(DefaultFlushEventHandler.java:219)
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: All datanodes [DatanodeInfoWithStorage[127.0.0.1:9003,DS-3ad04d12-7d78-405f-ba33-d2bb706f073d,DISK]] are bad. Aborting...
	at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1567)
	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1501)
	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1487)
	at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1262)
	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:673)
```
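The log shows the flush write failing terminally ("All datanodes ... are bad") and being retried ("try again"). The sketch below is a hypothetical illustration (the method names, `maxRetries` parameter, and failure flag are assumptions, not the actual Uniffle API): when a write keeps failing, bounding the retries and recording a terminal failure gives `commitShuffle` something to observe, instead of an event that never commits and a waiter that never wakes up.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical bounded-retry wrapper around a flush write, showing how a
// terminal failure can be surfaced to commit waiters.
public class FlushRetrySketch {
    public static boolean writeWithRetry(Runnable write, int maxRetries, AtomicBoolean shuffleFailed) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                write.run();
                return true; // write succeeded; the committed-block count can advance
            } catch (RuntimeException e) {
                // e.g. RssException wrapping "All datanodes ... are bad. Aborting..."
                // fall through and retry until the budget is exhausted
            }
        }
        // Mark the shuffle as failed so a waiter in commitShuffle can abort
        // with an error instead of sleeping forever on a count that will
        // never reach the expected number of blocks.
        shuffleFailed.set(true);
        return false;
    }

    public static void main(String[] args) {
        AtomicBoolean failed = new AtomicBoolean(false);
        boolean ok = writeWithRetry(() -> {
            throw new RuntimeException("All datanodes are bad");
        }, 2, failed);
        System.out.println(ok + " " + failed.get());
    }
}
```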
Uniffle Engine Log Output
No response
Uniffle Server Configurations
No response
Uniffle Engine Configurations
No response
Additional context
No response
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Hi @rickyma, I'm willing to contribute to this. May I raise a PR, if that's okay with you? Thanks!
Sure. I'll assign this to you. @sahibamatta
Hi @rickyma, I've raised a PR for it based on my understanding of the issue 😅. For now, it only handles the exception thrown from the write method, as shown in the screenshot above. Please let me know whether we need to handle other parts of the processFlushEvent method as well. Also, feel free to point out any gaps in my understanding.
Thanks!