[Bug] ShuffleTaskManager.commitShuffle will get stuck forever if an exception occurs during the flush process

Open rickyma opened this issue 1 year ago • 3 comments

Code of Conduct

Search before asking

  • [X] I have searched in the issues and found no similar issues.

Describe the bug

[Screenshot attached to the original issue; image not available here]

Affects Version(s)

master

Uniffle Server Log Output

jstack:

"Grpc-1788" #2073 daemon prio=5 os_prio=0 cpu=1723.11ms elapsed=88729.16s tid=0x00007f3d3c0f1000 nid=0x968 waiting for monitor entry [0x00007f3cf97fe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.uniffle.server.ShuffleTaskManager.commitShuffle(ShuffleTaskManager.java:338)
        - waiting to lock <0x00007f4fbf708e00> (a java.lang.Object)
        at org.apache.uniffle.server.ShuffleServerGrpcService.finishShuffle(ShuffleServerGrpcService.java:468)
        at org.apache.uniffle.proto.ShuffleServerGrpc$MethodHandlers.invoke(ShuffleServerGrpc.java:1060)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:356)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

"Grpc-1359" #1629 daemon prio=5 os_prio=0 cpu=5536.44ms elapsed=88733.96s tid=0x00007f4380185800 nid=0x7ac waiting on condition [0x00007f41156fe000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.uniffle.server.ShuffleTaskManager.commitShuffle(ShuffleTaskManager.java:360)
        - locked <0x00007f4fbf708e00> (a java.lang.Object)
        at org.apache.uniffle.server.ShuffleServerGrpcService.finishShuffle(ShuffleServerGrpcService.java:468)
        at org.apache.uniffle.proto.ShuffleServerGrpc$MethodHandlers.invoke(ShuffleServerGrpc.java:1060)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:356)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

exception log:

[2024-07-03 08:54:32.973] [HadoopFlushEventThreadPool-1] [WARN] SingleStorageManager.write - Exception happened when write data for ShuffleDataFlushEvent: eventId=252896, appId=application_1716779728283_6825960_1719966578466, shuffleId=0, startPartition=315, endPartition=315, retryTimes=0, underStorage=HadoopStorage, isPended=false, ownedByHugePartition=false, try again
org.apache.uniffle.common.exception.RssException: java.io.IOException: All datanodes [DatanodeInfoWithStorage[127.0.0.1:9003,DS-3ad04d12-7d78-405f-ba33-d2bb706f073d,DISK]] are bad. Aborting...
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleWriteHandler.write(HadoopShuffleWriteHandler.java:157)
        at org.apache.uniffle.storage.handler.impl.PooledHadoopShuffleWriteHandler.write(PooledHadoopShuffleWriteHandler.java:122)
        at org.apache.uniffle.server.storage.SingleStorageManager.write(SingleStorageManager.java:59)
        at org.apache.uniffle.server.storage.HybridStorageManager.write(HybridStorageManager.java:130)
        at org.apache.uniffle.server.ShuffleFlushManager.processFlushEvent(ShuffleFlushManager.java:165)
        at org.apache.uniffle.server.DefaultFlushEventHandler.handleEventAndUpdateMetrics(DefaultFlushEventHandler.java:97)
        at org.apache.uniffle.server.DefaultFlushEventHandler.lambda$dispatchEvent$0(DefaultFlushEventHandler.java:219)
        at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: All datanodes [DatanodeInfoWithStorage[127.0.0.1:9003,DS-3ad04d12-7d78-405f-ba33-d2bb706f073d,DISK]] are bad. Aborting...
        at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1567)
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1501)
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1487)
        at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1262)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:673)
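
Read together, the jstack and the exception log suggest a pattern: one handler thread sleeps in a polling loop inside commitShuffle while holding a monitor, a second thread is blocked on that same monitor, and the flush they are waiting on has failed. The block below is a minimal sketch of a polling wait that also exits when a flush failure is recorded or a deadline passes. It is not Uniffle's implementation: the class CommitWaitSketch, its fields, and the methods onFlushSuccess, onFlushFailure, and commit are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these names come from the Uniffle code base.
final class CommitWaitSketch {
    private final Object commitLock = new Object();
    // Number of blocks the flush side has reported as written.
    private final AtomicInteger flushedBlocks = new AtomicInteger();
    // Set by the flush side when a write fails and will not be completed.
    private final AtomicBoolean flushFailed = new AtomicBoolean(false);

    public void onFlushSuccess(int blocks) {
        flushedBlocks.addAndGet(blocks);
    }

    public void onFlushFailure() {
        flushFailed.set(true);
    }

    // Waits until expectedBlocks have been flushed. Throws
    // IllegalStateException on a recorded flush failure, on timeout,
    // or on interruption, so the caller never waits indefinitely.
    public void commit(int expectedBlocks, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        synchronized (commitLock) {
            while (flushedBlocks.get() < expectedBlocks) {
                if (flushFailed.get()) {
                    throw new IllegalStateException("flush failed, aborting commit");
                }
                if (System.nanoTime() - deadline >= 0) {
                    throw new IllegalStateException("commit timed out");
                }
                try {
                    // Sleeping while holding the lock mirrors the stack trace above;
                    // other committers stay blocked until this loop exits.
                    Thread.sleep(pollMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for flush", e);
                }
            }
        }
    }
}
```

In this sketch, a caller that has seen onFlushFailure() gets an exception from commit(...) on the next poll instead of sleeping until the block count is reached. The failure flag and the deadline are independent exit conditions; either one alone would also release the thread blocked on the monitor.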

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

rickyma avatar Jul 04 '24 03:07 rickyma

Hi @rickyma, I'm willing to contribute to this. I can raise a PR if that's OK with you. Thanks!

sahibamatta avatar Jul 04 '24 20:07 sahibamatta

Sure. I'll assign this to you. @sahibamatta

rickyma avatar Jul 05 '24 01:07 rickyma

Hi @rickyma, I've raised a PR for this based on my understanding of the issue 😅. For now, it only handles the exception thrown from the write method, as shown in the screenshot above. Please let me know if we need to handle other parts of the processFlushEvent method as well. Also, feel free to point out any gaps in my understanding. Thanks!

sahibamatta avatar Jul 05 '24 22:07 sahibamatta
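
For readers following the thread, one way to picture the approach described in the last comment (handling an exception thrown from the write call inside processFlushEvent) is sketched below. This is not the code from the PR and not Uniffle's API: the class FlushHandlingSketch, the Writer interface, the Result enum, and the failedEvents counter are hypothetical, and only the method name processFlushEvent is borrowed from the discussion.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not taken from the Uniffle code base or the PR.
final class FlushHandlingSketch {
    // Stand-in for a storage write that may throw, e.g. an IOException from HDFS.
    interface Writer {
        boolean write(long eventId) throws Exception;
    }

    enum Result { WRITTEN, RETRY, FAILED }

    // Count of events whose write threw; a committer could consult this.
    private final AtomicInteger failedEvents = new AtomicInteger();

    public Result processFlushEvent(long eventId, Writer writer) {
        try {
            // A false return is treated as "not written yet, try again".
            return writer.write(eventId) ? Result.WRITTEN : Result.RETRY;
        } catch (Exception e) {
            // Record the failure instead of letting it disappear, so that
            // whoever waits on this event can stop waiting.
            failedEvents.incrementAndGet();
            return Result.FAILED;
        }
    }

    public int failedEvents() {
        return failedEvents.get();
    }
}
```

The design choice here is to turn a thrown exception into an explicit FAILED outcome that is visible outside the flush thread. Whether a failed event should be retried, dropped, or surfaced to the client is a separate decision that this sketch does not make.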