
HDDS-10749. Shutdown datanode when RatisServer is down

Open ChenSammi opened this issue 10 months ago • 10 comments

What changes were proposed in this pull request?

Currently, when RatisServer goes down (mainly because a long GC pause exceeds the Ratis close threshold), the Datanode keeps running and stays in the HEALTHY and IN_SERVICE state, which is confusing.

This task shuts down the Datanode after RatisServer goes down.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-10749

How was this patch tested?

Manual test

ChenSammi avatar Apr 25 '24 04:04 ChenSammi

Log of a normal DN shutdown. XceiverServerRatis is stopped first ("Stopping XceiverServerRatis 01effdc6-dad1-4bf3-916a-749d9aa7e5e5"), then ContainerStateMachine is stopped ("Stopping ContainerStateMachine for group-5EA60976374E").

2024-04-24 17:53:21,589 ERROR ozone.HddsDatanodeService (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 2: SIGINT
2024-04-24 17:53:21,590 INFO  ozone.HddsDatanodeService (StringUtils.java:lambda$startupShutdownMessage$0(144)) - SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down HddsDatanodeService at SAMMICHEN-MB0/0.0.0.0
************************************************************/
2024-04-24 17:53:21,595 INFO  ozoneimpl.OzoneContainer (OzoneContainer.java:stop(482)) - Attempting to stop container services.
2024-04-24 17:53:21,595 WARN  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:handleRemainingSleep(134)) - Background container scan was interrupted.
2024-04-24 17:53:21,595 INFO  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:run(61)) - Thread[ContainerMetadataScanner,5,main] exiting.
2024-04-24 17:53:21,595 INFO  ozoneimpl.BackgroundContainerDataScanner (BackgroundContainerDataScanner.java:shutdown(141)) - ContainerDataScanner(/tmp/datanode1/storage/hdds) is shutting down. 
2024-04-24 17:53:21,595 WARN  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:handleRemainingSleep(134)) - Background container scan was interrupted.
2024-04-24 17:53:21,596 INFO  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:run(61)) - ContainerDataScanner(/tmp/datanode1/storage/hdds, DS-af727dc0-66f9-4db9-8f1f-8ce487a40766) exiting.
2024-04-24 17:53:21,596 INFO  ozoneimpl.OnDemandContainerDataScanner (OnDemandContainerDataScanner.java:shutdownScanner(206)) - On-demand container scanner is shutting down.
2024-04-24 17:53:21,606 INFO  ratis.XceiverServerRatis (XceiverServerRatis.java:stop(604)) - Stopping XceiverServerRatis 01effdc6-dad1-4bf3-916a-749d9aa7e5e5
2024-04-24 17:53:21,606 INFO  server.RaftServer (RaftServerProxy.java:lambda$close$9(416)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: close
2024-04-24 17:53:21,607 INFO  server.RaftServer$Division (RaftServerImpl.java:lambda$close$3(526)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E: shutdown
2024-04-24 17:53:21,607 INFO  server.GrpcService (GrpcService.java:closeImpl(311)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcClientProtocolService now
2024-04-24 17:53:21,607 INFO  util.JmxRegister (JmxRegister.java:unregister(73)) - Successfully un-registered JMX Bean with object name Ratis:service=RaftServer,group=group-5EA60976374E,id=01effdc6-dad1-4bf3-916a-749d9aa7e5e5
2024-04-24 17:53:21,607 INFO  impl.RoleInfo (RoleInfo.java:shutdownLeaderState(94)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-LeaderStateImpl
2024-04-24 17:53:21,610 INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcClientProtocolService successfully
2024-04-24 17:53:21,610 INFO  server.GrpcService (GrpcService.java:closeImpl(311)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server GrpcServerProtocolService now
2024-04-24 17:53:21,611 INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server GrpcServerProtocolService successfully
2024-04-24 17:53:21,611 INFO  server.GrpcService (GrpcService.java:closeImpl(311)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcAdminProtocolService now
2024-04-24 17:53:21,614 INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcAdminProtocolService successfully
2024-04-24 17:53:21,614 INFO  impl.PendingRequests (PendingRequests.java:sendNotLeaderResponses(289)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-PendingRequests: sendNotLeaderResponses
2024-04-24 17:53:21,620 INFO  impl.StateMachineUpdater (StateMachineUpdater.java:stopAndJoin(157)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-StateMachineUpdater: set stopIndex = 2
2024-04-24 17:53:21,620 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:takeSnapshot(359)) - group-5EA60976374E: Taking a snapshot at:(t:2, i:2) file /tmp/datanode1/ratis/e9e7ba3c-7686-4b3a-96fd-5ea60976374e/sm/snapshot.2_2
2024-04-24 17:53:21,621 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:takeSnapshot(370)) - group-5EA60976374E: Finished taking a snapshot at:(t:2, i:2) file:/tmp/datanode1/ratis/e9e7ba3c-7686-4b3a-96fd-5ea60976374e/sm/snapshot.2_2 took: 1 ms
2024-04-24 17:53:21,622 INFO  impl.StateMachineUpdater (StateMachineUpdater.java:takeSnapshot(295)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-StateMachineUpdater: Took a snapshot at index 2
2024-04-24 17:53:21,622 INFO  impl.StateMachineUpdater (StateMachineUpdater.java:lambda$new$0(98)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-StateMachineUpdater: snapshotIndex: updateIncreasingly 0 -> 2
2024-04-24 17:53:21,623 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:close(1150)) - Stopping ContainerStateMachine for group-5EA60976374E.
2024-04-24 17:53:21,623 INFO  server.RaftServer$Division (ServerState.java:close(427)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E: applyIndex: 2
2024-04-24 17:53:21,623 INFO  util.AwaitToRun (AwaitToRun.java:run(49)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-cacheEviction-AwaitToRun-AwaitForSignal is interrupted
2024-04-24 17:53:21,695 INFO  segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(245)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-SegmentedRaftLogWorker close()
2024-04-24 17:53:21,697 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(152)) - JvmPauseMonitor-01effdc6-dad1-4bf3-916a-749d9aa7e5e5: Stopped
2024-04-24 17:53:23,783 INFO  volume.HddsVolume (HddsVolume.java:closeDbStore(470)) - SchemaV3 db is stopped at /tmp/datanode1/storage/hdds/CID-9ba4109c-68b1-4311-9623-42f82149fb80/DS-af727dc0-66f9-4db9-8f1f-8ce487a40766/container.db for volume DS-af727dc0-66f9-4db9-8f1f-8ce487a40766
2024-04-24 17:53:23,783 INFO  utils.BackgroundService (BackgroundService.java:shutdown(160)) - Shutting down service BlockDeletingService
2024-04-24 17:53:23,784 INFO  utils.BackgroundService (BackgroundService.java:shutdown(160)) - Shutting down service StaleRecoveringContainerScrubbingService
2024-04-24 17:53:23,785 INFO  statemachine.DatanodeStateMachine (DatanodeStateMachine.java:stopDaemon(640)) - Ozone container server stopped.
2024-04-24 17:53:23,790 INFO  handler.ContextHandler (ContextHandler.java:doStop(1159)) - Stopped o.e.j.w.WebAppContext@3baf6936{hddsDatanode,/,null,STOPPED}{file:/Users/sammi/workspace/hadoop-ozone/hadoop-hdds/container-service/target/classes/webapps/hddsDatanode}
2024-04-24 17:53:23,794 INFO  server.AbstractConnector (AbstractConnector.java:doStop(383)) - Stopped ServerConnector@4f453e63{HTTP/1.1, (http/1.1)}{SAMMICHEN-MB0:9882}
2024-04-24 17:53:23,794 INFO  server.session (HouseKeeper.java:stopScavenging(149)) - node0 Stopped scavenging
2024-04-24 17:53:23,794 INFO  handler.ContextHandler (ContextHandler.java:doStop(1159)) - Stopped o.e.j.s.ServletContextHandler@1816e24a{static,/static,file:///Users/sammi/workspace/hadoop-ozone/hadoop-hdds/container-service/target/classes/webapps/static,STOPPED}
2024-04-24 17:53:23,795 INFO  ozone.HddsDatanodeClientProtocolServer (HddsDatanodeClientProtocolServer.java:stop(83)) - Stopping the RPC server for Client Protocol
2024-04-24 17:53:23,795 INFO  ipc.Server (Server.java:stop(3523)) - Stopping server on 19864
2024-04-24 17:53:23,796 INFO  ipc.Server (Server.java:run(1434)) - Stopping IPC Server listener on 19864
2024-04-24 17:53:23,796 INFO  ipc.Server (Server.java:run(1567)) - Stopping IPC Server Responder

ChenSammi avatar Apr 25 '24 04:04 ChenSammi

Log of a DN shutdown triggered by the Ratis server shutting down. ContainerStateMachine is closed first ("Container statemachine is closed by ratis, terminating HddsDatanodeService"), then XceiverServerRatis is stopped ("Stopping XceiverServerRatis 01effdc6-dad1-4bf3-916a-749d9aa7e5e5").

2024-04-24 18:06:16,666 WARN  util.JvmPauseMonitor (JvmPauseMonitor.java:detectPause(168)) - JvmPauseMonitor-01effdc6-dad1-4bf3-916a-749d9aa7e5e5: Detected pause in JVM or host machine approximately 93.265s without any GCs.
2024-04-24 18:06:16,666 ERROR server.RaftServer (RaftServerProxy.java:handleJvmPause(237)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: JVM pause detected 93.265s longer than the close-threshold 60s, shutting down ...
2024-04-24 18:06:16,678 INFO  server.RaftServer (RaftServerProxy.java:lambda$close$9(416)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: close
2024-04-24 18:06:16,684 INFO  server.RaftServer$Division (RaftServerImpl.java:lambda$close$3(526)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E: shutdown
2024-04-24 18:06:16,685 INFO  server.GrpcService (GrpcService.java:closeImpl(311)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcClientProtocolService now
2024-04-24 18:06:16,690 INFO  util.JmxRegister (JmxRegister.java:unregister(73)) - Successfully un-registered JMX Bean with object name Ratis:service=RaftServer,group=group-5EA60976374E,id=01effdc6-dad1-4bf3-916a-749d9aa7e5e5
2024-04-24 18:06:16,691 INFO  impl.RoleInfo (RoleInfo.java:shutdownLeaderState(94)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-LeaderStateImpl
2024-04-24 18:06:16,724 INFO  impl.PendingRequests (PendingRequests.java:sendNotLeaderResponses(289)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-PendingRequests: sendNotLeaderResponses
2024-04-24 18:06:16,727 INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcClientProtocolService successfully
2024-04-24 18:06:16,727 INFO  server.GrpcService (GrpcService.java:closeImpl(311)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server GrpcServerProtocolService now
2024-04-24 18:06:16,728 INFO  impl.StateMachineUpdater (StateMachineUpdater.java:stopAndJoin(157)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-StateMachineUpdater: set stopIndex = 4
2024-04-24 18:06:16,729 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:takeSnapshot(359)) - group-5EA60976374E: Taking a snapshot at:(t:3, i:4) file /tmp/datanode1/ratis/e9e7ba3c-7686-4b3a-96fd-5ea60976374e/sm/snapshot.3_4
2024-04-24 18:06:16,729 INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server GrpcServerProtocolService successfully
2024-04-24 18:06:16,729 INFO  server.GrpcService (GrpcService.java:closeImpl(311)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcAdminProtocolService now
2024-04-24 18:06:16,732 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:takeSnapshot(370)) - group-5EA60976374E: Finished taking a snapshot at:(t:3, i:4) file:/tmp/datanode1/ratis/e9e7ba3c-7686-4b3a-96fd-5ea60976374e/sm/snapshot.3_4 took: 4 ms
2024-04-24 18:06:16,733 INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcAdminProtocolService successfully
2024-04-24 18:06:16,734 INFO  impl.StateMachineUpdater (StateMachineUpdater.java:takeSnapshot(295)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-StateMachineUpdater: Took a snapshot at index 4
2024-04-24 18:06:16,734 INFO  impl.StateMachineUpdater (StateMachineUpdater.java:lambda$new$0(98)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-StateMachineUpdater: snapshotIndex: updateIncreasingly 2 -> 4
2024-04-24 18:06:16,740 ERROR ratis.ContainerStateMachine (ContainerStateMachine.java:close(1142)) - Container statemachine is closed by ratis, terminating HddsDatanodeService
2024-04-24 18:06:26,754 INFO  ozoneimpl.OzoneContainer (OzoneContainer.java:stop(482)) - Attempting to stop container services.
2024-04-24 18:06:26,754 WARN  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:handleRemainingSleep(134)) - Background container scan was interrupted.
2024-04-24 18:06:26,754 INFO  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:run(61)) - Thread[ContainerMetadataScanner,5,main] exiting.
2024-04-24 18:06:26,755 INFO  ozoneimpl.BackgroundContainerDataScanner (BackgroundContainerDataScanner.java:shutdown(141)) - ContainerDataScanner(/tmp/datanode1/storage/hdds) is shutting down. 
2024-04-24 18:06:26,755 WARN  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:handleRemainingSleep(134)) - Background container scan was interrupted.
2024-04-24 18:06:26,755 INFO  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:run(61)) - ContainerDataScanner(/tmp/datanode1/storage/hdds, DS-af727dc0-66f9-4db9-8f1f-8ce487a40766) exiting.
2024-04-24 18:06:26,755 INFO  ozoneimpl.OnDemandContainerDataScanner (OnDemandContainerDataScanner.java:shutdownScanner(206)) - On-demand container scanner is shutting down.
2024-04-24 18:06:26,756 INFO  ratis.XceiverServerRatis (XceiverServerRatis.java:stop(604)) - Stopping XceiverServerRatis 01effdc6-dad1-4bf3-916a-749d9aa7e5e5
2024-04-24 18:06:26,757 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(152)) - JvmPauseMonitor-01effdc6-dad1-4bf3-916a-749d9aa7e5e5: Stopped
2024-04-24 18:06:28,892 INFO  volume.HddsVolume (HddsVolume.java:closeDbStore(470)) - SchemaV3 db is stopped at /tmp/datanode1/storage/hdds/CID-9ba4109c-68b1-4311-9623-42f82149fb80/DS-af727dc0-66f9-4db9-8f1f-8ce487a40766/container.db for volume DS-af727dc0-66f9-4db9-8f1f-8ce487a40766
2024-04-24 18:06:28,893 INFO  utils.BackgroundService (BackgroundService.java:shutdown(160)) - Shutting down service BlockDeletingService
2024-04-24 18:06:28,893 INFO  utils.BackgroundService (BackgroundService.java:shutdown(160)) - Shutting down service StaleRecoveringContainerScrubbingService
2024-04-24 18:06:28,894 INFO  statemachine.DatanodeStateMachine (DatanodeStateMachine.java:stopDaemon(640)) - Ozone container server stopped.
2024-04-24 18:06:28,899 INFO  handler.ContextHandler (ContextHandler.java:doStop(1159)) - Stopped o.e.j.w.WebAppContext@5fbdc49b{hddsDatanode,/,null,STOPPED}{file:/Users/sammi/workspace/hadoop-ozone/hadoop-hdds/container-service/target/classes/webapps/hddsDatanode}
2024-04-24 18:06:28,903 INFO  server.AbstractConnector (AbstractConnector.java:doStop(383)) - Stopped ServerConnector@7fc7c4a{HTTP/1.1, (http/1.1)}{SAMMICHEN-MB0:9882}
2024-04-24 18:06:28,903 INFO  server.session (HouseKeeper.java:stopScavenging(149)) - node0 Stopped scavenging
2024-04-24 18:06:28,903 INFO  handler.ContextHandler (ContextHandler.java:doStop(1159)) - Stopped o.e.j.s.ServletContextHandler@76c387f9{static,/static,file:///Users/sammi/workspace/hadoop-ozone/hadoop-hdds/container-service/target/classes/webapps/static,STOPPED}
2024-04-24 18:06:28,904 INFO  ozone.HddsDatanodeClientProtocolServer (HddsDatanodeClientProtocolServer.java:stop(83)) - Stopping the RPC server for Client Protocol
2024-04-24 18:06:28,905 INFO  ipc.Server (Server.java:stop(3523)) - Stopping server on 19864
2024-04-24 18:06:28,905 INFO  ipc.Server (Server.java:run(1434)) - Stopping IPC Server listener on 19864
2024-04-24 18:06:28,905 INFO  ipc.Server (Server.java:run(1567)) - Stopping IPC Server Responder
2024-04-24 18:06:28,908 INFO  util.ExitUtil (ExitUtil.java:terminate(241)) - Exiting with status 1: ExitException
2024-04-24 18:06:28,909 INFO  ozone.HddsDatanodeService (StringUtils.java:lambda$startupShutdownMessage$0(144)) - SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down HddsDatanodeService at SAMMICHEN-MB0/0.0.0.0
************************************************************/

Process finished with exit code 1
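
The pause check reported at the top of this log can be illustrated with a minimal Java sketch. It assumes the monitor sleeps for a fixed interval and treats any extra elapsed time as a pause; `PauseCheck`, `extraSleep`, and `exceedsCloseThreshold` are hypothetical names for illustration, not Ratis APIs. The 60s close threshold is the value shown in the log.

```java
import java.time.Duration;

/** Hypothetical sketch of sleep-drift pause detection; not the Ratis JvmPauseMonitor source. */
public class PauseCheck {

    /** Time that elapsed beyond the intended sleep, i.e. the apparent pause (never negative). */
    static Duration extraSleep(long startNanos, long endNanos, Duration intendedSleep) {
        Duration elapsed = Duration.ofNanos(endNanos - startNanos);
        Duration extra = elapsed.minus(intendedSleep);
        return extra.isNegative() ? Duration.ZERO : extra;
    }

    /** True when the pause is strictly longer than the close threshold. */
    static boolean exceedsCloseThreshold(Duration pause, Duration closeThreshold) {
        return pause.compareTo(closeThreshold) > 0;
    }

    public static void main(String[] args) throws InterruptedException {
        Duration sleep = Duration.ofMillis(500);        // assumed monitor interval
        Duration threshold = Duration.ofSeconds(60);    // close-threshold from the log
        long start = System.nanoTime();
        Thread.sleep(sleep.toMillis());
        Duration pause = extraSleep(start, System.nanoTime(), sleep);
        System.out.println("pause=" + pause
            + " shutdown=" + exceedsCloseThreshold(pause, threshold));
    }
}
```

With a pause of 93.265s and a 60s threshold, as in the log, the check would report that the server should shut down.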

ChenSammi avatar Apr 25 '24 04:04 ChenSammi

@adoroszlai, I noticed the impact on the integration tests too. It looks like terminating the DN from ContainerStateMachine is not a good idea. Let me think about whether there are other solutions.

ChenSammi avatar Apr 30 '24 04:04 ChenSammi

Waiting for a RATIS release that includes https://issues.apache.org/jira/browse/RATIS-2066.

ChenSammi avatar May 16 '24 04:05 ChenSammi

I was made aware that for OM if Ratis server experiences a long pause, Ratis state machine crashes itself and that shuts down OM: https://issues.apache.org/jira/browse/HDDS-6141

jojochuang avatar Jun 18 '24 17:06 jojochuang

I was made aware that for OM if Ratis server experiences a long pause, Ratis state machine crashes itself and that shuts down OM: https://issues.apache.org/jira/browse/HDDS-6141

Both OM and SCM shut themselves down after a long pause.

ChenSammi avatar Jun 28 '24 07:06 ChenSammi

Manually stopping the datanode; related datanode log:

2024-07-01 17:34:54,237 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:notifyServerShutdown(936)) - group-2D6AB2E224A3 is closed by HddsDatanodeService
2024-07-01 17:34:54,774 INFO  segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - 9c367fb6-68b0-487d-bb10-3e8c0da9b148@group-6CC213E8C815-SegmentedRaftLogWorker close()
2024-07-01 17:34:54,775 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:notifyServerShutdown(936)) - group-6CC213E8C815 is closed by HddsDatanodeService
2024-07-01 17:34:54,805 INFO  segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - 9c367fb6-68b0-487d-bb10-3e8c0da9b148@group-86A881EBB3A5-SegmentedRaftLogWorker close()
2024-07-01 17:34:54,812 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:notifyServerShutdown(936)) - group-86A881EBB3A5 is closed by HddsDatanodeService

Manually pausing the DN process and then resuming it:

2024-07-01 17:56:07,572 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:notifyServerShutdown(896)) - group-60029B7F6B87 is closed by ratis
2024-07-01 17:56:07,585 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:notifyServerShutdown(896)) - group-86A881EBB3A5 is closed by ratis
2024-07-01 17:56:07,586 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:notifyServerShutdown(896)) - group-6CC213E8C815 is closed by ratis
2024-07-01 17:56:12,580 ERROR ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$notifyServerShutdown$9(916)) - Container statemachine is closed by ratis, terminating HddsDatanodeService. closed(3)/total(3)
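
The two logs above distinguish groups closed by HddsDatanodeService from groups closed by Ratis, and the second ends with a `closed(3)/total(3)` count before termination. A minimal Java sketch of that bookkeeping follows. It is not the code in this PR: `GroupCloseTracker` and all of its methods are hypothetical, and it terminates immediately, whereas the log shows a roughly five second gap before the termination message.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical tracker for illustration; these are not Ozone's actual classes. */
public class GroupCloseTracker {
    private final Set<String> allGroups = ConcurrentHashMap.newKeySet();
    private final Set<String> closedByRatis = ConcurrentHashMap.newKeySet();
    private final AtomicBoolean terminated = new AtomicBoolean(false);
    private volatile boolean datanodeStopping = false;
    private final Runnable terminate;

    GroupCloseTracker(Runnable terminate) {
        this.terminate = terminate;
    }

    void addGroup(String groupId) {
        allGroups.add(groupId);
    }

    /** Called when the datanode itself starts a normal shutdown. */
    void markDatanodeStopping() {
        datanodeStopping = true;
    }

    /** Records a group shutdown and returns who initiated it. */
    String onGroupClosed(String groupId) {
        if (datanodeStopping) {
            // Normal shutdown: the datanode is already stopping, nothing to trigger.
            return "HddsDatanodeService";
        }
        closedByRatis.add(groupId);
        // Terminate at most once, and only after every known group is closed by Ratis.
        if (closedByRatis.containsAll(allGroups)
            && terminated.compareAndSet(false, true)) {
            terminate.run();
        }
        return "ratis";
    }

    /** Formats the counts in the same shape as the log message. */
    String summary() {
        return "closed(" + closedByRatis.size() + ")/total(" + allGroups.size() + ")";
    }
}
```

In a real datanode the `terminate` callback would be whatever stops the service; here it is left as a plain `Runnable` so the sketch stays self-contained.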

ChenSammi avatar Jul 01 '24 10:07 ChenSammi

All three failed misc acceptance runs are due to

failed to solve: process "/bin/sh -c sudo yum install -y openssh-clients openssh-server" did not complete successfully: exit code: 1

The current logs do not show why it failed. @adoroszlai, do you have any idea about this issue?

ChenSammi avatar Jul 04 '24 08:07 ChenSammi

Looks like the problem is

 > [om  2/15] RUN sudo yum install -y openssh-clients openssh-server:                                                                                                                                                                                      
#0 0.519 Loaded plugins: fastestmirror, ovl                                                                                                                                                                                                                
#0 0.783 Determining fastest mirrors                                                                                                                                                                                                                       
#0 1.328 Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=7&arch=aarch64&repo=os&infra=container error was
#0 1.328 14: curl#6 - "Could not resolve host: mirrorlist.centos.org; Unknown error"
#0 1.338 
#0 1.338 
#0 1.338  One of the configured repositories failed (Unknown),
#0 1.338  and yum doesn't have enough cached data to continue. At this point the only
#0 1.338  safe thing yum can do is fail. There are a few ways to work "fix" this:
#0 1.338 
#0 1.338      1. Contact the upstream for the repository and get them to fix the problem.
#0 1.338 
#0 1.338      2. Reconfigure the baseurl/etc. for the repository, to point to a working
#0 1.338         upstream. This is most often useful if you are using a newer
#0 1.338         distribution release than is supported by the repository (and the
#0 1.338         packages for the previous distribution release still work).
#0 1.338 
#0 1.338      3. Run the command with the repository temporarily disabled
#0 1.338             yum --disablerepo=<repoid> ...
#0 1.338 
#0 1.338      4. Disable the repository permanently, so yum won't use it by default. Yum
#0 1.338         will then just ignore the repository until you permanently enable it
#0 1.338         again or use --enablerepo for temporary usage:
#0 1.338 
#0 1.338             yum-config-manager --disable <repoid>
#0 1.338         or
#0 1.338             subscription-manager repos --disable=<repoid>
#0 1.338 
#0 1.338      5. Configure the failing repository to be skipped, if it is unavailable.
#0 1.338         Note that yum will try to contact the repo. when it runs most commands,
#0 1.338         so will have to try and fail each time (and thus. yum will be be much
#0 1.338         slower). If it is a very temporary problem though, this is often a nice
#0 1.338         compromise:
#0 1.338 
#0 1.338             yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
#0 1.338 
#0 1.338 Cannot find a valid baseurl for repo: base/7/aarch64
------
failed to solve: process "/bin/sh -c sudo yum install -y openssh-clients openssh-server" did not complete successfully: exit code: 1

ChenSammi avatar Jul 04 '24 08:07 ChenSammi

failed to solve: process "/bin/sh -c sudo yum install -y openssh-clients openssh-server" did not complete successfully: exit code: 1

@ChenSammi please see #6893. This should be OK after merging from master.

adoroszlai avatar Jul 04 '24 09:07 adoroszlai

Looks like all comments are addressed. HDDS-11092 is merged into HDDS-7593 so the previous error is no longer seen.

jojochuang avatar Jul 29 '24 21:07 jojochuang

Thanks @smengcl @jojochuang @adoroszlai for the review.

ChenSammi avatar Aug 13 '24 08:08 ChenSammi