cardano-node icon indicating copy to clipboard operation
cardano-node copied to clipboard

[BUG] - Connection Handshake Error and Promotion Failure of Cold Big Ledger Peers in Cardano Node v8.9.2 Docker Image

Open astroz777 opened this issue 10 months ago • 8 comments

Internal Issue

Area: Node Connection Management

Summary: The Cardano node experiences connectivity issues with establishing and maintaining peer connections. These issues result in handshake errors and connection failures. Additionally, promotion attempts of cold big ledger peers fail due to connection errors, indicating potential issues with the availability of remote nodes. These problems significantly affect the node's performance, leading to a decrease in download speed by 90% and the validation of only one slot per second, particularly after reaching 120 million slots.

Steps to reproduce:

1.Download the Docker image ghcr.io/intersectmbo/cardano-node:8.9.2. 2.Use Docker to create a container. 3.Monitor the logs for errors related to connection management and promotion of cold big ledger peers.

System info:

OS Name: Ubuntu OS Version: 22.04.1 Node version: Docker image ghcr.io/intersectmbo/cardano-node:8.9.2. CLI version: Docker image ghcr.io/intersectmbo/cardano-node:8.9.2. Docker version 26.1.0, build 9714adc

Additional context: The logs indicate various errors, including handshake errors during connection establishment and connection refusal while attempting to promote cold big ledger peers. These errors suggest potential issues with the availability or accessibility of remote nodes, impacting the node's overall performance.

Initial Logs:

Logs: 2024-04-23 18:46:27 [8c5f5845:cardano.node.ChainDB:Info:5] [2024-04-23 21:46:27.09 UTC] Closed db with immutable tip at 0c08287b680a798aa7176d83fbacfa8bb196e62b18bd904037d497bf237fd3a0 at slot 122296839 and tip 6c2a171daf4d0350e1995fd718e79dbeb170f1a5403764c1605dfa64a1e662b6 at slot 122342476 2024-04-23 18:46:27 2024-04-23 18:46:27 Shutting down.. 2024-04-23 18:46:27 cardano-node: thread killed 2024-04-23 18:59:36 Starting: exec /nix/store/9ph9pp8d21y17k95mhgk8jk2z2036dhj-cardano-node-exe-cardano-node-8.9.2/bin/cardano-node run 2024-04-23 18:59:36 --config /nix/store/76zh2cy72zz8bxh7k5c4v7rgiba9qrk1-config-0-0.json 2024-04-23 18:59:36 --database-path /data/db 2024-04-23 18:59:36 --topology /nix/store/czj80vj5a9vg1xdg3sgqpfj0hc8zv3pb-topology.yaml 2024-04-23 18:59:36 --host-addr 0.0.0.0 2024-04-23 18:59:36 --port 3001 2024-04-23 18:59:36 --socket-path /ipc/node.socket Connection Handshake Error: Logs:

[8c5f5845:cardano.node.ConnectionManager:Info:209] [2024-04-23 21:27:11.24 UTC] TrConnectionHandler (ConnectionId {localAddress = 172.17.0.2:3001, remoteAddress = 13.41.234.173:3001}) (TrConnectionHandlerError OutboundError ExceededTimeLimit ...) Summary: A connection handshake error occurred while attempting to establish a connection with a remote node. The error message indicates that the handshake process exceeded the time limit.

Promotion Failure of Cold Big Ledger Peer: Logs:

[8c5f5845:cardano.node.PeerSelection:Info:172] [2024-04-23 21:27:24.28 UTC] TracePromoteColdBigLedgerPeerFailed 10 9 3.216.77.109:3001 9.624079842171s Network.Socket.connect: <socket: 45>: does not exist (Connection refused) Summary: The attempt to promote a cold big ledger peer failed due to a connection error. The connection was refused, indicating a possible issue with the availability of the remote node.

Is there any way to proceed? How can we address the issue to resume the chain synchronization process?

astroz777 avatar Apr 23 '24 22:04 astroz777

I'd like to provide additional information regarding an issue I encountered. It appears that whenever I attempt to reconnect with peers, my network experiences some form of restriction or blockage. Consequently, the connection becomes significantly slow. Yesterday, I closed the node and reopened it today. Initially, it functioned smoothly for approximately three minutes, during which time it successfully downloaded and validated 92,538 slots at a rate of 514 per second. However, shortly thereafter, the connection error with peers resurfaced, causing a drastic slowdown in both download and validation speeds to just one slot per second.

Furthermore, I should mention that I am utilizing a Ryzen 5 5600X processor and 32 GB of RAM. Below, I will include the logs from the last slot validated during the "fast mode" until the subsequent slot, which experienced a slowdown.

2024-04-24 22:36:50 [8c5f5845:cardano.node.GSM:Info:146] [2024-04-25 01:36:50.22 UTC] fromList [("currentSelection",Object (fromList [("kind",String "Tip"),("tipBlockNo",Number 1.0230401e7),("tipHash",String "722d22bb1f25e440d57b5e453d977a01f3be4960d90091420cea4aa0ce9dc254"),("tipSlotNo",Number 1.22442706e8)])),("kind",String "GsmEventEnterCaughtUp"),("peerNumber",Number 2.0)]
2024-04-24 22:36:50 [8c5f5845:cardano.node.PeerSelection:Notice:160] [2024-04-25 01:36:50.22 UTC] TraceLedgerStateJudgementChanged YoungEnough
2024-04-24 22:36:50 [8c5f5845:cardano.node.PeerSelectionCounters:Info:160] [2024-04-25 01:36:50.22 UTC] PeerSelectionCounters {coldPeers = 0, warmPeers = 1, hotPeers = 2, coldBigLedgerPeers = 0, warmBigLedgerPeers = 0, hotBigLedgerPeers = 0, localRoots = []}
2024-04-24 22:36:50 [8c5f5845:cardano.node.PeerSelection:Info:160] [2024-04-25 01:36:50.22 UTC] TraceBigLedgerPeersRequest 15 0
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.22 UTC] [String "FetchingNewLedgerState",Number 529.0,Number 2842.0]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessAddress \"13.208.79.46\" 6000",String "PickedBigLedgerPeer",Number 8.216508481077716e-4]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessDomain \"relays.ektrp.com\" 5859",String "PickedBigLedgerPeer",Number 1.5988289600208803e-3]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessAddress \"52.167.20.127\" 4000",String "PickedBigLedgerPeer",Number 9.952803016943504e-4]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessAddress \"3.228.183.84\" 3001",String "PickedBigLedgerPeer",Number 1.477863825118689e-3]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessDomain \"cardano-relay-1.upbit.com\" 30800",String "PickedBigLedgerPeer",Number 1.213966274671122e-3]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessAddress \"52.177.36.96\" 3000",String "PickedBigLedgerPeer",Number 2.0126865838078914e-3]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessDomain \"ca.relays.cardanians.io\" 1000",String "PickedBigLedgerPeer",Number 3.0810352060967883e-3]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessDomain \"r1.isp-r1.wjg.jp\" 3001",String "PickedBigLedgerPeer",Number 9.45254952409071e-4]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessAddress \"149.102.140.196\" 6000",String "PickedBigLedgerPeer",Number 3.1429371641176667e-3]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessAddress \"34.249.11.89\" 3001",String "PickedBigLedgerPeer",Number 6.547682720921369e-4]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessDomain \"cardano-relays-2.nu.fi\" 3001",String "PickedBigLedgerPeer",Number 9.85369740943107e-4]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessAddress \"3.222.153.137\" 3001",String "PickedBigLedgerPeer",Number 1.4786498276707937e-3]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessAddress \"54.36.178.85\" 6000",String "PickedBigLedgerPeer",Number 2.773403186912947e-3]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessDomain \"relay1.ada-stake.com\" 3001",String "PickedBigLedgerPeer",Number 1.06290992637487e-3]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "RelayAccessDomain \"relay1.kaizn.kaizencrypto.com\" 6000",String "PickedBigLedgerPeer",Number 1.2318284900528509e-3]
2024-04-24 22:36:50 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:50.25 UTC] [String "[RelayAccessDomain \"relay1.kaizn.kaizencrypto.com\" 6000,RelayAccessDomain \"relay1.ada-stake.com\" 3001,RelayAccessAddress \"54.36.178.85\" 6000,RelayAccessAddress \"3.222.153.137\" 3001,RelayAccessDomain \"cardano-relays-2.nu.fi\" 3001,RelayAccessAddress \"34.249.11.89\" 3001,RelayAccessAddress \"149.102.140.196\" 6000,RelayAccessDomain \"r1.isp-r1.wjg.jp\" 3001,RelayAccessDomain \"ca.relays.cardanians.io\" 1000,RelayAccessAddress \"52.177.36.96\" 3000,RelayAccessDomain \"cardano-relay-1.upbit.com\" 30800,RelayAccessAddress \"3.228.183.84\" 3001,RelayAccessAddress \"52.167.20.127\" 4000,RelayAccessDomain \"relays.ektrp.com\" 5859,RelayAccessAddress \"13.208.79.46\" 6000]",Number 15.0,Number 15.0,String "PickedBigLedgerPeers"]
2024-04-24 22:36:51 [8c5f5845:cardano.node.LedgerPeers:Info:158] [2024-04-25 01:36:51.50 UTC] [String "UseLedgerPeers",Number 1.16812831e8]
2024-04-24 22:36:51 [8c5f5845:cardano.node.PeerSelection:Info:160] [2024-04-25 01:36:51.50 UTC] TraceBigLedgerPeersResults (fromList [51.89.97.27:1000,52.177.36.96:3000,34.147.178.39:3001,207.154.243.52:3001,3.228.183.84:3001,34.249.11.89:3001,3.222.153.137:3001,62.112.8.253:3001,52.167.20.127:4000,194.99.21.198:5859,13.208.79.46:6000,54.36.178.85:6000,149.102.140.196:6000,54.38.153.200:6000,15.164.112.189:30800]) 0 5s
2024-04-24 22:36:51 [8c5f5845:cardano.node.PeerSelectionCounters:Info:160] [2024-04-25 01:36:51.50 UTC] PeerSelectionCounters {coldPeers = 0, warmPeers = 1, hotPeers = 2, coldBigLedgerPeers = 15, warmBigLedgerPeers = 0, hotBigLedgerPeers = 0, localRoots = []}
2024-04-24 22:36:51 [8c5f5845:cardano.node.PeerSelection:Info:160] [2024-04-25 01:36:51.50 UTC] TracePromoteColdBigLedgerPeers 10 0 (fromList [52.177.36.96:3000,34.147.178.39:3001,207.154.243.52:3001,3.228.183.84:3001,34.249.11.89:3001,62.112.8.253:3001,194.99.21.198:5859,13.208.79.46:6000,54.36.178.85:6000,149.102.140.196:6000])
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:482] [2024-04-25 01:36:51.50 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 3})
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:483] [2024-04-25 01:36:51.50 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 3})
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:485] [2024-04-25 01:36:51.50 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 3})
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:484] [2024-04-25 01:36:51.50 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 3})
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:486] [2024-04-25 01:36:51.50 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 3})
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:488] [2024-04-25 01:36:51.50 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 3})
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:487] [2024-04-25 01:36:51.50 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 3})
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:489] [2024-04-25 01:36:51.50 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 3})
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:491] [2024-04-25 01:36:51.50 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 3})
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:490] [2024-04-25 01:36:51.50 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 3})
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:490] [2024-04-25 01:36:51.72 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 4})
2024-04-24 22:36:51 [8c5f5845:cardano.node.Handshake:Info:492] [2024-04-25 01:36:51.72 UTC] WithMuxBearer (ConnectionId {localAddress = 172.17.0.2:3001, remoteAddress = 54.36.178.85:6000}) Send (ClientAgency TokPropose,MsgProposeVersions (fromList [(NodeToNodeV_7,TList [TInt 764824073,TBool False]),(NodeToNodeV_8,TList [TInt 764824073,TBool False]),(NodeToNodeV_9,TList [TInt 764824073,TBool False]),(NodeToNodeV_10,TList [TInt 764824073,TBool False]),(NodeToNodeV_11,TList [TInt 764824073,TBool False,TInt 0,TBool False]),(NodeToNodeV_12,TList [TInt 764824073,TBool False,TInt 0,TBool False]),(NodeToNodeV_13,TList [TInt 764824073,TBool False,TInt 0,TBool False])]))
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:483] [2024-04-25 01:36:51.73 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 5})
2024-04-24 22:36:51 [8c5f5845:cardano.node.Handshake:Info:493] [2024-04-25 01:36:51.73 UTC] WithMuxBearer (ConnectionId {localAddress = 172.17.0.2:3001, remoteAddress = 34.147.178.39:3001}) Send (ClientAgency TokPropose,MsgProposeVersions (fromList [(NodeToNodeV_7,TList [TInt 764824073,TBool False]),(NodeToNodeV_8,TList [TInt 764824073,TBool False]),(NodeToNodeV_9,TList [TInt 764824073,TBool False]),(NodeToNodeV_10,TList [TInt 764824073,TBool False]),(NodeToNodeV_11,TList [TInt 764824073,TBool False,TInt 0,TBool False]),(NodeToNodeV_12,TList [TInt 764824073,TBool False,TInt 0,TBool False]),(NodeToNodeV_13,TList [TInt 764824073,TBool False,TInt 0,TBool False])]))
2024-04-24 22:36:51 [8c5f5845:cardano.node.ConnectionManager:Info:487] [2024-04-25 01:36:51.73 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 3, unidirectionalConns = 0, inboundConns = 0, outboundConns = 6})```

astroz777 avatar Apr 25 '24 01:04 astroz777

Thanks, @astroz777, for your report. Could you check if the useLedgerAfterSlot in the topology.json file doesn't point to a too old slot? If the node tries to use the ledger peers too early, it will get outdated information for the ledger about relays that are no longer maintained.

coot avatar Apr 25 '24 08:04 coot

Hello! @coot Thanks for the help. I'm using the default topology from the Docker image. Should I change it to another one, or just increase the useLedgerAfterSlot number to a higher value?"

{"bootstrapPeers":[{"address":"backbone.cardano.iog.io","port":3001},{"address":"backbone.mainnet.emurgornd.com","port":3001}],"localRoots":[],"publicRoots":[],"useLedgerAfterSlot":116812831}

astroz777 avatar Apr 25 '24 13:04 astroz777

You're useLedgerAfterSlot seems fine. Could you share the logs when the incident happened? You can send them privately to my email [email protected] or on matrix @the-coot:matrix.org.

coot avatar Apr 26 '24 10:04 coot

It seems I'm encountering a similar issue. However, my logs are different. here's my logs

[sanchone:cardano.node.LocalInboundGovernor:Info:60] [2024-05-17 04:40:30.12 UTC] TrMuxErrored (ConnectionId {localAddress = LocalAddress "/opt/cardano/cnode/sockets/node.socket", remoteAddress = LocalAddress "/opt/cardano/cnode/sockets/node.socket@236"}) (MuxError MuxBearerClosed "<socket: 93> closed when reading data, waiting on next header True")
[sanchone:cardano.node.LocalConnectionManager:Info:2500] [2024-05-17 04:40:30.12 UTC] TrConnectionHandler (ConnectionId {localAddress = LocalAddress "/opt/cardano/cnode/sockets/node.socket", remoteAddress = LocalAddress "/opt/cardano/cnode/sockets/node.socket@236"}) (TrConnectionHandlerError InboundError (MuxError MuxBearerClosed "<socket: 93> closed when reading data, waiting on next header True") ShutdownPeer)

ironmask285 avatar May 17 '24 04:05 ironmask285

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions[bot] avatar Jun 17 '24 01:06 github-actions[bot]

@ironmask285 <socket: 93> closed when reading data, waiting on next header True - this means the remote closed the connection. There's nothing wrong with that.

I formatted your message with triple quotes just for readability.

coot avatar Jul 01 '24 10:07 coot

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions[bot] avatar Aug 01 '24 01:08 github-actions[bot]