snarkOS icon indicating copy to clipboard operation
snarkOS copied to clipboard

[Bug] Abnormal disconnection caused by asynchronous Ping message

Open elderhammer opened this issue 1 year ago • 5 comments

Steps to Reproduce

1.Receive a Pong message from the peer and create an asynchronous task to send a Ping message to the peer after PING_SLEEP_IN_SECS seconds.

2024-06-25T07:25:55.802800Z TRACE snarkos_node_router::inbound: Received 'Pong' from '127.0.0.1:4132'

https://github.com/AleoNet/snarkOS/blob/09aa62b4b877c5e96689ba2e0c77d5fd1c69ae7c/node/src/client/router.rs#L248-L266 2.Periodic refresh of peers.

2024-06-25T07:26:07.870256Z  INFO snarkos_node_router::heartbeat: Disconnecting from '127.0.0.1:4132' (periodic refresh of peers)
2024-06-25T07:26:07.870268Z TRACE snarkos_node_router::outbound: Sending 'Disconnect' to '127.0.0.1:4132'

3.Receive a connection request from a peer

2024-06-25T07:26:15.775172Z DEBUG tcp{name="0"}: snarkos_node_tcp::tcp: Received a connection from 127.0.0.1:51296

4.Complete the handshake, and add the peer to the router.

2024-06-25T07:26:15.775550Z DEBUG tcp{name="0"}: snarkos_node_tcp::protocols::handshake: shaking hands with 127.0.0.1:51296 as the Responder
2024-06-25T07:26:15.775572Z DEBUG snarkos_node_router::handshake: Received a connection request from '127.0.0.1:51296'
2024-06-25T07:26:15.789750Z TRACE snarkos_node_router::handshake: Received 'ChallengeRequest' from '127.0.0.1:51296'
2024-06-25T07:26:15.790940Z TRACE snarkos_node_router::handshake: Sending 'ChallengeResponse' to '127.0.0.1:51296'
2024-06-25T07:26:15.791066Z TRACE snarkos_node_router::handshake: Sending 'ChallengeRequest' to '127.0.0.1:51296'
2024-06-25T07:26:15.801582Z TRACE snarkos_node_router::handshake: Received 'ChallengeResponse' from '127.0.0.1:51296'
2024-06-25T07:26:15.804149Z DEBUG tcp{name="0"}: snarkos_node_tcp::protocols::handshake: successfully handshaken with 127.0.0.1:51296

https://github.com/AleoNet/snarkOS/blob/09aa62b4b877c5e96689ba2e0c77d5fd1c69ae7c/node/router/src/handshake.rs#L275-L276 5.Create asynchronous tasks for processing, reading, and writing

2024-06-25T07:26:15.804489Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::reading: spawned a task for processing messages from 127.0.0.1:51296

6.Asynchronous task sleep ends, send Ping message

2024-06-25T07:26:15.804961Z TRACE snarkos_node_router::outbound: Sending 'Ping' to '127.0.0.1:4132'

7.Because the peer has been added to the router, can_send returns true and the Ping message is sent. However, because the writing asynchronous task has not been created, the sending fails.

2024-06-25T07:26:15.804980Z  WARN snarkos_node_router::outbound: Failed to send 'Ping' to '127.0.0.1:4132': not connected
2024-06-25T07:26:15.804988Z DEBUG snarkos_node_router::outbound: Disconnecting from '127.0.0.1:4132' (unable to send)
2024-06-25T07:26:15.805060Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 127.0.0.1:51296
2024-06-25T07:26:15.805072Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

https://github.com/AleoNet/snarkOS/blob/09aa62b4b877c5e96689ba2e0c77d5fd1c69ae7c/node/router/src/outbound.rs#L48-L59 8.The reading and writing asynchronous tasks are created later.

2024-06-25T07:26:15.805089Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::reading: spawned a task for reading messages from 127.0.0.1:51296
2024-06-25T07:26:15.805254Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::writing: spawned a task for writing messages to 127.0.0.1:51296

This will cause two problems:

  • Asynchronous tasks such as processing, reading, and writing are not terminated correctly
  • Because the peer believes that the connection is complete, it will reject subsequent connections from the current node
2024-06-25T07:26:32.872315Z DEBUG tcp{name="1"}: snarkos_node_tcp::tcp: Received a connection from 127.0.0.1:57797
2024-06-25T07:26:32.872716Z DEBUG tcp{name="1"}: snarkos_node_tcp::protocols::handshake: shaking hands with 127.0.0.1:57797 as the Responder
2024-06-25T07:26:32.872755Z DEBUG snarkos_node_router::handshake: Received a connection request from '127.0.0.1:57797'
2024-06-25T07:26:32.873158Z TRACE snarkos_node_router::handshake: Received 'ChallengeRequest' from '127.0.0.1:57797'
2024-06-25T07:26:32.873211Z ERROR tcp{name="1"}: snarkos_node_tcp::protocols::handshake: handshake with 127.0.0.1:57797 failed: Dropping connection request from '127.0.0.1:4202' (already connected)
2024-06-25T07:26:32.873252Z ERROR tcp{name="1"}: snarkos_node_tcp::tcp: Failed to connect with 127.0.0.1:57797: Dropping connection request from '127.0.0.1:4202' (already connected)

Expected Behavior

This shouldn't happen

Your Environment

snarkOS Version: 09aa62b4b877c5e96689ba2e0c77d5fd1c69ae7c

elderhammer avatar Jun 25 '24 09:06 elderhammer

@ljedrz might be of interest to you

elderhammer avatar Jun 25 '24 10:06 elderhammer

Very cool find; I think I know how to tackle this.

ljedrz avatar Jun 25 '24 10:06 ljedrz

Ah yes, it seems that https://github.com/AleoNet/snarkOS/pull/2690 will need to be reopened; a workaround where we would abort ping tasks on disconnect would also work, but the real solution is to delay a peer's promotion to the connected status.

ljedrz avatar Jun 25 '24 10:06 ljedrz

On testnet beta, this issue occurs frequently, caused by the UnconfirmedTransaction and UnconfirmedSolution messages.

2024-07-04T10:41:42.069597Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '89.116.29.150:4130': not connected
2024-07-04T10:41:42.069659Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 89.116.29.150:58433
2024-07-04T10:41:42.069667Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

2024-07-04T13:19:33.127551Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '211.171.42.223:4130': not connected
2024-07-04T13:19:33.127618Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 211.171.42.223:55251
2024-07-04T13:19:33.127629Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

2024-07-04T19:04:50.829665Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedSolution' to '157.66.54.66:4130': not connected
2024-07-04T19:04:50.829711Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 157.66.54.66:56847
2024-07-04T19:04:50.829719Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

2024-07-04T23:09:06.395996Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedSolution' to '118.140.26.162:4130': not connected
2024-07-04T23:09:06.396038Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 118.140.26.162:47970
2024-07-04T23:09:06.396046Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

2024-07-05T01:29:44.276617Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '104.237.2.224:4130': not connected      
2024-07-05T01:29:44.276679Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 104.237.2.224:4130                
2024-07-05T01:29:44.276691Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)  

2024-07-05T01:36:28.543874Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '34.17.53.129:4130': not connected
2024-07-05T01:36:28.544005Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 34.17.53.129:4130
2024-07-05T01:36:28.544012Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)
2024-07-05T01:36:32.919219Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:32.919232Z  WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists
2024-07-05T01:36:37.919752Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:37.919782Z  WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists
2024-07-05T01:36:42.920965Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:42.920990Z  WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists
2024-07-05T01:36:47.921713Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:47.921728Z  WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists

elderhammer avatar Jul 05 '24 06:07 elderhammer

https://github.com/AleoNet/snarkOS/pull/3332 should fix this.

ljedrz avatar Jul 10 '24 16:07 ljedrz