Target node of NetworkStartStopInterface nemesis lost connectivity with other cluster nodes after stop/start of a network interface
Installation details
Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 5.1.0~rc1-20220902.d10aee15e7e9 with build-id c127c717ecffa082ce97b94100d62b2549abe486
Relocatable Package: http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.1/scylla-x86_64-package-5.1.0-rc1.0.20220902.d10aee15e7e9.tar.gz
Cluster size: 6 nodes (i3.2xlarge)
Scylla Nodes used in this run:
- longevity-200gb-48h-network-2022-2-db-node-a8763342-6 (52.50.122.8 | 10.4.2.245) (shards: 7)
- longevity-200gb-48h-network-2022-2-db-node-a8763342-5 (34.254.31.220 | 10.4.3.132) (shards: 7)
- longevity-200gb-48h-network-2022-2-db-node-a8763342-4 (3.250.254.20 | 10.4.3.7) (shards: 7)
- longevity-200gb-48h-network-2022-2-db-node-a8763342-3 (63.34.238.51 | 10.4.2.46) (shards: 7)
- longevity-200gb-48h-network-2022-2-db-node-a8763342-2 (52.31.220.37 | 10.4.1.211) (shards: 7)
- longevity-200gb-48h-network-2022-2-db-node-a8763342-1 (54.77.2.178 | 10.4.3.65) (shards: 7)
OS / Image: ami-03f1a2d7537cf2e31 (aws: eu-west-1)
Test: longevity-200gb-48h-network-monkey-test
Test id: a8763342-4c58-4f97-849c-91f14d075d82
Test name: enterprise-2022.2/longevity/longevity-200gb-48h-network-monkey-test
Test config file(s):
Issue description
At 14:19:20, eth1 of node-2 was brought down, and at 14:20:20 it was brought back up:
2022-09-18T14:19:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 !NOTICE | sudo[13483]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/sbin/ip link set eth1 down
2022-09-18T14:19:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 !INFO | systemd-networkd[424]: eth1: Link DOWN
2022-09-18T14:19:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 !INFO | systemd-networkd[424]: eth1: Lost carrier
2022-09-18T14:20:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 !NOTICE | sudo[13504]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/sbin/ip link set eth1 up
2022-09-18T14:20:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 !INFO | systemd-networkd[424]: eth1: Link UP
2022-09-18T14:20:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 !INFO | kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
2022-09-18T14:20:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 !INFO | systemd-networkd[424]: eth1: Gained carrier
But Scylla on this node was not able to restore connectivity w/ other nodes:
$ /usr/bin/nodetool -u cassandra -pw 'cassandra'
Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.4.3.132 162.72 GB 256 ? 6c58c3f8-92e2-42f8-8235-01e989d86a69 1a
UN 10.4.2.245 160.35 GB 256 ? fddf7a4d-219f-4d41-9320-744386cca8a5 1a
UN 10.4.3.7 161.3 GB 256 ? 6c9e7b70-1bb1-4e34-940e-97e5303cd4c4 1a
UN 10.4.3.65 161.89 GB 256 ? c2540cc5-a3e0-4b00-9258-c925ed6bc68b 1a
DN 10.4.1.211 156.63 GB 256 ? e616e82a-6cb8-423d-a6b6-9a43c68f5239 1a
UN 10.4.2.46 157.83 GB 256 ? baf5e471-f458-4eeb-9a95-3d509a7bc696 1a
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
But on node-2:
$ /usr/bin/nodetool -u cassandra -pw 'cassandra'
Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 10.4.3.132 156.11 GB 256 ? 6c58c3f8-92e2-42f8-8235-01e989d86a69 1a
DN 10.4.2.245 152.54 GB 256 ? fddf7a4d-219f-4d41-9320-744386cca8a5 1a
DN 10.4.3.7 155.06 GB 256 ? 6c9e7b70-1bb1-4e34-940e-97e5303cd4c4 1a
DN 10.4.3.65 155.23 GB 256 ? c2540cc5-a3e0-4b00-9258-c925ed6bc68b 1a
UN 10.4.1.211 159.99 GB 256 ? e616e82a-6cb8-423d-a6b6-9a43c68f5239 1a
DN 10.4.2.46 149.72 GB 256 ? baf5e471-f458-4eeb-9a95-3d509a7bc696 1a
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
Also, there are the following messages in Scylla's log on node-2:
Sep 18 14:20:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 90 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:22:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 180 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:23:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 270 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:25:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 360 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:26:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 450 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:28:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 540 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:29:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 630 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:31:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 720 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:32:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 810 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:34:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 900 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:35:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 990 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:37:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 1080 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:38:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 1170 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:40:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 1260 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:41:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 1350 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:43:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 1440 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:44:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 1530 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:46:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 1620 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:47:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 1710 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:49:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 1800 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:50:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 1890 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:52:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 1980 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:53:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 2070 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:55:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 2160 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:56:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 2250 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:58:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 2340 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:59:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 2430 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 15:01:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than 2520 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
- Restore Monitor Stack command:
$ hydra investigate show-monitor a8763342-4c58-4f97-849c-91f14d075d82 - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs a8763342-4c58-4f97-849c-91f14d075d82
Logs:
- db-cluster-a8763342.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/a8763342-4c58-4f97-849c-91f14d075d82/20220918_150136/db-cluster-a8763342.tar.gz
- monitor-set-a8763342.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/a8763342-4c58-4f97-849c-91f14d075d82/20220918_150136/monitor-set-a8763342.tar.gz
- loader-set-a8763342.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/a8763342-4c58-4f97-849c-91f14d075d82/20220918_150136/loader-set-a8763342.tar.gz
- sct-runner-a8763342.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/a8763342-4c58-4f97-849c-91f14d075d82/20220918_150136/sct-runner-a8763342.tar.gz
According to QA, this is not a new test, and it worked in previous releases (verified on 5.0).
From logs:
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 4] gossip - failure_detector_loop: Send echo to node 10.4.3.65, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 4] gossip - failure_detector_loop: Mark node 10.4.3.65 as DOWN
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] gossip - InetAddress 10.4.3.65 is now DOWN, status = NORMAL
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 2] gossip - failure_detector_loop: Send echo to node 10.4.3.132, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 2] gossip - failure_detector_loop: Mark node 10.4.3.132 as DOWN
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] gossip - InetAddress 10.4.3.132 is now DOWN, status = NORMAL
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 1] gossip - failure_detector_loop: Send echo to node 10.4.2.46, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 1] gossip - failure_detector_loop: Mark node 10.4.2.46 as DOWN
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] gossip - InetAddress 10.4.2.46 is now DOWN, status = NORMAL
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 3] gossip - failure_detector_loop: Send echo to node 10.4.3.7, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 3] gossip - failure_detector_loop: Mark node 10.4.3.7 as DOWN
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] gossip - InetAddress 10.4.3.7 is now DOWN, status = NORMAL
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 3] rpc - client 10.4.3.7:7001: client connection dropped: read: Connection reset by peer
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] gossip - failure_detector_loop: Send echo to node 10.4.2.245, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] gossip - failure_detector_loop: Mark node 10.4.2.245 as DOWN
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]: [shard 0] gossip - InetAddress 10.4.2.245 is now DOWN, status = NORMAL
and there are no more messages from the gossiper since then.
Actually, since this time there are only messages from compaction, commitlog shutdown and some LeveledManifest. Nothing more.
@asias, how is the gossiper supposed to bring nodes back to the UP state? Does it wait for a message from them, or does it ping the nodes on its own?
@enaydanov , can we re-run this with --logger-log-level gossip=debug ?
@xemul yep, I've started a run: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/evgeniy/job/longevity-200gb-48h-network-monkey-test/2/
It failed too:
The target node for NetworkStartStopInterface was node-4 this time:
2022-09-22T16:21:34+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 !NOTICE | sudo[13114]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/sbin/ip link set eth1 down
2022-09-22T16:21:34+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 !INFO | systemd-networkd[426]: eth1: Link DOWN
2022-09-22T16:21:35+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 !INFO | systemd-networkd[426]: eth1: Lost carrier
2022-09-22T16:29:55+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 !NOTICE | sudo[13233]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/sbin/ip link set eth1 up
2022-09-22T16:29:55+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 !INFO | systemd-networkd[426]: eth1: Link UP
2022-09-22T16:29:55+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 !INFO | kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
2022-09-22T16:29:55+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 !INFO | systemd-networkd[426]: eth1: Gained carrier
Gossip logs on this node at this time:
...
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[10.4.1.145, 10.4.2.154, 10.4.3.196, 10.4.0.248, 10.4.2.225], endpoints_to_talk_with=[{10.4.1.145}, {10.4.2.154}, {10.4.3.196}, {10.4.0.248}, {10.4.2.225}]
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.1.145}
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - apply_state_locally_endpoints=[10.4.0.248, 10.4.1.145, 10.4.2.225, 10.4.2.154]
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - apply_state_locally() took 1 ms
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Process gossip ack msg digests from node 10.4.1.145:0, ack_msg_digest=[10.4.2.247:1663856208:7836]
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Calling do_send_ack2_msg to node 10.4.1.145:0, ack_msg_digest=[10.4.2.247:1663856208:7836], ack2_msg=endpoint_state:{[10.4.2.247->HeartBeatState = { generation = 1663856208, version = 7837 }, AppStateMap =]}
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - No more queued gossip ack msg digests from node 10.4.1.145:0, ack_msg_digest={}, pending=false
Sep 22 16:21:35 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.2.154}
Sep 22 16:21:35 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 1] gossip - failure_detector_loop: Send echo to node 10.4.1.145, status = started
Sep 22 16:21:35 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 4] gossip - failure_detector_loop: Send echo to node 10.4.3.196, status = started
Sep 22 16:21:35 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 2] gossip - failure_detector_loop: Send echo to node 10.4.2.154, status = started
Sep 22 16:21:35 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - failure_detector_loop: Send echo to node 10.4.0.248, status = started
Sep 22 16:21:36 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 3] gossip - failure_detector_loop: Send echo to node 10.4.2.225, status = started
Sep 22 16:21:36 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.3.196}
Sep 22 16:21:37 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:21:38 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.2.225}
Sep 22 16:21:39 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[10.4.3.196, 10.4.0.248, 10.4.2.154, 10.4.2.225, 10.4.1.145], endpoints_to_talk_with=[{10.4.3.196}, {10.4.0.248}, {10.4.2.154}, {10.4.2.225}, {10.4.1.145}]
Sep 22 16:21:39 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.3.196}
Sep 22 16:21:40 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:21:41 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.2.154}
Sep 22 16:21:42 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.2.225}
Sep 22 16:21:43 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.1.145}
Sep 22 16:21:44 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[10.4.2.154, 10.4.1.145, 10.4.2.225, 10.4.0.248, 10.4.3.196], endpoints_to_talk_with=[{10.4.2.154}, {10.4.1.145}, {10.4.2.225}, {10.4.0.248}, {10.4.3.196}]
Sep 22 16:21:44 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.2.154}
Sep 22 16:21:45 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.1.145}
Sep 22 16:21:46 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.2.225}
Sep 22 16:21:47 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:21:48 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.3.196}
Sep 22 16:21:49 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[10.4.0.248, 10.4.2.225, 10.4.3.196, 10.4.2.154, 10.4.1.145], endpoints_to_talk_with=[{10.4.0.248}, {10.4.2.225}, {10.4.3.196}, {10.4.2.154}, {10.4.1.145}]
Sep 22 16:21:49 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:21:50 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.2.225}
Sep 22 16:21:51 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.3.196}
Sep 22 16:21:52 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.2.154}
Sep 22 16:21:53 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.1.145}
Sep 22 16:21:54 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[10.4.2.225, 10.4.1.145, 10.4.2.154, 10.4.0.248, 10.4.3.196], endpoints_to_talk_with=[{10.4.2.225}, {10.4.1.145}, {10.4.2.154}, {10.4.0.248}, {10.4.3.196}]
Sep 22 16:21:54 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.2.225}
Sep 22 16:21:55 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.1.145}
Sep 22 16:21:56 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.2.154}
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 1] gossip - failure_detector_loop: Send echo to node 10.4.1.145, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 1] gossip - failure_detector_loop: Mark node 10.4.1.145 as DOWN
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - InetAddress 10.4.1.145 is now DOWN, status = NORMAL
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 4] gossip - failure_detector_loop: Send echo to node 10.4.3.196, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 4] gossip - failure_detector_loop: Mark node 10.4.3.196 as DOWN
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - InetAddress 10.4.3.196 is now DOWN, status = NORMAL
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 2] gossip - failure_detector_loop: Send echo to node 10.4.2.154, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 2] gossip - failure_detector_loop: Mark node 10.4.2.154 as DOWN
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - InetAddress 10.4.2.154 is now DOWN, status = NORMAL
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - failure_detector_loop: Send echo to node 10.4.0.248, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - failure_detector_loop: Mark node 10.4.0.248 as DOWN
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - InetAddress 10.4.0.248 is now DOWN, status = NORMAL
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 3] gossip - failure_detector_loop: Send echo to node 10.4.2.225, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 3] gossip - failure_detector_loop: Mark node 10.4.2.225 as DOWN
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - InetAddress 10.4.2.225 is now DOWN, status = NORMAL
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - failure_detector_loop: previous_live_nodes=[10.4.0.248, 10.4.1.145, 10.4.2.154, 10.4.2.225, 10.4.3.196], current_live_nodes=[], nodes_down=[10.4.0.248, 10.4.1.145, 10.4.2.154, 10.4.2.225, 10.4.3.196]
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - failure_detector_loop: previous_live_nodes=[10.4.0.248, 10.4.1.145, 10.4.2.154, 10.4.2.225, 10.4.3.196], current_live_nodes=[], nodes_down=[10.4.0.248, 10.4.1.145, 10.4.2.154, 10.4.2.225, 10.4.3.196], version_before=21, version_after=21
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - failure_detector_loop: Wait until live_nodes=[] is not empty
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.3.196}
Sep 22 16:21:59 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - failure_detector_loop: Wait until live_nodes=[] is not empty
Sep 22 16:21:59 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[], endpoints_to_talk_with=[]
Sep 22 16:21:59 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - No live nodes yet: try initial contact point nodes={10.4.0.248}
Sep 22 16:21:59 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:22:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - failure_detector_loop: Wait until live_nodes=[] is not empty
Sep 22 16:22:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[], endpoints_to_talk_with=[]
Sep 22 16:22:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - No live nodes yet: try initial contact point nodes={10.4.0.248}
Sep 22 16:22:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:22:01 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - failure_detector_loop: Wait until live_nodes=[] is not empty
Sep 22 16:22:01 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[], endpoints_to_talk_with=[]
Sep 22 16:22:01 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - No live nodes yet: try initial contact point nodes={10.4.0.248}
Sep 22 16:22:01 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:22:02 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - failure_detector_loop: Wait until live_nodes=[] is not empty
Sep 22 16:22:02 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[], endpoints_to_talk_with=[]
Sep 22 16:22:02 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - No live nodes yet: try initial contact point nodes={10.4.0.248}
Sep 22 16:22:02 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:22:03 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - failure_detector_loop: Wait until live_nodes=[] is not empty
Sep 22 16:22:03 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]: [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[], endpoints_to_talk_with=[]
...
The last 4 messages repeat until the end of the log.
Installation details
Kernel version: 5.15.0-1019-aws
Scylla version (or git commit hash): 5.1.0~rc1-0.20220902.d10aee15e7e9 with build-id c127c717ecffa082ce97b94100d62b2549abe486
Relocatable package url: http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.1/scylla-x86_64-package-5.1.0-rc1.0.20220902.d10aee15e7e9.tar.gz
Cluster size: 6 nodes (i3.2xlarge)
Scylla running with shards number (live nodes):
longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-1 (34.241.64.42 | 10.4.0.248): 7 shards
longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-2 (34.250.224.180 | 10.4.2.154): 7 shards
longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-3 (18.200.176.129 | 10.4.3.196): 7 shards
longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 (52.213.54.137 | 10.4.2.247): 7 shards
longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-5 (176.34.235.243 | 10.4.1.145): 7 shards
longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-6 (18.200.177.169 | 10.4.2.225): 7 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-03f1a2d7537cf2e31 (aws: eu-west-1)
Test: longevity-200gb-48h-network-monkey-test
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Restore Monitor Stack command: $ hydra investigate show-monitor ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c
Test id: ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c
Logs:
- grafana - https://cloudius-jenkins-test.s3.amazonaws.com/ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c/20220922_170051/grafana-screenshot-longevity-200gb-48h-network-monkey-test-scylla-per-server-metrics-nemesis-20220922_170202-longevity-200gb-48h-network-net-mon-monitor-node-ce5ddf02-1.png
- grafana - https://cloudius-jenkins-test.s3.amazonaws.com/ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c/20220922_170051/grafana-screenshot-overview-20220922_170051-longevity-200gb-48h-network-net-mon-monitor-node-ce5ddf02-1.png
- db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c/20220922_171052/db-cluster-ce5ddf02.tar.gz
- loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c/20220922_171052/loader-set-ce5ddf02.tar.gz
- monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c/20220922_171052/monitor-set-ce5ddf02.tar.gz
- sct - https://cloudius-jenkins-test.s3.amazonaws.com/ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c/20220922_171052/sct-runner-ce5ddf02.tar.gz
But Scylla on this node was not able to restore connectivity w/ other nodes:
this node == which node here? You mentioned node-2; from which node did you get the following nodetool status?
$ /usr/bin/nodetool -u cassandra -pw 'cassandra'
Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.4.3.132 162.72 GB 256 ? 6c58c3f8-92e2-42f8-8235-01e989d86a69 1a
UN 10.4.2.245 160.35 GB 256 ? fddf7a4d-219f-4d41-9320-744386cca8a5 1a
UN 10.4.3.7 161.3 GB 256 ? 6c9e7b70-1bb1-4e34-940e-97e5303cd4c4 1a
UN 10.4.3.65 161.89 GB 256 ? c2540cc5-a3e0-4b00-9258-c925ed6bc68b 1a
DN 10.4.1.211 156.63 GB 256 ? e616e82a-6cb8-423d-a6b6-9a43c68f5239 1a
UN 10.4.2.46 157.83 GB 256 ? baf5e471-f458-4eeb-9a95-3d509a7bc696 1a
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
Is this a new test? Did this test pass previously?
In the test performed here: https://github.com/scylladb/scylladb/issues/11609#issuecomment-1257148947
we stopped the NIC on node-4 at 2022-09-22T16:21:34+00:00 and resumed it at 2022-09-22T16:29:55+00:00. Did we perform any other operations, e.g., iptables to block a specific node, or bringing down the NIC on other nodes?
But Scylla on this node was not able to restore connectivity w/ other nodes:
this node == which node here? You mentioned node-2; from which node did you get the following nodetool status?
(nodetool status output quoted above)
The rest of the nodes (node-1, node-3, etc.).
Is this a new test? Did this test pass previously?
Yes, it passed on 5.0: http://argus.scylladb.com/test/86894c00-21ea-44a7-b6a8-932a2152d343/runs?additionalRuns[]=c8caff93-a98a-4963-b668-4d074b909494
In the test performed here: #11609 (comment)
we stopped the NIC on node-4 at 2022-09-22T16:21:34+00:00 and resumed it at 2022-09-22T16:29:55+00:00. Did we perform any other operations, e.g., iptables to block a specific node, or bringing down the NIC on other nodes?
According to SCT code, no. It should be just a sleep:
https://github.com/scylladb/scylla-cluster-tests/blob/37a53dcd742e0fe996b57ad0f8df7d9349e29e06/sdcm/nemesis.py#L3020
[shard 0] gossip - Talk to live nodes: {10.4.0.248}
logger.debug("Talk to live nodes: {}", live_nodes);
for (auto& ep: live_nodes) {
    // Do it in the background.
    (void)do_gossip_to_live_member(message, ep).handle_exception([] (auto ep) {
        logger.trace("Failed to do_gossip_to_live_member: {}", ep);
    });
}
:disappointed: no gossiping errors are ever reported to logs/stats/metrics/whatever
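For illustration, a minimal sketch (an assumption about how this could be surfaced, not an actual patch) of the same loop with the failure handler raised from trace to warn, so these background gossip failures would at least show up in the logs:

for (auto& ep: live_nodes) {
    // Still fired in the background, exactly as in the snippet above.
    (void)do_gossip_to_live_member(message, ep).handle_exception([ep] (auto eptr) {
        // warn instead of trace, so the failure is visible at the default log level;
        // a metrics counter could also be bumped here (hypothetical).
        logger.warn("Failed to do_gossip_to_live_member to {}: {}", ep, eptr);
    });
}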
Checked on c5285bcb143a34b9bca6a6d167624237c994052d. After I put ip l s veth1 up, the gossiper eventually marks the node Up-Normal. There must be something with the setup, @enaydanov. Did anything change in the test or node setup since then?
Same for d10aee15e7e98354897ee18c58cf8d5fa0677212
@xemul Are you aware of any RPC issue in 5.1 that could make the send of a one-way message hang forever?
@enaydanov , can we re-run this with --logger-log-level gossip=trace , please?
ping @enaydanov
Started the run: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/evgeniy/job/longevity-200gb-48h-network-monkey-test/3/
@xemul, please, take a look. It's a run with --logger-log-level gossip=trace
Installation details
Kernel version: 5.15.0-1021-aws
Scylla version (or git commit hash): 2022.2.0~rc3-0.20221009.994a5f0fbb4c with build-id 756ea8d62c25ed4acdf087054e11b3d07596a117
Relocatable package url: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc3.0.20221009.994a5f0fbb4c.tar.gz
Cluster size: 6 nodes (i3.2xlarge)
Scylla running with shards number (live nodes):
longevity-200gb-48h-network-net-mon-db-node-2ae18c72-1 (54.171.53.38 | 10.4.3.39): 7 shards
longevity-200gb-48h-network-net-mon-db-node-2ae18c72-2 (34.254.138.201 | 10.4.3.87): 7 shards
longevity-200gb-48h-network-net-mon-db-node-2ae18c72-3 (54.78.104.138 | 10.4.0.230): 7 shards
longevity-200gb-48h-network-net-mon-db-node-2ae18c72-4 (52.19.158.242 | 10.4.0.25): 7 shards
longevity-200gb-48h-network-net-mon-db-node-2ae18c72-5 (52.215.84.146 | 10.4.2.124): 7 shards
longevity-200gb-48h-network-net-mon-db-node-2ae18c72-6 (54.229.105.254 | 10.4.1.158): 7 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0b9c9dd9d3af4cec6 (aws: eu-west-1)
Test: longevity-200gb-48h-network-monkey-test
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Issue description
Restore Monitor Stack command: $ hydra investigate show-monitor 2ae18c72-dee5-46da-a555-a41f33156257
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 2ae18c72-dee5-46da-a555-a41f33156257
Test id: 2ae18c72-dee5-46da-a555-a41f33156257
Logs:
- grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_132059/grafana-screenshot-longevity-200gb-48h-network-monkey-test-scylla-per-server-metrics-nemesis-20221106_132216-longevity-200gb-48h-network-net-mon-monitor-node-2ae18c72-1.png
- grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_132059/grafana-screenshot-overview-20221106_132059-longevity-200gb-48h-network-net-mon-monitor-node-2ae18c72-1.png
- critical - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/critical-2ae18c72.log.tar.gz
- db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/db-cluster-2ae18c72.tar.gz
- debug - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/debug-2ae18c72.log.tar.gz
- email_data - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/email_data-2ae18c72.json.tar.gz
- error - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/error-2ae18c72.log.tar.gz
- event - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/events-2ae18c72.log.tar.gz
- left_processes - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/left_processes-2ae18c72.log.tar.gz
- loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/loader-set-2ae18c72.tar.gz
- monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/monitor-set-2ae18c72.tar.gz
- normal - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/normal-2ae18c72.log.tar.gz
- output - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/output-2ae18c72.log.tar.gz
- event - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/raw_events-2ae18c72.log.tar.gz
- sct - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/sct-2ae18c72.log.tar.gz
- summary - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/summary-2ae18c72.log.tar.gz
- warning - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/warning-2ae18c72.log.tar.gz
@enaydanov IIUC it reproduced the issue?
OK, so this is what I see in the logs of node-1 (broadcast_address=10.4.3.39):
12:41:27 : [shard 3] gossip - failure_detector_loop: Send echo to node 10.4.2.124, status = failed: seastar::rpc::timeout_error (rpc call timed out)
12:41:27 : [shard 3] gossip - failure_detector_loop: Mark node 10.4.2.124 as DOWN
12:41:27 : [shard 0] gossip - marking as down 10.4.2.124
12:41:27 : [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL
It marked node-5 (10.4.2.124) as DOWN and (!) never marked it back as UP since then. However, it did receive a SYN msg from node-5 a bit later and tried to handle it:
12:41:28 : [shard 0] gossip - handle_syn_msg():from=10.4.2.124:0,cluster_name:peer=longevity-200gb-48h-network-net-mon-db-cluster-2ae18c72,local=longevity-200gb-4>
12:41:28 : [shard 0] gossip - Process gossip syn msg from node 10.4.2.124:0, syn_msg=cluster_id:longevity-200gb-48h-network-net-mon-db-cluster-2ae18c72,partioner:>
...
12:41:28 : [shard 0] gossip - Calling do_send_ack_msg to node 10.4.2.124:0, syn_msg=cluster_id:longevity-200gb-48h-network-net-mon-db-cluster-2ae18c72,partioner:o>
and it continued receiving SYN messages from it many times since then:
12:41:29 : [shard 0] gossip - handle_syn_msg():from=10.4.2.124:0,cluster_name:peer=longevity-200gb-48h-network-net-mon-db-cluster-2ae18c72,local=longevity-200gb-4>
12:41:29 : [shard 0] gossip - Queue gossip syn msg from node 10.4.2.124:0, syn_msg=cluster_id:longevity-200gb-48h-network-net-mon-db-cluster-2ae18c72,partioner:or>
but with no progress.
In node-5's logs I see symmetrical attempts to send SYNs, but no sign of receiving ACKs:
12:41:27 : [shard 0] gossip - InetAddress 10.4.3.39 is now DOWN, status = NORMAL
...
12:41:28 : [shard 0] gossip - Talk to live nodes: {10.4.3.39}
12:41:28 : [shard 0] gossip - Sending a GossipDigestSyn to 10.4.3.39:0 ...
State transitions (from all logs):
node-1/system.log:Nov 06 10:31:28 node-1 scylla[7789]: [shard 0] gossip - InetAddress 10.4.3.87 is now UP, status = UNKNOWN
node-1/system.log:Nov 06 10:36:00 node-1 scylla[7789]: [shard 0] gossip - InetAddress 10.4.0.230 is now UP, status = UNKNOWN
node-1/system.log:Nov 06 10:39:49 node-1 scylla[7789]: [shard 0] gossip - InetAddress 10.4.0.25 is now UP, status = UNKNOWN
node-1/system.log:Nov 06 10:44:17 node-1 scylla[7789]: [shard 0] gossip - InetAddress 10.4.2.124 is now UP, status = UNKNOWN
node-1/system.log:Nov 06 10:48:48 node-1 scylla[7789]: [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = UNKNOWN
node-1/system.log:Nov 06 12:06:20 node-1 scylla[7789]: [shard 0] gossip - InetAddress 10.4.1.158 is now DOWN, status = NORMAL
node-1/system.log:Nov 06 12:15:17 node-1 scylla[7789]: [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = NORMAL
node-1/system.log:Nov 06 12:41:27 node-1 scylla[7789]: [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL
node-2/system.log:Nov 06 10:31:28 node-2 scylla[8019]: [shard 0] gossip - InetAddress 10.4.3.39 is now UP, status = NORMAL
node-2/system.log:Nov 06 10:36:01 node-2 scylla[8019]: [shard 0] gossip - InetAddress 10.4.0.230 is now UP, status = UNKNOWN
node-2/system.log:Nov 06 10:39:50 node-2 scylla[8019]: [shard 0] gossip - InetAddress 10.4.0.25 is now UP, status = UNKNOWN
node-2/system.log:Nov 06 10:44:18 node-2 scylla[8019]: [shard 0] gossip - InetAddress 10.4.2.124 is now UP, status = UNKNOWN
node-2/system.log:Nov 06 10:48:49 node-2 scylla[8019]: [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = UNKNOWN
node-2/system.log:Nov 06 12:06:19 node-2 scylla[8019]: [shard 0] gossip - InetAddress 10.4.1.158 is now DOWN, status = NORMAL
node-2/system.log:Nov 06 12:15:24 node-2 scylla[8019]: [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = NORMAL
node-2/system.log:Nov 06 12:41:27 node-2 scylla[8019]: [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL
node-3/system.log:Nov 06 10:36:00 node-3 scylla[8094]: [shard 0] gossip - InetAddress 10.4.3.39 is now UP, status = NORMAL
node-3/system.log:Nov 06 10:36:00 node-3 scylla[8094]: [shard 0] gossip - InetAddress 10.4.3.87 is now UP, status = NORMAL
node-3/system.log:Nov 06 10:39:50 node-3 scylla[8094]: [shard 0] gossip - InetAddress 10.4.0.25 is now UP, status = UNKNOWN
node-3/system.log:Nov 06 10:44:18 node-3 scylla[8094]: [shard 0] gossip - InetAddress 10.4.2.124 is now UP, status = UNKNOWN
node-3/system.log:Nov 06 10:48:48 node-3 scylla[8094]: [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = UNKNOWN
node-3/system.log:Nov 06 12:06:20 node-3 scylla[8094]: [shard 0] gossip - InetAddress 10.4.1.158 is now DOWN, status = NORMAL
node-3/system.log:Nov 06 12:15:20 node-3 scylla[8094]: [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = NORMAL
node-3/system.log:Nov 06 12:41:26 node-3 scylla[8094]: [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL
node-4/system.log:Nov 06 10:39:49 node-4 scylla[8130]: [shard 0] gossip - InetAddress 10.4.3.39 is now UP, status = NORMAL
node-4/system.log:Nov 06 10:39:49 node-4 scylla[8130]: [shard 0] gossip - InetAddress 10.4.3.87 is now UP, status = NORMAL
node-4/system.log:Nov 06 10:39:49 node-4 scylla[8130]: [shard 0] gossip - InetAddress 10.4.0.230 is now UP, status = NORMAL
node-4/system.log:Nov 06 10:44:18 node-4 scylla[8130]: [shard 0] gossip - InetAddress 10.4.2.124 is now UP, status = UNKNOWN
node-4/system.log:Nov 06 10:48:49 node-4 scylla[8130]: [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = UNKNOWN
node-4/system.log:Nov 06 12:06:19 node-4 scylla[8130]: [shard 0] gossip - InetAddress 10.4.1.158 is now DOWN, status = NORMAL
node-4/system.log:Nov 06 12:15:18 node-4 scylla[8130]: [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = NORMAL
node-4/system.log:Nov 06 12:41:27 node-4 scylla[8130]: [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL
node-5/system.log:Nov 06 10:44:17 node-5 scylla[8215]: [shard 0] gossip - InetAddress 10.4.3.39 is now UP, status = NORMAL
node-5/system.log:Nov 06 10:44:17 node-5 scylla[8215]: [shard 0] gossip - InetAddress 10.4.0.230 is now UP, status = NORMAL
node-5/system.log:Nov 06 10:44:17 node-5 scylla[8215]: [shard 0] gossip - InetAddress 10.4.0.25 is now UP, status = NORMAL
node-5/system.log:Nov 06 10:44:17 node-5 scylla[8215]: [shard 0] gossip - InetAddress 10.4.3.87 is now UP, status = NORMAL
node-5/system.log:Nov 06 10:48:49 node-5 scylla[8215]: [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = UNKNOWN
node-5/system.log:Nov 06 12:06:19 node-5 scylla[8215]: [shard 0] gossip - InetAddress 10.4.1.158 is now DOWN, status = NORMAL
node-5/system.log:Nov 06 12:15:17 node-5 scylla[8215]: [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = NORMAL
node-5/system.log:Nov 06 12:41:27 node-5 scylla[8215]: [shard 0] gossip - InetAddress 10.4.0.25 is now DOWN, status = NORMAL
node-5/system.log:Nov 06 12:41:27 node-5 scylla[8215]: [shard 0] gossip - InetAddress 10.4.3.39 is now DOWN, status = NORMAL
node-5/system.log:Nov 06 12:41:27 node-5 scylla[8215]: [shard 0] gossip - InetAddress 10.4.0.230 is now DOWN, status = NORMAL
node-5/system.log:Nov 06 12:41:27 node-5 scylla[8215]: [shard 0] gossip - InetAddress 10.4.3.87 is now DOWN, status = NORMAL
node-5/system.log:Nov 06 12:41:27 node-5 scylla[8215]: [shard 0] gossip - InetAddress 10.4.1.158 is now DOWN, status = NORMAL
node-6/system.log:Nov 06 10:48:48 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.3.39 is now UP, status = NORMAL
node-6/system.log:Nov 06 10:48:48 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.0.25 is now UP, status = NORMAL
node-6/system.log:Nov 06 10:48:48 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.0.230 is now UP, status = NORMAL
node-6/system.log:Nov 06 10:48:48 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.2.124 is now UP, status = NORMAL
node-6/system.log:Nov 06 10:48:48 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.3.87 is now UP, status = NORMAL
node-6/system.log:Nov 06 12:06:19 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.3.87 is now DOWN, status = NORMAL
node-6/system.log:Nov 06 12:06:19 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.0.230 is now DOWN, status = NORMAL
node-6/system.log:Nov 06 12:06:19 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.3.39 is now DOWN, status = NORMAL
node-6/system.log:Nov 06 12:06:21 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL
node-6/system.log:Nov 06 12:06:21 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.0.25 is now DOWN, status = NORMAL
node-6/system.log:Nov 06 12:15:17 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.0.25 is now UP, status = NORMAL
node-6/system.log:Nov 06 12:15:17 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.3.39 is now UP, status = NORMAL
node-6/system.log:Nov 06 12:15:17 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.0.230 is now UP, status = NORMAL
node-6/system.log:Nov 06 12:15:17 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.3.87 is now UP, status = NORMAL
node-6/system.log:Nov 06 12:15:27 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.2.124 is now UP, status = NORMAL
node-6/system.log:Nov 06 12:41:26 node-6 scylla[8236]: [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL
All nodes mark 10.4.2.124 (node-5) as DOWN and never bring it back; node-5 marks all the other nodes as DOWN and never brings them back UP.
To sum up: node-5 sends SYNs to the other nodes, they reach them and get handled, and the nodes send back ACKs, but none of the ACKs reach node-5, with no errors in the logs.
The last "message" from 10.4.3.39 on node-5 was
[shard 3] rpc - client 10.4.3.39:7001: client connection dropped: read: Connection reset by peer
It happened right after the .39 node was marked as DOWN.
template <typename MsgIn, typename... MsgOut>
auto send_message(messaging_service* ms, messaging_verb verb, msg_addr id, MsgOut&&... msg) {
    auto rpc_handler = ms->rpc()->make_client<MsgIn(MsgOut...)>(verb);
    using futurator = futurize<std::result_of_t<decltype(rpc_handler)(rpc_protocol::client&, MsgOut...)>>;
    if (ms->is_shutting_down()) {
        return futurator::make_exception_future(rpc::closed_error());
    }
    auto rpc_client_ptr = ms->get_rpc_client(verb, id);
    auto& rpc_client = *rpc_client_ptr;
    return rpc_handler(rpc_client, std::forward<MsgOut>(msg)...).handle_exception([ms = ms->shared_from_this(), id, verb, rpc_client_ptr = std::move(rpc_client_ptr)] (std::exception_ptr&& eptr) {
        ms->increment_dropped_messages(verb);
        if (try_catch<rpc::closed_error>(eptr)) {
            // This is a transport error
            ms->remove_error_rpc_client(verb, id);
            return futurator::make_exception_future(std::move(eptr));
        } else {
            // This is expected to be a rpc server error, e.g., the rpc handler throws a std::runtime_error.
            return futurator::make_exception_future(std::move(eptr));
        }
    });
}
the client is only "invalidated" on the messaging service on closed_error. Is "connection reset by peer" an instance of this one?
Looks like no
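For illustration only, a sketch (under the assumption that the reset surfaces as something other than rpc::closed_error, e.g. a std::system_error; not a proposed patch) of how the transport-error branch above could be widened so the cached client is also dropped in that case:

if (try_catch<rpc::closed_error>(eptr) || try_catch<std::system_error>(eptr)) {
    // Treat both as transport errors: forget this client so the next
    // send_message() opens a fresh connection instead of reusing the
    // broken one.
    ms->remove_error_rpc_client(verb, id);
    return futurator::make_exception_future(std::move(eptr));
}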
A-ha. The outgoing queue wasn't reporting anything but closed_error on error. Now it propagates the actual exception that happened back to the caller. send_message leaves the connection in the messaging service tables and continues using it.
No, something is still wrong
return when_all(dst.send(std::move(data), timeout, cancel), wait_for_reply<Serializer>(wait(), timeout, cancel, dst, msg_id, sig)).then([] (auto r) {
    std::get<0>(r).ignore_ready_future();
    return std::move(std::get<1>(r)); // return future of wait_for_reply
});
failure to send is always ignored by send_helper()...
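For illustration only, a sketch (same shape as the snippet above; an assumption about one possible direction, not the actual fix) where the send future is inspected instead of being unconditionally ignored, so a transport-level send failure is at least visible instead of degenerating into a reply timeout:

return when_all(dst.send(std::move(data), timeout, cancel),
                wait_for_reply<Serializer>(wait(), timeout, cancel, dst, msg_id, sig)).then([] (auto r) {
    auto& send_fut = std::get<0>(r);
    if (send_fut.failed()) {
        // The send never went out, so the reply will never arrive; real handling
        // could log this or fail the call with the send error right away.
        std::exception_ptr send_err = send_fut.get_exception();
        (void)send_err; // placeholder: propagate or log in a real change
    } else {
        send_fut.ignore_ready_future();
    }
    return std::move(std::get<1>(r)); // still return the reply future
});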