
Target node of NetworkStartStopInterface nemesis lost connectivity with other cluster nodes after stop/start of a network interface

Open enaydanov opened this issue 3 years ago • 6 comments

Installation details

  • Kernel Version: 5.15.0-1019-aws
  • Scylla version (or git commit hash): 5.1.0~rc1-20220902.d10aee15e7e9 with build-id c127c717ecffa082ce97b94100d62b2549abe486
  • Relocatable Package: http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.1/scylla-x86_64-package-5.1.0-rc1.0.20220902.d10aee15e7e9.tar.gz
  • Cluster size: 6 nodes (i3.2xlarge)

Scylla Nodes used in this run:

  • longevity-200gb-48h-network-2022-2-db-node-a8763342-6 (52.50.122.8 | 10.4.2.245) (shards: 7)
  • longevity-200gb-48h-network-2022-2-db-node-a8763342-5 (34.254.31.220 | 10.4.3.132) (shards: 7)
  • longevity-200gb-48h-network-2022-2-db-node-a8763342-4 (3.250.254.20 | 10.4.3.7) (shards: 7)
  • longevity-200gb-48h-network-2022-2-db-node-a8763342-3 (63.34.238.51 | 10.4.2.46) (shards: 7)
  • longevity-200gb-48h-network-2022-2-db-node-a8763342-2 (52.31.220.37 | 10.4.1.211) (shards: 7)
  • longevity-200gb-48h-network-2022-2-db-node-a8763342-1 (54.77.2.178 | 10.4.3.65) (shards: 7)

OS / Image: ami-03f1a2d7537cf2e31 (aws: eu-west-1)

Test: longevity-200gb-48h-network-monkey-test
Test id: a8763342-4c58-4f97-849c-91f14d075d82
Test name: enterprise-2022.2/longevity/longevity-200gb-48h-network-monkey-test
Test config file(s):

Issue description

At 14:19:20, eth1 on node-2 was brought down, and at 14:20:20 it was brought back up:

2022-09-18T14:19:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2   !NOTICE | sudo[13483]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/sbin/ip link set eth1 down
2022-09-18T14:19:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2     !INFO | systemd-networkd[424]: eth1: Link DOWN
2022-09-18T14:19:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2     !INFO | systemd-networkd[424]: eth1: Lost carrier
2022-09-18T14:20:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2   !NOTICE | sudo[13504]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/sbin/ip link set eth1 up
2022-09-18T14:20:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2     !INFO | systemd-networkd[424]: eth1: Link UP
2022-09-18T14:20:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2     !INFO | kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
2022-09-18T14:20:20+00:00 longevity-200gb-48h-network-2022-2-db-node-a8763342-2     !INFO | systemd-networkd[424]: eth1: Gained carrier
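The nemesis step itself amounts to the two ip link commands visible in the log. A minimal dry-run sketch (interface name and timing taken from the log above; DRY_RUN only prints the commands, since toggling a link needs root):

```shell
# Dry-run sketch of the NetworkStartStopInterface nemesis step.
IFACE=eth1
DRY_RUN=1

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else sudo "$@"; fi
}

run ip link set "$IFACE" down   # 14:19:20 in the log above
# the nemesis kept the link down for about a minute here
run ip link set "$IFACE" up     # 14:20:20 in the log above
```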

But Scylla on this node was unable to restore connectivity with the other nodes:

$ /usr/bin/nodetool -u cassandra -pw 'cassandra' status

Datacenter: eu-west                                                                                                       
===================                                                                                                       
Status=Up/Down                                                                                                            
|/ State=Normal/Leaving/Joining/Moving                                                                                    
--  Address     Load       Tokens       Owns    Host ID                               Rack                                
UN  10.4.3.132  162.72 GB  256          ?       6c58c3f8-92e2-42f8-8235-01e989d86a69  1a                                  
UN  10.4.2.245  160.35 GB  256          ?       fddf7a4d-219f-4d41-9320-744386cca8a5  1a                                  
UN  10.4.3.7    161.3 GB   256          ?       6c9e7b70-1bb1-4e34-940e-97e5303cd4c4  1a                                  
UN  10.4.3.65   161.89 GB  256          ?       c2540cc5-a3e0-4b00-9258-c925ed6bc68b  1a                                  
DN  10.4.1.211  156.63 GB  256          ?       e616e82a-6cb8-423d-a6b6-9a43c68f5239  1a                                  
UN  10.4.2.46   157.83 GB  256          ?       baf5e471-f458-4eeb-9a95-3d509a7bc696  1a                                  
                                                                                                                          
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless       

But on node-2:

$ /usr/bin/nodetool -u cassandra -pw 'cassandra' status

Datacenter: eu-west                                                                                                                  
===================                                                                                                                  
Status=Up/Down                                                                                                                       
|/ State=Normal/Leaving/Joining/Moving                                                                                               
--  Address     Load       Tokens       Owns    Host ID                               Rack                                           
DN  10.4.3.132  156.11 GB  256          ?       6c58c3f8-92e2-42f8-8235-01e989d86a69  1a                                             
DN  10.4.2.245  152.54 GB  256          ?       fddf7a4d-219f-4d41-9320-744386cca8a5  1a                                             
DN  10.4.3.7    155.06 GB  256          ?       6c9e7b70-1bb1-4e34-940e-97e5303cd4c4  1a                                             
DN  10.4.3.65   155.23 GB  256          ?       c2540cc5-a3e0-4b00-9258-c925ed6bc68b  1a                                             
UN  10.4.1.211  159.99 GB  256          ?       e616e82a-6cb8-423d-a6b6-9a43c68f5239  1a                                             
DN  10.4.2.46   149.72 GB  256          ?       baf5e471-f458-4eeb-9a95-3d509a7bc696  1a                                             
                                                                                                                                     
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless                  

There are also the following messages in Scylla's log on node-2:

Sep 18 14:20:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  90 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:22:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  180 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:23:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  270 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:25:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  360 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:26:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  450 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:28:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  540 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:29:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  630 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:31:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  720 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:32:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  810 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:34:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  900 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:35:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  990 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:37:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  1080 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:38:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  1170 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:40:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  1260 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:41:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  1350 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:43:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  1440 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:44:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  1530 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:46:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  1620 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:47:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  1710 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:49:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  1800 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:50:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  1890 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:52:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  1980 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:53:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  2070 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:55:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  2160 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:56:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  2250 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:58:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  2340 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 14:59:46 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  2430 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
Sep 18 15:01:16 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] service_level_controller - update_from_distributed_data: failed to update configuration for more than  2520 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
  • Restore Monitor Stack command: $ hydra investigate show-monitor a8763342-4c58-4f97-849c-91f14d075d82
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a8763342-4c58-4f97-849c-91f14d075d82

Logs:

Jenkins job URL

enaydanov avatar Sep 22 '22 08:09 enaydanov

According to QA this is not a new test, and it worked in previous releases (verified on 5.0).

slivne avatar Sep 22 '22 11:09 slivne

From logs:

Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 4] gossip - failure_detector_loop: Send echo to node 10.4.3.65, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 4] gossip - failure_detector_loop: Mark node 10.4.3.65 as DOWN
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] gossip - InetAddress 10.4.3.65 is now DOWN, status = NORMAL
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 2] gossip - failure_detector_loop: Send echo to node 10.4.3.132, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 2] gossip - failure_detector_loop: Mark node 10.4.3.132 as DOWN
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] gossip - InetAddress 10.4.3.132 is now DOWN, status = NORMAL
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 1] gossip - failure_detector_loop: Send echo to node 10.4.2.46, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 1] gossip - failure_detector_loop: Mark node 10.4.2.46 as DOWN
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] gossip - InetAddress 10.4.2.46 is now DOWN, status = NORMAL
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 3] gossip - failure_detector_loop: Send echo to node 10.4.3.7, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 3] gossip - failure_detector_loop: Mark node 10.4.3.7 as DOWN
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] gossip - InetAddress 10.4.3.7 is now DOWN, status = NORMAL
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 3] rpc - client 10.4.3.7:7001: client connection dropped: read: Connection reset by peer
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] gossip - failure_detector_loop: Send echo to node 10.4.2.245, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] gossip - failure_detector_loop: Mark node 10.4.2.245 as DOWN
Sep 18 14:19:43 longevity-200gb-48h-network-2022-2-db-node-a8763342-2 scylla[7989]:  [shard 0] gossip - InetAddress 10.4.2.245 is now DOWN, status = NORMAL

and no more messages from the gossiper after that.
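A quick way to confirm this from a saved journal is to count the failure-detector DOWN markings; a self-contained sketch (the inline heredoc and the /tmp path stand in for the real node-2 log):

```shell
# Count how many peers the gossiper marked DOWN in a saved log excerpt.
cat > /tmp/node2-gossip.log <<'EOF'
Sep 18 14:19:43 node-2 scylla[7989]: gossip - failure_detector_loop: Mark node 10.4.3.65 as DOWN
Sep 18 14:19:43 node-2 scylla[7989]: gossip - failure_detector_loop: Mark node 10.4.3.132 as DOWN
Sep 18 14:19:43 node-2 scylla[7989]: gossip - failure_detector_loop: Mark node 10.4.2.46 as DOWN
EOF
grep -c 'failure_detector_loop: Mark node .* as DOWN' /tmp/node2-gossip.log  # prints 3
```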

xemul avatar Sep 22 '22 11:09 xemul

Actually, since that time there are only messages from compaction, commitlog shutdown, and some LeveledManifest. Nothing more.

xemul avatar Sep 22 '22 11:09 xemul

@asias, how is the gossiper supposed to bring nodes back to the UP state? Does it wait for a message from them, or does it ping the nodes on its own?
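For context, the log pattern above is consistent with a plain echo-based detector round: ping every live peer, demote whoever times out, and end up with an empty live set once the NIC is down. A generic sketch (an illustration only, not Scylla's actual code; send_echo is faked via an UP_PEERS list):

```shell
# One echo-based failure-detector round: peers listed in UP_PEERS "answer",
# everyone else "times out" and is marked DOWN.
UP_PEERS=""   # with the NIC down, nobody answers

send_echo() {  # succeeds only if $1 is in UP_PEERS
  case " $UP_PEERS " in *" $1 "*) return 0;; *) return 1;; esac
}

round() {      # prints the peers that survived the round
  for peer in "$@"; do
    if send_echo "$peer"; then echo "$peer"
    else echo "Mark node $peer as DOWN" >&2
    fi
  done
}

live=$(round 10.4.3.65 10.4.3.132 10.4.2.46)
[ -z "$live" ] && echo "live_nodes=[] -> wait until live_nodes is not empty"
```

With an empty UP_PEERS the round marks every peer DOWN, which matches the "Wait until live_nodes=[] is not empty" lines in the second run below.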

xemul avatar Sep 22 '22 12:09 xemul

@enaydanov, can we re-run this with --logger-log-level gossip=debug?

xemul avatar Sep 22 '22 12:09 xemul

@xemul yep, I've started a run: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/evgeniy/job/longevity-200gb-48h-network-monkey-test/2/

enaydanov avatar Sep 22 '22 13:09 enaydanov

It failed too.

The target node for NetworkStartStopInterface was node-4 this time:

2022-09-22T16:21:34+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4   !NOTICE | sudo[13114]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/sbin/ip link set eth1 down
2022-09-22T16:21:34+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4     !INFO | systemd-networkd[426]: eth1: Link DOWN
2022-09-22T16:21:35+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4     !INFO | systemd-networkd[426]: eth1: Lost carrier
2022-09-22T16:29:55+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4   !NOTICE | sudo[13233]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/sbin/ip link set eth1 up
2022-09-22T16:29:55+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4     !INFO | systemd-networkd[426]: eth1: Link UP
2022-09-22T16:29:55+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4     !INFO | kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
2022-09-22T16:29:55+00:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4     !INFO | systemd-networkd[426]: eth1: Gained carrier

Gossip logs on this node at this time:

...
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[10.4.1.145, 10.4.2.154, 10.4.3.196, 10.4.0.248, 10.4.2.225], endpoints_to_talk_with=[{10.4.1.145}, {10.4.2.154}, {10.4.3.196}, {10.4.0.248}, {10.4.2.225}]
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.1.145}
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - apply_state_locally_endpoints=[10.4.0.248, 10.4.1.145, 10.4.2.225, 10.4.2.154]
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - apply_state_locally() took 1 ms
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Process gossip ack msg digests from node 10.4.1.145:0, ack_msg_digest=[10.4.2.247:1663856208:7836]
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Calling do_send_ack2_msg to node 10.4.1.145:0, ack_msg_digest=[10.4.2.247:1663856208:7836], ack2_msg=endpoint_state:{[10.4.2.247->HeartBeatState = { generation = 1663856208, version = 7837 }, AppStateMap =]}
Sep 22 16:21:34 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - No more queued gossip ack msg digests from node 10.4.1.145:0, ack_msg_digest={}, pending=false
Sep 22 16:21:35 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.2.154}
Sep 22 16:21:35 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 1] gossip - failure_detector_loop: Send echo to node 10.4.1.145, status = started
Sep 22 16:21:35 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 4] gossip - failure_detector_loop: Send echo to node 10.4.3.196, status = started
Sep 22 16:21:35 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 2] gossip - failure_detector_loop: Send echo to node 10.4.2.154, status = started
Sep 22 16:21:35 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - failure_detector_loop: Send echo to node 10.4.0.248, status = started
Sep 22 16:21:36 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 3] gossip - failure_detector_loop: Send echo to node 10.4.2.225, status = started
Sep 22 16:21:36 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.3.196}
Sep 22 16:21:37 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:21:38 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.2.225}
Sep 22 16:21:39 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[10.4.3.196, 10.4.0.248, 10.4.2.154, 10.4.2.225, 10.4.1.145], endpoints_to_talk_with=[{10.4.3.196}, {10.4.0.248}, {10.4.2.154}, {10.4.2.225}, {10.4.1.145}]
Sep 22 16:21:39 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.3.196}
Sep 22 16:21:40 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:21:41 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.2.154}
Sep 22 16:21:42 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.2.225}
Sep 22 16:21:43 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.1.145}
Sep 22 16:21:44 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[10.4.2.154, 10.4.1.145, 10.4.2.225, 10.4.0.248, 10.4.3.196], endpoints_to_talk_with=[{10.4.2.154}, {10.4.1.145}, {10.4.2.225}, {10.4.0.248}, {10.4.3.196}]
Sep 22 16:21:44 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.2.154}
Sep 22 16:21:45 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.1.145}
Sep 22 16:21:46 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.2.225}
Sep 22 16:21:47 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:21:48 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.3.196}
Sep 22 16:21:49 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[10.4.0.248, 10.4.2.225, 10.4.3.196, 10.4.2.154, 10.4.1.145], endpoints_to_talk_with=[{10.4.0.248}, {10.4.2.225}, {10.4.3.196}, {10.4.2.154}, {10.4.1.145}]
Sep 22 16:21:49 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:21:50 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.2.225}
Sep 22 16:21:51 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.3.196}
Sep 22 16:21:52 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.2.154}
Sep 22 16:21:53 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.1.145}
Sep 22 16:21:54 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[10.4.2.225, 10.4.1.145, 10.4.2.154, 10.4.0.248, 10.4.3.196], endpoints_to_talk_with=[{10.4.2.225}, {10.4.1.145}, {10.4.2.154}, {10.4.0.248}, {10.4.3.196}]
Sep 22 16:21:54 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.2.225}
Sep 22 16:21:55 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.1.145}
Sep 22 16:21:56 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.2.154}
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 1] gossip - failure_detector_loop: Send echo to node 10.4.1.145, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 1] gossip - failure_detector_loop: Mark node 10.4.1.145 as DOWN
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - InetAddress 10.4.1.145 is now DOWN, status = NORMAL
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 4] gossip - failure_detector_loop: Send echo to node 10.4.3.196, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 4] gossip - failure_detector_loop: Mark node 10.4.3.196 as DOWN
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - InetAddress 10.4.3.196 is now DOWN, status = NORMAL
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 2] gossip - failure_detector_loop: Send echo to node 10.4.2.154, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 2] gossip - failure_detector_loop: Mark node 10.4.2.154 as DOWN
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - InetAddress 10.4.2.154 is now DOWN, status = NORMAL
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - failure_detector_loop: Send echo to node 10.4.0.248, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - failure_detector_loop: Mark node 10.4.0.248 as DOWN
Sep 22 16:21:57 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - InetAddress 10.4.0.248 is now DOWN, status = NORMAL
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 3] gossip - failure_detector_loop: Send echo to node 10.4.2.225, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 3] gossip - failure_detector_loop: Mark node 10.4.2.225 as DOWN
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - InetAddress 10.4.2.225 is now DOWN, status = NORMAL
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - failure_detector_loop: previous_live_nodes=[10.4.0.248, 10.4.1.145, 10.4.2.154, 10.4.2.225, 10.4.3.196], current_live_nodes=[], nodes_down=[10.4.0.248, 10.4.1.145, 10.4.2.154, 10.4.2.225, 10.4.3.196]
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - failure_detector_loop: previous_live_nodes=[10.4.0.248, 10.4.1.145, 10.4.2.154, 10.4.2.225, 10.4.3.196], current_live_nodes=[], nodes_down=[10.4.0.248, 10.4.1.145, 10.4.2.154, 10.4.2.225, 10.4.3.196], version_before=21, version_after=21
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - failure_detector_loop: Wait until live_nodes=[] is not empty
Sep 22 16:21:58 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.3.196}
Sep 22 16:21:59 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - failure_detector_loop: Wait until live_nodes=[] is not empty
Sep 22 16:21:59 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[], endpoints_to_talk_with=[]
Sep 22 16:21:59 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - No live nodes yet: try initial contact point nodes={10.4.0.248}
Sep 22 16:21:59 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:22:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - failure_detector_loop: Wait until live_nodes=[] is not empty
Sep 22 16:22:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[], endpoints_to_talk_with=[]
Sep 22 16:22:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - No live nodes yet: try initial contact point nodes={10.4.0.248}
Sep 22 16:22:00 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:22:01 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - failure_detector_loop: Wait until live_nodes=[] is not empty
Sep 22 16:22:01 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[], endpoints_to_talk_with=[]
Sep 22 16:22:01 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - No live nodes yet: try initial contact point nodes={10.4.0.248}
Sep 22 16:22:01 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:22:02 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - failure_detector_loop: Wait until live_nodes=[] is not empty
Sep 22 16:22:02 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[], endpoints_to_talk_with=[]
Sep 22 16:22:02 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - No live nodes yet: try initial contact point nodes={10.4.0.248}
Sep 22 16:22:02 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Talk to live nodes: {10.4.0.248}
Sep 22 16:22:03 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - failure_detector_loop: Wait until live_nodes=[] is not empty
Sep 22 16:22:03 longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 scylla[8143]:  [shard 0] gossip - Set live nodes to talk: endpoint_state_map=6, all_live_nodes=[], endpoints_to_talk_with=[]
...

The last 4 messages repeat until the end of the log.

Installation details

Kernel version: 5.15.0-1019-aws
Scylla version (or git commit hash): 5.1.0~rc1-0.20220902.d10aee15e7e9 with build-id c127c717ecffa082ce97b94100d62b2549abe486
Relocatable package url: http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.1/scylla-x86_64-package-5.1.0-rc1.0.20220902.d10aee15e7e9.tar.gz
Cluster size: 6 nodes (i3.2xlarge)

Scylla running with shards number (live nodes):

- longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-1 (34.241.64.42 | 10.4.0.248): 7 shards
- longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-2 (34.250.224.180 | 10.4.2.154): 7 shards
- longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-3 (18.200.176.129 | 10.4.3.196): 7 shards
- longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-4 (52.213.54.137 | 10.4.2.247): 7 shards
- longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-5 (176.34.235.243 | 10.4.1.145): 7 shards
- longevity-200gb-48h-network-net-mon-db-node-ce5ddf02-6 (18.200.177.169 | 10.4.2.225): 7 shards

OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-03f1a2d7537cf2e31 (aws: eu-west-1)

Test: longevity-200gb-48h-network-monkey-test
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Restore Monitor Stack command: $ hydra investigate show-monitor ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c

Test id: ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c

Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c/20220922_170051/grafana-screenshot-longevity-200gb-48h-network-monkey-test-scylla-per-server-metrics-nemesis-20220922_170202-longevity-200gb-48h-network-net-mon-monitor-node-ce5ddf02-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c/20220922_170051/grafana-screenshot-overview-20220922_170051-longevity-200gb-48h-network-net-mon-monitor-node-ce5ddf02-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c/20220922_171052/db-cluster-ce5ddf02.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c/20220922_171052/loader-set-ce5ddf02.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c/20220922_171052/monitor-set-ce5ddf02.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/ce5ddf02-67cf-4194-9ae3-cc51e1b64a6c/20220922_171052/sct-runner-ce5ddf02.tar.gz

Jenkins job URL

enaydanov avatar Sep 25 '22 08:09 enaydanov

But Scylla on this node was not able to restore connectivity w/ other nodes:

this node == which node here? You mentioned node-2; from which node did you get the following nodetool status?

$ /usr/bin/nodetool -u cassandra -pw 'cassandra'

Datacenter: eu-west                                                                                                       
===================                                                                                                       
Status=Up/Down                                                                                                            
|/ State=Normal/Leaving/Joining/Moving                                                                                    
--  Address     Load       Tokens       Owns    Host ID                               Rack                                
UN  10.4.3.132  162.72 GB  256          ?       6c58c3f8-92e2-42f8-8235-01e989d86a69  1a                                  
UN  10.4.2.245  160.35 GB  256          ?       fddf7a4d-219f-4d41-9320-744386cca8a5  1a                                  
UN  10.4.3.7    161.3 GB   256          ?       6c9e7b70-1bb1-4e34-940e-97e5303cd4c4  1a                                  
UN  10.4.3.65   161.89 GB  256          ?       c2540cc5-a3e0-4b00-9258-c925ed6bc68b  1a                                  
DN  10.4.1.211  156.63 GB  256          ?       e616e82a-6cb8-423d-a6b6-9a43c68f5239  1a                                  
UN  10.4.2.46   157.83 GB  256          ?       baf5e471-f458-4eeb-9a95-3d509a7bc696  1a                                  
                                                                                                                          
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless       

asias avatar Sep 26 '22 07:09 asias

Is this a new test? Did this test pass previously?

asias avatar Sep 26 '22 07:09 asias

In the test performed here: https://github.com/scylladb/scylladb/issues/11609#issuecomment-1257148947

we stopped the NIC on node-4 at 2022-09-22T16:21:34+00:00 and brought it back up at 2022-09-22T16:29:55+00:00. Did we perform any other operations, e.g., iptables rules to block a specific node, or bringing down the NIC on other nodes?

asias avatar Sep 26 '22 07:09 asias

But Scylla on this node was not able to restore connectivity w/ other nodes:

this node == which node here? You mentioned node-2; from which node did you get the following nodetool status?


The rest of the nodes (node-1, node-3, etc.).

enaydanov avatar Sep 26 '22 10:09 enaydanov

Is this a new test? Did this test pass previously?

Yes, it passed on 5.0: http://argus.scylladb.com/test/86894c00-21ea-44a7-b6a8-932a2152d343/runs?additionalRuns[]=c8caff93-a98a-4963-b668-4d074b909494

enaydanov avatar Sep 26 '22 10:09 enaydanov

In the test performed here: #11609 (comment)

we stopped the NIC on node-4 at 2022-09-22T16:21:34+00:00 and brought it back up at 2022-09-22T16:29:55+00:00. Did we perform any other operations, e.g., iptables rules to block a specific node, or bringing down the NIC on other nodes?

According to SCT code, no. It should be just a sleep:

https://github.com/scylladb/scylla-cluster-tests/blob/37a53dcd742e0fe996b57ad0f8df7d9349e29e06/sdcm/nemesis.py#L3020

enaydanov avatar Sep 26 '22 10:09 enaydanov

[shard 0] gossip - Talk to live nodes: {10.4.0.248}

                    logger.debug("Talk to live nodes: {}", live_nodes);
                    for (auto& ep: live_nodes) {
                        // Do it in the background.
                        (void)do_gossip_to_live_member(message, ep).handle_exception([] (auto ep) {
                            logger.trace("Failed to do_gossip_to_live_member: {}", ep);
                        });
                    }

:disappointed: no gossiping errors are ever reported to logs/stats/metrics/whatever

xemul avatar Sep 26 '22 11:09 xemul

Checked on c5285bcb143a34b9bca6a6d167624237c994052d. After I bring the interface back up with ip l s veth1 up, the gossiper eventually marks the node Up-Normal. There must be something specific to the setup, @enaydanov. Did anything change in the test or node setup since then?

xemul avatar Sep 26 '22 13:09 xemul

Same for d10aee15e7e98354897ee18c58cf8d5fa0677212

xemul avatar Sep 26 '22 13:09 xemul

@xemul Are you aware of any RPC issue that could make the send of a one-way message hang forever in 5.1?

asias avatar Sep 27 '22 11:09 asias

@enaydanov, can we re-run this with --logger-log-level gossip=trace, please?

xemul avatar Oct 19 '22 09:10 xemul

ping @enaydanov

fgelcer avatar Nov 03 '22 18:11 fgelcer

Started the run: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/evgeniy/job/longevity-200gb-48h-network-monkey-test/3/

enaydanov avatar Nov 06 '22 10:11 enaydanov

@xemul, please, take a look. It's a run with --logger-log-level gossip=trace

Installation details

Kernel version: 5.15.0-1021-aws
Scylla version (or git commit hash): 2022.2.0~rc3-0.20221009.994a5f0fbb4c with build-id 756ea8d62c25ed4acdf087054e11b3d07596a117
Relocatable package url: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc3.0.20221009.994a5f0fbb4c.tar.gz
Cluster size: 6 nodes (i3.2xlarge)

Scylla running with shards number (live nodes):

- longevity-200gb-48h-network-net-mon-db-node-2ae18c72-1 (54.171.53.38 | 10.4.3.39): 7 shards
- longevity-200gb-48h-network-net-mon-db-node-2ae18c72-2 (34.254.138.201 | 10.4.3.87): 7 shards
- longevity-200gb-48h-network-net-mon-db-node-2ae18c72-3 (54.78.104.138 | 10.4.0.230): 7 shards
- longevity-200gb-48h-network-net-mon-db-node-2ae18c72-4 (52.19.158.242 | 10.4.0.25): 7 shards
- longevity-200gb-48h-network-net-mon-db-node-2ae18c72-5 (52.215.84.146 | 10.4.2.124): 7 shards
- longevity-200gb-48h-network-net-mon-db-node-2ae18c72-6 (54.229.105.254 | 10.4.1.158): 7 shards

OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0b9c9dd9d3af4cec6 (aws: eu-west-1)

Test: longevity-200gb-48h-network-monkey-test
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Issue description

Restore Monitor Stack command: $ hydra investigate show-monitor 2ae18c72-dee5-46da-a555-a41f33156257
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 2ae18c72-dee5-46da-a555-a41f33156257

Test id: 2ae18c72-dee5-46da-a555-a41f33156257

Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_132059/grafana-screenshot-longevity-200gb-48h-network-monkey-test-scylla-per-server-metrics-nemesis-20221106_132216-longevity-200gb-48h-network-net-mon-monitor-node-2ae18c72-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_132059/grafana-screenshot-overview-20221106_132059-longevity-200gb-48h-network-net-mon-monitor-node-2ae18c72-1.png
critical - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/critical-2ae18c72.log.tar.gz
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/db-cluster-2ae18c72.tar.gz
debug - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/debug-2ae18c72.log.tar.gz
email_data - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/email_data-2ae18c72.json.tar.gz
error - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/error-2ae18c72.log.tar.gz
event - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/events-2ae18c72.log.tar.gz
left_processes - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/left_processes-2ae18c72.log.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/loader-set-2ae18c72.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/monitor-set-2ae18c72.tar.gz
normal - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/normal-2ae18c72.log.tar.gz
output - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/output-2ae18c72.log.tar.gz
event - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/raw_events-2ae18c72.log.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/sct-2ae18c72.log.tar.gz
summary - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/summary-2ae18c72.log.tar.gz
warning - https://cloudius-jenkins-test.s3.amazonaws.com/2ae18c72-dee5-46da-a555-a41f33156257/20221106_133135/warning-2ae18c72.log.tar.gz

Jenkins job URL

enaydanov avatar Nov 08 '22 13:11 enaydanov

@enaydanov IIUC it reproduced the issue?

roydahan avatar Nov 08 '22 13:11 roydahan

OK, so this is what I see in the logs of node-1 (broadcast_address=10.4.3.39):

12:41:27 :  [shard 3] gossip - failure_detector_loop: Send echo to node 10.4.2.124, status = failed: seastar::rpc::timeout_error (rpc call timed out)
12:41:27 :  [shard 3] gossip - failure_detector_loop: Mark node 10.4.2.124 as DOWN
12:41:27 :  [shard 0] gossip - marking as down 10.4.2.124
12:41:27 :  [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL

It marked node-5 (10.4.2.124) as DOWN and (!) never marked it back UP since then. However, it did receive a SYN msg from node-5 a bit later and tried to handle it:

12:41:28 : [shard 0] gossip - handle_syn_msg():from=10.4.2.124:0,cluster_name:peer=longevity-200gb-48h-network-net-mon-db-cluster-2ae18c72,local=longevity-200gb-4>
12:41:28 : [shard 0] gossip - Process gossip syn msg from node 10.4.2.124:0, syn_msg=cluster_id:longevity-200gb-48h-network-net-mon-db-cluster-2ae18c72,partioner:>
...
12:41:28 : [shard 0] gossip - Calling do_send_ack_msg to node 10.4.2.124:0, syn_msg=cluster_id:longevity-200gb-48h-network-net-mon-db-cluster-2ae18c72,partioner:o>

and it continued receiving SYN messages from it many times since then:

12:41:29 :  [shard 0] gossip - handle_syn_msg():from=10.4.2.124:0,cluster_name:peer=longevity-200gb-48h-network-net-mon-db-cluster-2ae18c72,local=longevity-200gb-4>
12:41:29 :  [shard 0] gossip - Queue gossip syn msg from node 10.4.2.124:0, syn_msg=cluster_id:longevity-200gb-48h-network-net-mon-db-cluster-2ae18c72,partioner:or>

but with no progress.

In node-5's logs I see symmetrical attempts to send SYNs, but no sign of any ACKs being received:

12:41:27 : [shard 0] gossip - InetAddress 10.4.3.39 is now DOWN, status = NORMAL
...
12:41:28 :  [shard 0] gossip - Talk to live nodes: {10.4.3.39}
12:41:28 :  [shard 0] gossip - Sending a GossipDigestSyn to 10.4.3.39:0 ...

xemul avatar Nov 08 '22 15:11 xemul

State transitions (from all logs):

node-1/system.log:Nov 06 10:31:28 node-1 scylla[7789]:  [shard 0] gossip - InetAddress 10.4.3.87 is now UP, status = UNKNOWN
node-1/system.log:Nov 06 10:36:00 node-1 scylla[7789]:  [shard 0] gossip - InetAddress 10.4.0.230 is now UP, status = UNKNOWN
node-1/system.log:Nov 06 10:39:49 node-1 scylla[7789]:  [shard 0] gossip - InetAddress 10.4.0.25 is now UP, status = UNKNOWN
node-1/system.log:Nov 06 10:44:17 node-1 scylla[7789]:  [shard 0] gossip - InetAddress 10.4.2.124 is now UP, status = UNKNOWN
node-1/system.log:Nov 06 10:48:48 node-1 scylla[7789]:  [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = UNKNOWN
node-1/system.log:Nov 06 12:06:20 node-1 scylla[7789]:  [shard 0] gossip - InetAddress 10.4.1.158 is now DOWN, status = NORMAL
node-1/system.log:Nov 06 12:15:17 node-1 scylla[7789]:  [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = NORMAL
node-1/system.log:Nov 06 12:41:27 node-1 scylla[7789]:  [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL
node-2/system.log:Nov 06 10:31:28 node-2 scylla[8019]:  [shard 0] gossip - InetAddress 10.4.3.39 is now UP, status = NORMAL
node-2/system.log:Nov 06 10:36:01 node-2 scylla[8019]:  [shard 0] gossip - InetAddress 10.4.0.230 is now UP, status = UNKNOWN
node-2/system.log:Nov 06 10:39:50 node-2 scylla[8019]:  [shard 0] gossip - InetAddress 10.4.0.25 is now UP, status = UNKNOWN
node-2/system.log:Nov 06 10:44:18 node-2 scylla[8019]:  [shard 0] gossip - InetAddress 10.4.2.124 is now UP, status = UNKNOWN
node-2/system.log:Nov 06 10:48:49 node-2 scylla[8019]:  [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = UNKNOWN
node-2/system.log:Nov 06 12:06:19 node-2 scylla[8019]:  [shard 0] gossip - InetAddress 10.4.1.158 is now DOWN, status = NORMAL
node-2/system.log:Nov 06 12:15:24 node-2 scylla[8019]:  [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = NORMAL
node-2/system.log:Nov 06 12:41:27 node-2 scylla[8019]:  [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL
node-3/system.log:Nov 06 10:36:00 node-3 scylla[8094]:  [shard 0] gossip - InetAddress 10.4.3.39 is now UP, status = NORMAL
node-3/system.log:Nov 06 10:36:00 node-3 scylla[8094]:  [shard 0] gossip - InetAddress 10.4.3.87 is now UP, status = NORMAL
node-3/system.log:Nov 06 10:39:50 node-3 scylla[8094]:  [shard 0] gossip - InetAddress 10.4.0.25 is now UP, status = UNKNOWN
node-3/system.log:Nov 06 10:44:18 node-3 scylla[8094]:  [shard 0] gossip - InetAddress 10.4.2.124 is now UP, status = UNKNOWN
node-3/system.log:Nov 06 10:48:48 node-3 scylla[8094]:  [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = UNKNOWN
node-3/system.log:Nov 06 12:06:20 node-3 scylla[8094]:  [shard 0] gossip - InetAddress 10.4.1.158 is now DOWN, status = NORMAL
node-3/system.log:Nov 06 12:15:20 node-3 scylla[8094]:  [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = NORMAL
node-3/system.log:Nov 06 12:41:26 node-3 scylla[8094]:  [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL
node-4/system.log:Nov 06 10:39:49 node-4 scylla[8130]:  [shard 0] gossip - InetAddress 10.4.3.39 is now UP, status = NORMAL
node-4/system.log:Nov 06 10:39:49 node-4 scylla[8130]:  [shard 0] gossip - InetAddress 10.4.3.87 is now UP, status = NORMAL
node-4/system.log:Nov 06 10:39:49 node-4 scylla[8130]:  [shard 0] gossip - InetAddress 10.4.0.230 is now UP, status = NORMAL
node-4/system.log:Nov 06 10:44:18 node-4 scylla[8130]:  [shard 0] gossip - InetAddress 10.4.2.124 is now UP, status = UNKNOWN
node-4/system.log:Nov 06 10:48:49 node-4 scylla[8130]:  [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = UNKNOWN
node-4/system.log:Nov 06 12:06:19 node-4 scylla[8130]:  [shard 0] gossip - InetAddress 10.4.1.158 is now DOWN, status = NORMAL
node-4/system.log:Nov 06 12:15:18 node-4 scylla[8130]:  [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = NORMAL
node-4/system.log:Nov 06 12:41:27 node-4 scylla[8130]:  [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL
node-5/system.log:Nov 06 10:44:17 node-5 scylla[8215]:  [shard 0] gossip - InetAddress 10.4.3.39 is now UP, status = NORMAL
node-5/system.log:Nov 06 10:44:17 node-5 scylla[8215]:  [shard 0] gossip - InetAddress 10.4.0.230 is now UP, status = NORMAL
node-5/system.log:Nov 06 10:44:17 node-5 scylla[8215]:  [shard 0] gossip - InetAddress 10.4.0.25 is now UP, status = NORMAL
node-5/system.log:Nov 06 10:44:17 node-5 scylla[8215]:  [shard 0] gossip - InetAddress 10.4.3.87 is now UP, status = NORMAL
node-5/system.log:Nov 06 10:48:49 node-5 scylla[8215]:  [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = UNKNOWN
node-5/system.log:Nov 06 12:06:19 node-5 scylla[8215]:  [shard 0] gossip - InetAddress 10.4.1.158 is now DOWN, status = NORMAL
node-5/system.log:Nov 06 12:15:17 node-5 scylla[8215]:  [shard 0] gossip - InetAddress 10.4.1.158 is now UP, status = NORMAL
node-5/system.log:Nov 06 12:41:27 node-5 scylla[8215]:  [shard 0] gossip - InetAddress 10.4.0.25 is now DOWN, status = NORMAL
node-5/system.log:Nov 06 12:41:27 node-5 scylla[8215]:  [shard 0] gossip - InetAddress 10.4.3.39 is now DOWN, status = NORMAL
node-5/system.log:Nov 06 12:41:27 node-5 scylla[8215]:  [shard 0] gossip - InetAddress 10.4.0.230 is now DOWN, status = NORMAL
node-5/system.log:Nov 06 12:41:27 node-5 scylla[8215]:  [shard 0] gossip - InetAddress 10.4.3.87 is now DOWN, status = NORMAL
node-5/system.log:Nov 06 12:41:27 node-5 scylla[8215]:  [shard 0] gossip - InetAddress 10.4.1.158 is now DOWN, status = NORMAL
node-6/system.log:Nov 06 10:48:48 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.3.39 is now UP, status = NORMAL
node-6/system.log:Nov 06 10:48:48 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.0.25 is now UP, status = NORMAL
node-6/system.log:Nov 06 10:48:48 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.0.230 is now UP, status = NORMAL
node-6/system.log:Nov 06 10:48:48 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.2.124 is now UP, status = NORMAL
node-6/system.log:Nov 06 10:48:48 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.3.87 is now UP, status = NORMAL
node-6/system.log:Nov 06 12:06:19 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.3.87 is now DOWN, status = NORMAL
node-6/system.log:Nov 06 12:06:19 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.0.230 is now DOWN, status = NORMAL
node-6/system.log:Nov 06 12:06:19 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.3.39 is now DOWN, status = NORMAL
node-6/system.log:Nov 06 12:06:21 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL
node-6/system.log:Nov 06 12:06:21 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.0.25 is now DOWN, status = NORMAL
node-6/system.log:Nov 06 12:15:17 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.0.25 is now UP, status = NORMAL
node-6/system.log:Nov 06 12:15:17 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.3.39 is now UP, status = NORMAL
node-6/system.log:Nov 06 12:15:17 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.0.230 is now UP, status = NORMAL
node-6/system.log:Nov 06 12:15:17 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.3.87 is now UP, status = NORMAL
node-6/system.log:Nov 06 12:15:27 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.2.124 is now UP, status = NORMAL
node-6/system.log:Nov 06 12:41:26 node-6 scylla[8236]:  [shard 0] gossip - InetAddress 10.4.2.124 is now DOWN, status = NORMAL

All nodes mark 10.4.2.124 (node-5) DOWN and never bring it back; node-5 marks all other nodes DOWN and never brings them back UP.

xemul avatar Nov 08 '22 16:11 xemul

To sum up: node-5 sends SYNs to the other nodes; they arrive and get handled, and the nodes send back ACKs, but none of the ACKs reach node-5, and there are no errors in the logs.

xemul avatar Nov 08 '22 16:11 xemul

The last "message" from 10.4.3.39 on node-5 was

[shard 3] rpc - client 10.4.3.39:7001: client connection dropped: read: Connection reset by peer

it happened right after the .39 node was marked as DOWN

xemul avatar Nov 08 '22 16:11 xemul

template <typename MsgIn, typename... MsgOut>
auto send_message(messaging_service* ms, messaging_verb verb, msg_addr id, MsgOut&&... msg) {
    auto rpc_handler = ms->rpc()->make_client<MsgIn(MsgOut...)>(verb);
    using futurator = futurize<std::result_of_t<decltype(rpc_handler)(rpc_protocol::client&, MsgOut...)>>;
    if (ms->is_shutting_down()) {
        return futurator::make_exception_future(rpc::closed_error());
    }
    auto rpc_client_ptr = ms->get_rpc_client(verb, id);
    auto& rpc_client = *rpc_client_ptr;
    return rpc_handler(rpc_client, std::forward<MsgOut>(msg)...).handle_exception([ms = ms->shared_from_this(), id, verb, rpc_client_ptr = std::move(rpc_client_ptr)] (std::exception_ptr&& eptr) {
        ms->increment_dropped_messages(verb);
        if (try_catch<rpc::closed_error>(eptr)) {
            // This is a transport error
            ms->remove_error_rpc_client(verb, id);
            return futurator::make_exception_future(std::move(eptr));
        } else {
            // This is expected to be a rpc server error, e.g., the rpc handler throws a std::runtime_error.
            return futurator::make_exception_future(std::move(eptr));
        }   
    });     
}

The client is only "invalidated" on the messaging service on closed_error. Is "connection reset by peer" an instance of that?

xemul avatar Nov 08 '22 16:11 xemul

Looks like no

xemul avatar Nov 08 '22 17:11 xemul

A-ha. The outgoing queue wasn't reporting anything but closed_error on error; now it propagates the actual exception that happened back to the caller. But send_message leaves the connection in the messaging service's tables and continues using it.

xemul avatar Nov 08 '22 18:11 xemul

No, something is still wrong

            return when_all(dst.send(std::move(data), timeout, cancel), wait_for_reply<Serializer>(wait(), timeout, cancel, dst, msg_id, sig)).then([] (auto r) {
                    std::get<0>(r).ignore_ready_future();
                    return std::move(std::get<1>(r)); // return future of wait_for_reply
            });

failure to send is always ignored by send_helper()...

xemul avatar Nov 08 '22 19:11 xemul