
driver reports connection errors for decommissioned nodes and a delay for cassandra-stress, resulting in: "Command did not complete within 12870 seconds!"

Open yarongilor opened this issue 11 months ago • 7 comments

Packages

Scylla version: 6.3.0~dev-20241217.01cdba9a9894 with build-id f5cdbc08a2634f6f378e901fbb10a27fc164783e

Kernel Version: 6.8.0-1018-azure

Issue description

  • [ ] This issue is a regression.
  • [ ] It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

Ran a 3-hour longevity test on Azure. The node longevity-10gb-3h-master-db-node-413a3a9b-eastus-2 was decommissioned at around 02:00:

disrupt_nodetool_seed_decommission	longevity-10gb-3h-master-db-node-413a3a9b-eastus-2	Succeeded	2024-12-18 01:54:18	2024-12-18 02:07:15
Nemesis Information
Class: Sisyphus
Name: disrupt_nodetool_seed_decommission
Status: Succeeded
Target Information
Name: longevity-10gb-3h-master-db-node-413a3a9b-eastus-2
Public IP: null
Private IP: 10.0.0.6
State: terminated
Shards: 7

About 2 hours later, two nemeses that each perform a rolling restart of all cluster nodes were executed:

disrupt_rolling_restart_cluster	longevity-10gb-3h-master-db-node-413a3a9b-eastus-11	Succeeded	2024-12-18 03:55:11	2024-12-18 03:57:50
disrupt_rolling_config_change_internode_compression	longevity-10gb-3h-master-db-node-413a3a9b-eastus-4	Succeeded	2024-12-18 04:00:20	2024-12-18 04:08:15

Then c-s reported errors for node-2 (which had been decommissioned ~2 hours earlier), such as:

WARN  [cluster1-nio-worker-2] 2024-12-18 03:56:11,975 DefaultPromise.java:593 - An exception was thrown by com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete()
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
java.lang.AssertionError: null
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.driver.core.ConvictionPolicy$DefaultConvictionPolicy.signalConnectionFailure(ConvictionPolicy.java:101)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.driver.core.Connection.defunct(Connection.java:812)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete(Connection.java:1667)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete(Connection.java:1657)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:625)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:105)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
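
One way to tell whether the driver actually forgot the decommissioned host is to register a Host.StateListener on the Cluster and watch for onRemove()/onDown() events. The sketch below uses the plain 3.x driver API (in cassandra-stress only netty is shaded; the driver classes stay under com.datastax.driver.core, as the traces above show); the contact-point IPs are taken from this run and are illustrative only:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Session;

public class HostEventLogger implements Host.StateListener {
    @Override public void onAdd(Host host)    { System.out.println("ADD    " + host.getAddress()); }
    @Override public void onUp(Host host)     { System.out.println("UP     " + host.getAddress()); }
    @Override public void onDown(Host host)   { System.out.println("DOWN   " + host.getAddress()); }
    @Override public void onRemove(Host host) { System.out.println("REMOVE " + host.getAddress()); }
    @Override public void onRegister(Cluster cluster)   { }
    @Override public void onUnregister(Cluster cluster) { }

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoints("10.0.0.5", "10.0.0.7")   // live nodes from this run
                .build();
        cluster.register(new HostEventLogger());
        Session session = cluster.connect();
        // If the decommissioned node (10.0.0.6) is never reported via onRemove(),
        // the driver still tracks it and will keep scheduling reconnects to it.
        session.close();
        cluster.close();
    }
}

If onRemove() does fire at decommission time, later errors for the same IP would point at the new node reusing that address rather than at stale driver state.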

Multiple connection errors (which might be expected) were reported for some other nodes as well during these rolling restarts:

WARN  [cluster1-nio-worker-5] 2024-12-18 03:55:13,594 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
WARN  [cluster1-nio-worker-6] 2024-12-18 03:55:15,606 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
total,     769417033,   78364,   78364,   78364,    12.8,     7.8,    40.2,    62.0,    90.7,   148.6, 9965.0,  0.00209,      0,      0,       0,       0,       0,       0
WARN  [cluster1-nio-worker-0] 2024-12-18 03:55:19,611 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042
...
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.9:9042] Connection has been closed
...
WARN  [cluster1-nio-worker-0] 2024-12-18 03:55:30,963 Connection.java:284 - Error creating netty channel to /10.0.0.9:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.9:9042
Caused by: java.net.ConnectException: Connection refused

More unclear connection issues were found for node-10, which was decommissioned at 03:22:

longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 [None | 10.0.0.14]
live time:
	2024-12-18 03:12:00	2024-12-18 03:20:08

node-10 removal:

< t:2024-12-18 03:19:34,951 f:cluster.py      l:1169 c:sdcm.cluster_azure   p:INFO  > Node longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 [None | 10.0.0.14]: Set termination_event
< t:2024-12-18 03:20:08,988 f:cluster.py      l:1231 c:sdcm.cluster         p:INFO  > Node longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 [None | 10.0.0.14] destroyed

connection errors:

WARN  [cluster1-nio-worker-2] 2024-12-18 04:16:37,254 Connection.java:284 - Error creating netty channel to /10.0.0.14:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.0.0.14:9042
Caused by: java.net.NoRouteToHostException: No route to host
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
...

WARN  [cluster1-nio-worker-3] 2024-12-18 04:26:40,326 Connection.java:284 - Error creating netty channel to /10.0.0.14:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.0.0.14:9042
Caused by: java.net.NoRouteToHostException: No route to host
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
...

WARN  [cluster1-nio-worker-5] 2024-12-18 04:36:43,398 Connection.java:284 - Error creating netty channel to /10.0.0.14:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.0.0.14:9042
Caused by: java.net.NoRouteToHostException: No route to host
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
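
The three attempts above are almost exactly 10 minutes apart (04:16:37, 04:26:40, 04:36:43), which looks like a reconnection schedule that has reached its maximum delay; as far as I know the 3.x driver defaults to an exponential reconnection policy capped at 10 minutes. A minimal sketch of how such a policy is configured (the values here are assumptions, not taken from the test configuration):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.ExponentialReconnectionPolicy;

public class ReconnectionConfig {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.5")   // any live node from this run
                // Base delay 1 s, capped at 10 min (believed to be the 3.x default).
                // A 10-minute cap would produce the ~10-minute spacing of the
                // reconnect attempts to 10.0.0.14 seen above.
                .withReconnectionPolicy(new ExponentialReconnectionPolicy(1_000, 600_000))
                .build();
        cluster.close();
    }
}

Note that the policy only controls how often reconnects are attempted; attempts stop only once the host is removed from the driver's metadata.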

node-10's private address also appears in the Scylla logs long afterwards:

< t:2024-12-18 03:55:54,995 f:db_log_reader.py l:125  c:sdcm.db_log_reader   p:DEBUG > 2024-12-18T03:55:54.979+00:00 longevity-10gb-3h-master-db-node-413a3a9b-eastus-7     !INFO | scylla[5195]:  [shard 0:main] messaging_service - Stopping client for address: 10.0.0.14:0
< t:2024-12-18 03:55:55,016 f:db_log_reader.py l:125  c:sdcm.db_log_reader   p:DEBUG > 2024-12-18T03:55:54.980+00:00 longevity-10gb-3h-master-db-node-413a3a9b-eastus-7     !INFO | scylla[5195]:  [shard 0:main] messaging_service - Stopping client for address: 10.0.0.14:0 - Done

cassandra-stress eventually failed the test with the following error (the 12870-second timeout corresponds to the reported 3h34m30s duration):

2024-12-18 04:43:29.357: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=36f22416-2cde-4c0c-bccd-2deeac0e027f duration=3h34m30s: node=Node longevity-10gb-3h-master-loader-node-413a3a9b-eastus-1 [None | 10.0.0.11]
stress_cmd=cassandra-stress write cl=QUORUM duration=180m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=1000 -pop seq=1..10000000 -log interval=5
errors:
Stress command execution failed with: Command did not complete within 12870 seconds!
Command: 'sudo  docker exec 6b4a56b5947dea0840cfcb21e1e974a93072cc15cb75c103041fb8075e6f6db5 /bin/sh -c \'echo TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=HITDHJBU91WD2JHM5LG4; cassandra-stress write no-warmup cl=QUORUM duration=180m -schema keyspace=keyspace1 \'"\'"\'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)\'"\'"\' -mode cql3 native -rate threads=1000 -pop seq=1..10000000 -log interval=5 -node 10.0.0.5,10.0.0.6,10.0.0.7,10.0.0.8,10.0.0.9,10.0.0.10 -errors skip-unsupported-columns\''
Stdout:
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
INFO  [cluster1-reconnection-1] 2024-12-18 04:37:08,585 HostConnectionPool.java:200 - Using advanced port-based shard awareness with /10.0.0.8:9042
Stderr:
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
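
For reference, the "-node" list handed to cassandra-stress becomes the driver's contact points. A rough sketch of the equivalent connection setup (3.x API, addresses copied from the command above) illustrates that contact points are only used for the initial connection; afterwards the driver works from its own host list, discovered via system.peers and topology events, so a stale or reused IP among the contact points is not by itself what drives the later reconnect attempts:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;

public class StressLikeConnect {
    public static void main(String[] args) {
        // Same address list cassandra-stress received via "-node"; 10.0.0.6 had already
        // been decommissioned (and its IP was later reused by a new node).
        Cluster cluster = Cluster.builder()
                .addContactPoints("10.0.0.5", "10.0.0.6", "10.0.0.7",
                                  "10.0.0.8", "10.0.0.9", "10.0.0.10")
                .withPort(9042)
                .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.QUORUM))
                .build();
        Session session = cluster.connect();
        // From here on, reconnection targets come from cluster.getMetadata().getAllHosts(),
        // not from the contact-point list above.
        session.close();
        cluster.close();
    }
}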

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-9 (null | 10.0.0.5) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-8 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-7 (null | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-6 (null | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-5 (null | 10.0.0.9) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-4 (null | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-3 (null | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-2 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-11 (null | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 (null | 10.0.0.14) (shards: -1)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-1 (null | 10.0.0.5) (shards: 7)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-6.3.0-dev-x86_64-2024-12-18T02-02-40 (azure: undefined_region)

Test: longevity-10gb-3h-azure-test Test id: 413a3a9b-fe7b-4e5e-b864-6f1f26628226 Test name: scylla-master/longevity/longevity-10gb-3h-azure-test Test method: longevity_test.LongevityTest.test_custom_time Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 413a3a9b-fe7b-4e5e-b864-6f1f26628226
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 413a3a9b-fe7b-4e5e-b864-6f1f26628226

Logs:

Jenkins job URL Argus

yarongilor commented on Dec 19 '24

Scylla version: 6.3.0~dev-20250115.72f28ce81e17 with build-id d583b7e768996945fc24bf7f697f5ec41306253a

Kernel Version: 6.8.0-1020-gcp

Issue description

Reproduced again. Cassandra stress test tries to connect to decommissioned nodes.

  • node 10.142.0.40 - terminated at 02:24:06 | connection attempt at 03:40:14
  • node 10.142.0.53 - terminated at 01:55:40 | connection attempt at 03:40:23
  • node 10.142.0.59 - terminated at 01:15:19 | connection attempt at 03:40:28

This list contains just three nodes for illustration; more nodes were terminated before the connection attempts. Please take a look at the logs for more details.
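
If this reproduces again, it might help to dump the host list the driver is actually tracking at that point; if the terminated addresses still appear there hours later, the timeouts above are just the driver servicing stale metadata entries. A minimal sketch (3.x API; the contact-point address is a placeholder for any live node in the GCE cluster):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;

public class DumpDriverHosts {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.142.0.2")   // hypothetical live node; replace as needed
                .build();
        cluster.init();   // opens the control connection and populates metadata
        for (Host host : cluster.getMetadata().getAllHosts()) {
            // Terminated nodes (e.g. 10.142.0.40 / .53 / .59) still listed here would
            // explain continued connection attempts to them.
            System.out.printf("%s  up=%s%n", host.getAddress(), host.isUp());
        }
        cluster.close();
    }
}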

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/609539021565457724 (NO RUNNER: NO RUNNER)

Test: longevity-10gb-3h-gce-test Test id: 09a348c2-c651-412a-8600-d824f4e088c4 Test name: scylla-master/longevity/longevity-10gb-3h-gce-test Test method: longevity_test.LongevityTest.test_custom_time Test config file: longevity-10gb-3h.yaml

Jenkins job URL Argus

WARN  [cluster1-nio-worker-0] 2025-01-16 03:40:18,779 Connection.java:284 - Error creating netty channel to /10.142.0.40:9042
com.datastax.shaded.netty.channel.ConnectTimeoutException: connection timed out: /10.142.0.40:9042
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261)
	at com.datastax.shaded.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at com.datastax.shaded.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at com.datastax.shaded.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at com.datastax.shaded.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

WARN  [cluster1-nio-worker-0] 2025-01-16 03:40:23,785 Connection.java:284 - Error creating netty channel to /10.142.0.53:9042
com.datastax.shaded.netty.channel.ConnectTimeoutException: connection timed out: /10.142.0.53:9042
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261)
	at com.datastax.shaded.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at com.datastax.shaded.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at com.datastax.shaded.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at com.datastax.shaded.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

WARN  [cluster1-nio-worker-7] 2025-01-16 03:40:28,792 Connection.java:284 - Error creating netty channel to /10.142.0.59:9042
com.datastax.shaded.netty.channel.ConnectTimeoutException: connection timed out: /10.142.0.59:9042
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261)
	at com.datastax.shaded.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at com.datastax.shaded.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at com.datastax.shaded.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at com.datastax.shaded.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

jsmolar commented on Jan 16 '25

The issue seems to reproduce now in a more specific scenario. It takes about half an hour; perhaps it can be narrowed down even further. The current scenario is:

  • decommission node-6
  • (Run disrupt_bootstrap_streaming_error; this step might be unneeded.)
  • Start rolling restart cluster (disrupt_rolling_restart_cluster).
  • The driver got "Connection has been closed" errors for node-6.

Hmm, I also see an unexpected log message about the decommissioned node restarting (or is it a different node with the same IP?):

< t:2025-01-26 13:34:38,694 f:db_log_reader.py l:125  c:sdcm.db_log_reader   p:DEBUG > 2025-01-26T13:34:36.966+00:00 longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-5     !INFO | scylla[5454]:  [shard 0: gms] gossip - Node 10.0.0.10 has restarted, now UP, status = NORMAL

It could be node-7:

< t:2025-01-26 13:15:09,905 f:cluster.py      l:446  c:sdcm.cluster_azure   p:DEBUG > Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-7 [None | 10.0.0.10]: SSH access -> 'ssh -i ~/.ssh/scylla_test_id_ed25519 [email protected]'
egrep "\[/10.0.0.10:9042\] Connection has been closed|sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey" csct 
< t:2025-01-26 13:06:06,709 f:tester.py       l:1207 c:LongevityTest        p:DEBUG > Nemesis threads [{'nemesis': <class 'sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey'>, 'nemesis_selector': []}]
< t:2025-01-26 13:06:56,977 f:nemesis.py      l:467  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Interval: 120 s
< t:2025-01-26 13:07:11,655 f:nemesis.py      l:416  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Target node pool type: NEMESIS_TARGET_POOLS.data_nodes
< t:2025-01-26 13:07:11,655 f:nemesis.py      l:1196 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Set current_disruption -> NodetoolDecommission
< t:2025-01-26 13:07:11,655 f:nemesis.py      l:459  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Current Target: Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-6 [None | 10.0.0.10] with running nemesis: NodetoolDecommission
< t:2025-01-26 13:07:11,724 f:nemesis.py      l:5474 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: >>>>>>>>>>>> Started disruption disrupt_nodetool_decommission (NodetoolDecommission nemesis) on the target node 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-6 [None | 10.0.0.10]' >>>>>>>>>>>>
< t:2025-01-26 13:08:38,108 f:nemesis.py      l:1330 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Adding 1 new nodes to cluster...
< t:2025-01-26 13:17:51,276 f:nemesis.py      l:5537 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: NodetoolDecommission duration -> 639 s
< t:2025-01-26 13:17:51,276 f:nemesis.py      l:362  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Update nemesis info with: {'start': 1737896831, 'end': 1737897471, 'duration': 639, 'node': 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-6 [None | 10.0.0.10]', 'subtype': 'end'}
< t:2025-01-26 13:17:51,820 f:nemesis.py      l:5552 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: log_info: {'start': 1737896831, 'end': 1737897471, 'duration': 639, 'node': 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-6 [None | 10.0.0.10]', 'subtype': 'end'}
< t:2025-01-26 13:17:52,163 f:nemesis.py      l:5561 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: <<<<<<<<<<<< Finished disruption disrupt_nodetool_decommission (NodetoolDecommission nemesis) with status 'succeeded' <<<<<<<<<<<<
< t:2025-01-26 13:18:23,120 f:nemesis.py      l:416  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Target node pool type: NEMESIS_TARGET_POOLS.data_nodes
< t:2025-01-26 13:18:23,120 f:nemesis.py      l:1196 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Set current_disruption -> BootstrapStreamingError
< t:2025-01-26 13:18:23,120 f:nemesis.py      l:459  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Current Target: Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-2 [None | 10.0.0.6] with running nemesis: BootstrapStreamingError
< t:2025-01-26 13:18:23,168 f:nemesis.py      l:5474 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: >>>>>>>>>>>> Started disruption disrupt_bootstrap_streaming_error (BootstrapStreamingError nemesis) on the target node 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-2 [None | 10.0.0.6]' >>>>>>>>>>>>
< t:2025-01-26 13:26:39,461 f:nemesis.py      l:5301 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Wait 5 minutes with new topology
< t:2025-01-26 13:31:39,533 f:nemesis.py      l:5303 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Decommission added node
< t:2025-01-26 13:33:35,527 f:nemesis.py      l:5537 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: BootstrapStreamingError duration -> 912 s
< t:2025-01-26 13:33:35,527 f:nemesis.py      l:362  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Update nemesis info with: {'start': 1737897503, 'end': 1737898415, 'duration': 912, 'node': 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-2 [None | 10.0.0.6]', 'subtype': 'end'}
< t:2025-01-26 13:33:35,585 f:nemesis.py      l:5552 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: log_info: {'start': 1737897503, 'end': 1737898415, 'duration': 912, 'node': 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-2 [None | 10.0.0.6]', 'subtype': 'end'}
< t:2025-01-26 13:33:35,924 f:nemesis.py      l:5561 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: <<<<<<<<<<<< Finished disruption disrupt_bootstrap_streaming_error (BootstrapStreamingError nemesis) with status 'succeeded' <<<<<<<<<<<<
< t:2025-01-26 13:34:06,615 f:nemesis.py      l:416  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Target node pool type: NEMESIS_TARGET_POOLS.data_nodes
< t:2025-01-26 13:34:06,615 f:nemesis.py      l:1196 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Set current_disruption -> RollingRestartCluster
< t:2025-01-26 13:34:06,615 f:nemesis.py      l:459  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Current Target: Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-7 [None | 10.0.0.10] with running nemesis: RollingRestartCluster
< t:2025-01-26 13:34:06,665 f:nemesis.py      l:5474 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: >>>>>>>>>>>> Started disruption disrupt_rolling_restart_cluster (RollingRestartCluster nemesis) on the target node 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-7 [None | 10.0.0.10]' >>>>>>>>>>>>
< t:2025-01-26 13:35:32,653 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.0.0.12>: com.datastax.driver.core.exceptions.TransportException: [/10.0.0.10:9042] Connection has been closed
< t:2025-01-26 13:35:32,653 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.0.0.12>: com.datastax.driver.core.exceptions.TransportException: [/10.0.0.10:9042] Connection has been closed
< t:2025-01-26 13:35:32,653 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.0.0.12>: com.datastax.driver.core.exceptions.TransportException: [/10.0.0.10:9042] Connection has been closed
< t:2025-01-26 13:35:32,653 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.0.0.12>: com.datastax.driver.core.exceptions.TransportException: [/10.0.0.10:9042] Connection has been closed

Packages

Scylla version: 2025.1.0~dev-20250125.f4b1ad43d4e7 with build-id f287d3c3c2679d52454be4519393ac84ecdaf2c8

Kernel Version: 6.8.0-1020-azure

Issue description

  • [ ] This issue is a regression.
  • [ ] It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-9 (null | 10.0.0.14) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-8 (null | 10.0.0.14) (shards: -1)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-7 (null | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-6 (null | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-5 (null | 10.0.0.9) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-4 (null | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-3 (null | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-2 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-16 (null | 10.0.0.9) (shards: -1)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-15 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-14 (null | 10.0.0.9) (shards: -1)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-13 (null | 10.0.0.14) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-12 (null | 10.0.0.9) (shards: -1)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-11 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-10 (null | 10.0.0.6) (shards: -1)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-1 (null | 10.0.0.5) (shards: 7)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-2025.1.0-dev-x86_64-2025-01-25T21-46-28 (NO RUNNER: NO RUNNER)

Test: longevity-10gb-3h-azure-repro-java-driver Test id: ed6aec62-676f-45d4-b256-17077796046b Test name: scylla-master/reproducers/longevity-10gb-3h-azure-repro-java-driver Test method: longevity_test.LongevityTest.test_custom_time Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor ed6aec62-676f-45d4-b256-17077796046b
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs ed6aec62-676f-45d4-b256-17077796046b

Logs:

Jenkins job URL Argus

yarongilor commented on Jan 26 '25

Are you sure it's not related to this one: https://github.com/scylladb/scylla-cluster-tests/issues/8855#issuecomment-2614500840

roydahan commented on Jan 26 '25

This issue is not accurately reported. It looks like the decommissioned node's IP is reused for a newly added node, so the driver did not necessarily get connectivity errors for the decommissioned node; more likely they came from the new node restarted by the nemesis.

< t:2024-12-18 02:03:52,813 f:cluster.py      l:2209 c:sdcm.cluster_azure   p:INFO  > Node longevity-10gb-3h-master-db-node-413a3a9b-eastus-8 [None | 10.0.0.6]: Found ScyllaDB version: 6.3.0~dev
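
Since the address alone is ambiguous once it is reused, one way to tell the old and new node apart is by host_id, which stays unique per node. A small sketch (3.x API, querying the system tables; contact point taken from this run):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class PeersByHostId {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.5")   // any live node from this run
                .build();
        Session session = cluster.connect();
        // system.peers keys entries by IP but also carries the immutable host_id,
        // so the old node-2 and the new node-8 (both 10.0.0.6) show up with
        // different UUIDs. Comparing host_id over time tells them apart.
        for (Row row : session.execute("SELECT peer, host_id FROM system.peers")) {
            System.out.printf("%s -> %s%n", row.getInet("peer"), row.getUUID("host_id"));
        }
        session.close();
        cluster.close();
    }
}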

yarongilor commented on Jan 28 '25

@yarongilor, could you please confirm that there are two issues:

  1. The driver tries to reconnect to a decommissioned node.
  2. c-s gets stuck and does not finish in time.

dkropachev commented on Jan 28 '25

@yarongilor, could you please confirm that there are two issues:

  1. The driver tries to reconnect to a decommissioned node.
  2. c-s gets stuck and does not finish in time.

As for the first one, I think it may be a false alarm, since the driver tried connecting to the IP of a new node, which happens to be identical to that of the decommissioned node. As for the second, I'm not sure what happened. Grafana does show a drop from 150k ops to ~60k after a rolling nemesis:

Image

So it might be that something went wrong with the driver following the two rolling-restart nemeses. There's also an error that didn't appear before, such as:

WARN  [cluster1-nio-worker-3] 2024-12-18 03:57:10,443 DefaultPromise.java:593 - An exception was thrown by com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete()

yarongilor commented on Jan 28 '25

As for the first one, I think it may be a false alarm, since the driver tried connecting to the IP of a new node, which happens to be identical to that of the decommissioned node.

Reproduced again. Cassandra stress test tries to connect to decommissioned nodes.

* node 10.142.0.40 - terminated at 02:24:06 | connection attempt at 03:40:14

* node 10.142.0.53 - terminated at 01:55:40 | connection attempt at 03:40:23

* node 10.142.0.59 - terminated at 01:15:19 | connection attempt at 03:40:28

I believe it is not a false alarm; we observe in many different runs (it might not reproduce in your particular one) that c-s tried to access nodes that were terminated 2 hours earlier.

pehala commented on Feb 04 '25