java-driver
[c-s][3.x] driver shutdown stuck
Issue description
- [ ] This issue is a regression.
- [x] It is unknown if this issue is a regression.
Near the end of the stress command one node (10.0.1.17) is gone, and the driver shutdown appears to be stuck:
total, 4425734, 19327, 19327, 19327, 4.1, 3.9, 6.9, 9.9, 24.0, 58.8, 325.0, 0.03100, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
total, 4508016, 16456, 16456, 16456, 4.8, 4.0, 9.6, 19.7, 91.6, 218.0, 330.0, 0.03065, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Channel has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Channel has been closed
total, 4583006, 14998, 14998, 14998, 5.3, 4.5, 11.2, 18.1, 31.7, 62.0, 335.0, 0.03018, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Channel has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total, 4677544, 18908, 18908, 18908, 4.2, 4.0, 6.4, 9.0, 14.4, 20.2, 340.0, 0.03007, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total, 4780430, 20577, 20577, 20577, 3.9, 3.7, 6.6, 8.5, 13.4, 16.5, 345.0, 0.03022, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total, 4882713, 20457, 20457, 20457, 3.9, 3.7, 6.6, 9.1, 14.4, 26.8, 350.0, 0.03032, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total, 4985711, 20600, 20600, 20600, 3.9, 3.7, 6.6, 9.0, 13.9, 23.9, 355.0, 0.03040, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total, 5048570, 20374, 20374, 20374, 3.9, 3.7, 6.7, 9.2, 19.2, 24.3, 358.1, 0.03041, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
Results:
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Op rate : 14,099 op/s [WRITE: 14,099 op/s]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Partition rate : 14,099 pk/s [WRITE: 14,099 pk/s]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Row rate : 14,099 row/s [WRITE: 14,099 row/s]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency mean : 5.6 ms [WRITE: 5.6 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency median : 4.8 ms [WRITE: 4.8 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency 95th percentile : 11.6 ms [WRITE: 11.6 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency 99th percentile : 20.7 ms [WRITE: 20.7 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency 99.9th percentile : 41.6 ms [WRITE: 41.6 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency max : 365.7 ms [WRITE: 365.7 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Total partitions : 5,048,570 [WRITE: 5,048,570]
WARN 00:44:50,156 Error creating netty channel to ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042
Total errors : 0 [WRITE: 0]
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042
Total GC count : 0
Caused by: java.net.ConnectException: Connection refused
Total GC memory : 0.000 KiB
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
Total GC time : 0.0 seconds
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
Avg GC time : NaN ms
at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
StdDev GC time : 0.0 ms
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
Total operation time : 00:05:58
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
END
Impact
The stress command gets stuck.
How frequently does it reproduce?
Seen once so far on the nightly runs.
Installation details
Kernel Version: 5.15.0-1049-aws
Scylla version (or git commit hash): 5.5.0~dev-20231113.7b08886e8dd8 with build-id 7548c48606c5b58f282fc2b596019226de0df6ed
Cluster size: 3 nodes (i4i.large)
Scylla Nodes used in this run:
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-9 (13.49.80.230 | 10.0.1.175) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-8 (51.20.109.226 | 10.0.3.74) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-7 (16.16.201.232 | 10.0.1.202) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-6 (13.51.48.183 | 10.0.1.141) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-5 (51.20.123.34 | 10.0.2.227) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-4 (13.51.178.61 | 10.0.1.211) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-3 (16.171.236.237 | 10.0.1.17) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-2 (51.20.144.28 | 10.0.1.61) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-17 (16.171.239.81 | 10.0.2.188) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-16 (16.171.54.15 | 10.0.1.41) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-15 (51.20.143.39 | 10.0.3.169) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-14 (16.16.64.126 | 10.0.0.194) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-13 (16.170.157.28 | 10.0.2.96) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-12 (51.20.66.232 | 10.0.2.150) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-11 (13.48.26.181 | 10.0.1.121) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-10 (51.20.106.233 | 10.0.1.211) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-1 (16.171.110.198 | 10.0.2.74) (shards: 2)
OS / Image: ami-02ea62bee686ef03b (aws: undefined_region)
Test: longevity-5gb-1h-NodeTerminateAndReplace-aws-test
Test id: dfb19859-9fbb-4408-a2be-a42c7400f7cb
Test name: scylla-master/nemesis/longevity-5gb-1h-NodeTerminateAndReplace-aws-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor dfb19859-9fbb-4408-a2be-a42c7400f7cb - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs dfb19859-9fbb-4408-a2be-a42c7400f7cb
Logs:
- db-cluster-dfb19859.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/db-cluster-dfb19859.tar.gz
- sct-runner-events-dfb19859.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/sct-runner-events-dfb19859.tar.gz
- sct-dfb19859.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/sct-dfb19859.log.tar.gz
- loader-set-dfb19859.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/loader-set-dfb19859.tar.gz
- monitor-set-dfb19859.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/monitor-set-dfb19859.tar.gz
Got a similar problem after the DecommissionStreamingErr nemesis during longevity-10gb-3h-azure-test (10.0.0.5 is the target node for this nemesis):
TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1
08:37:19,314 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback-tools.xml] at [file:/etc/scylla/cassandra/logback-tools.xml]
08:37:19,315 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback-tools.xml] occurs multiple times on the classpath.
08:37:19,315 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback-tools.xml] occurs at [file:/logback-tools.xml]
08:37:19,315 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback-tools.xml] occurs at [file:/etc/scylla/cassandra/logback-tools.xml]
08:37:19,412 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
08:37:19,413 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
08:37:19,420 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDERR]
08:37:19,428 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
08:37:19,489 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to WARN
08:37:19,489 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDERR] to Logger[ROOT]
08:37:19,489 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
08:37:19,490 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@2781e022 - Registering current configuration as safe fallback point
******************** Stress Settings ********************
Command:
Type: write
Count: -1
Duration: 180 MINUTES
...
===== Using optimized driver!!! =====
Connected to cluster: longevity-10gb-3h-master-db-cluster-6c30fd46, max pending requests per connection null, max connections per host 8
Datatacenter: eastus; Host: /10.0.0.5; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.6; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.7; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.8; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.9; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.10; Rack: eastus
Created keyspaces. Sleeping 6s for propagation.
Sleeping 2s...
Running WRITE with 1000 threads 180 minutes
type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb
Failed to connect over JMX; not collecting these stats
total, 132151, 26430, 26430, 26430, 33.3, 19.7, 109.8, 185.7, 335.8, 459.8, 5.0, 0.00000, 0, 0, 0, 0, 0, 0
total, 366429, 46856, 46856, 46856, 21.3, 17.7, 51.7, 73.6, 110.0, 154.9, 10.0, 0.17570, 0, 0, 0, 0, 0, 0
...
total, 662360477, 78539, 78539, 78539, 12.7, 6.7, 45.6, 75.8, 114.6, 161.6,10800.0, 0.00426, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
total, 662377070, 63469, 63469, 63469, 14.8, 9.3, 46.0, 68.8, 89.3, 100.1,10800.3, 0.00427, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Results:
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Op rate : 61,330 op/s [WRITE: 61,330 op/s]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Partition rate : 61,330 pk/s [WRITE: 61,330 pk/s]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Row rate : 61,330 row/s [WRITE: 61,330 row/s]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency mean : 16.3 ms [WRITE: 16.3 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency median : 6.8 ms [WRITE: 6.8 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency 95th percentile : 61.4 ms [WRITE: 61.4 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency 99th percentile : 149.6 ms [WRITE: 149.6 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency 99.9th percentile : 488.1 ms [WRITE: 488.1 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency max : 42312.1 ms [WRITE: 42,312.1 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total partitions : 662,377,070 [WRITE: 662,377,070]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total errors : 0 [WRITE: 0]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total GC count : 0
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total GC memory : 0.000 KiB
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total GC time : 0.0 seconds
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Avg GC time : NaN ms
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
StdDev GC time : 0.0 ms
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total operation time : 03:00:00
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
END
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
...
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Connection has been closed
...
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Connection has been closed
Installation details
Kernel Version: 5.15.0-1051-azure
Scylla version (or git commit hash): 5.5.0~dev-20231118.6bcf3ac86c64 with build-id 61307c96d481b470546e0fd8d94a69a544c4f05c
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-7 (172.191.228.136 | 10.0.0.9) (shards: 7)
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-6 (172.173.140.123 | 10.0.0.10) (shards: 7)
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-5 (172.173.140.36 | 10.0.0.9) (shards: 7)
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-4 (172.173.140.33 | 10.0.0.8) (shards: 7)
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-3 (172.173.140.15 | 10.0.0.7) (shards: 7)
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-2 (172.173.139.243 | 10.0.0.6) (shards: 7)
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-1 (172.173.139.184 | 10.0.0.5) (shards: 7)
OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.5.0-dev-x86_64-2023-11-18T06-26-15 (azure: undefined_region)
Test: longevity-10gb-3h-azure-test
Test id: 6c30fd46-3432-426f-824f-362eb6107b4b
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 6c30fd46-3432-426f-824f-362eb6107b4b - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 6c30fd46-3432-426f-824f-362eb6107b4b
Logs:
- db-cluster-6c30fd46.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/db-cluster-6c30fd46.tar.gz
- sct-runner-events-6c30fd46.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/sct-runner-events-6c30fd46.tar.gz
- sct-6c30fd46.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/sct-6c30fd46.log.tar.gz
- loader-set-6c30fd46.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/loader-set-6c30fd46.tar.gz
- monitor-set-6c30fd46.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/monitor-set-6c30fd46.tar.gz
- parallel-timelines-report-6c30fd46.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/parallel-timelines-report-6c30fd46.tar.gz
The job has been failing like this at least since build 244.
@avelanarius - is anyone looking at this?

> @avelanarius - is anyone looking at this?

Currently no one, with other tasks taking priority.
cassandra-stress hangs after finishing its stress work:
total, 527518456, 55284, 55284, 55284, 18.3, 12.6, 55.1, 83.6, 105.1, 116.9,10800.9, 0.11788, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Results:
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Op rate : 48,840 op/s [WRITE: 48,840 op/s]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Partition rate : 48,840 pk/s [WRITE: 48,840 pk/s]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Row rate : 48,840 row/s [WRITE: 48,840 row/s]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency mean : 20.3 ms [WRITE: 20.3 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency median : 12.7 ms [WRITE: 12.7 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency 95th percentile : 55.2 ms [WRITE: 55.2 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency 99th percentile : 87.8 ms [WRITE: 87.8 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency 99.9th percentile : 169.2 ms [WRITE: 169.2 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency max : 90664.1 ms [WRITE: 90,664.1 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total partitions : 527,518,456 [WRITE: 527,518,456]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total errors : 0 [WRITE: 0]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total GC count : 0
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total GC memory : 0.000 KiB
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total GC time : 0.0 seconds
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Avg GC time : NaN ms
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
StdDev GC time : 0.0 ms
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total operation time : 03:00:00
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
END
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
...
Installation details
Kernel Version: 5.15.0-1048-gcp
Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3 with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c
Cluster size: 6 nodes (n2-highmem-16)
Scylla Nodes used in this run:
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-8 (34.74.6.225 | 10.142.0.5) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-7 (35.237.83.141 | 10.142.0.22) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-6 (35.196.144.201 | 10.142.0.118) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-5 (34.74.203.113 | 10.142.0.121) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-4 (34.139.221.96 | 10.142.0.120) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-3 (35.231.239.128 | 10.142.0.119) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-2 (34.23.227.220 | 10.142.0.117) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-1 (34.23.49.11 | 10.142.0.68) (shards: 14)
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/2925864426790911027 (gce: undefined_region)
Test: longevity-10gb-3h-gce-test
Test id: f4ac5c0d-4243-4afe-9c28-b9e5fe051b92
Test name: enterprise-2024.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor f4ac5c0d-4243-4afe-9c28-b9e5fe051b92 - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs f4ac5c0d-4243-4afe-9c28-b9e5fe051b92
Logs:
- db-cluster-f4ac5c0d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/db-cluster-f4ac5c0d.tar.gz
- sct-runner-events-f4ac5c0d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/sct-runner-events-f4ac5c0d.tar.gz
- sct-f4ac5c0d.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/sct-f4ac5c0d.log.tar.gz
- loader-set-f4ac5c0d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/loader-set-f4ac5c0d.tar.gz
- monitor-set-f4ac5c0d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/monitor-set-f4ac5c0d.tar.gz
- parallel-timelines-report-f4ac5c0d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/parallel-timelines-report-f4ac5c0d.tar.gz
One more repro here:
Installation details
Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 2023.1.2-20231001.646df23cc4b3 with build-id 367fcf1672d44f5cbddc88f946cf272e2551b85a
Target Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3 with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c
Cluster size: 6 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
- multi-dc-rolling-upgrade--ubuntu-fo-db-node-66b220bf-6 (3.9.144.36 | 10.3.1.251) (shards: 7)
- multi-dc-rolling-upgrade--ubuntu-fo-db-node-66b220bf-5 (18.170.227.216 | 10.3.3.49) (shards: 7)
- multi-dc-rolling-upgrade--ubuntu-fo-db-node-66b220bf-4 (3.8.120.31 | 10.3.0.150) (shards: 7)
- multi-dc-rolling-upgrade--ubuntu-fo-db-node-66b220bf-3 (34.245.175.43 | 10.4.1.178) (shards: 7)
- multi-dc-rolling-upgrade--ubuntu-fo-db-node-66b220bf-2 (52.19.224.202 | 10.4.3.109) (shards: 7)
- multi-dc-rolling-upgrade--ubuntu-fo-db-node-66b220bf-1 (63.35.164.254 | 10.4.1.139) (shards: 7)
OS / Image: ami-0bca3ed0b1ef2f2fa ami-001b0989b7cf54ded (aws: undefined_region)
Test: rolling-upgrade-multidc-test
Test id: 66b220bf-623f-460a-bd36-d6fa1dc496ba
Test name: enterprise-2024.1/rolling-upgrade/rolling-upgrade-multidc-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 66b220bf-623f-460a-bd36-d6fa1dc496ba - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 66b220bf-623f-460a-bd36-d6fa1dc496ba
Logs:
- db-cluster-66b220bf.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/db-cluster-66b220bf.tar.gz
- sct-runner-events-66b220bf.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/sct-runner-events-66b220bf.tar.gz
- sct-66b220bf.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/sct-66b220bf.log.tar.gz
- loader-set-66b220bf.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/loader-set-66b220bf.tar.gz
- monitor-set-66b220bf.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/monitor-set-66b220bf.tar.gz
The last report from @vponomaryov is reproducible; running it now with c-s debug enabled:
https://argus.scylladb.com/test/7748483c-dd44-4306-9e69-97c97214f482/runs?additionalRuns[]=a242eed1-bb86-48e4-b27b-aec40cbb23db
@avelanarius / @Bouncheck let's try to at least triage it ASAP. We have a fairly easy way to reproduce it; let's try to understand where the issue is and whether it's a recent regression or not.
I haven't understood the root cause yet; however, I have a modification in mind that may help ensure that shutdown happens. I'll consult @avelanarius first.
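For illustration, a minimal sketch of one possible direction (assuming direct use of the 3.x Cluster/CloseFuture API; this is not necessarily the actual modification being considered): bound the graceful close with a timeout and fall back to force(), so the client never waits indefinitely on in-flight requests to a dead node.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.datastax.driver.core.CloseFuture;
import com.datastax.driver.core.Cluster;

final class BoundedShutdown {
    // Attempt a graceful close first; if it does not complete within the timeout,
    // force-close so the process is not stuck waiting for in-flight requests.
    static void closeWithTimeout(Cluster cluster, long timeout, TimeUnit unit) {
        CloseFuture future = cluster.closeAsync();
        try {
            future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.force();                       // give up on pending requests and close now
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt status
            future.force();
        } catch (ExecutionException e) {
            future.force();                       // graceful close failed; release resources anyway
        }
    }
}
```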
I relaunched the Jenkins reproducer with DEBUG logs: https://jenkins.scylladb.com/job/enterprise-2024.1/job/reproducers/job/rolling-upgrade-multidc-test/5/. The run mentioned here:

> the last report from @vponomaryov is something that is reproducible running it now with c-s debug enabled https://argus.scylladb.com/test/7748483c-dd44-4306-9e69-97c97214f482/runs?additionalRuns[]=a242eed1-bb86-48e4-b27b-aec40cbb23db

didn't yield useful logs (we're not even sure that the actual problem reproduced...).
@Bouncheck I see that the latest run reproduced the problem correctly, and now there are good DEBUG logs to help understand the situation; see file cassandra-stress-read-l0-c0-k1-2494aaab-4173-4ff4-81e0-1b805462a533.log.
And as discussed with @roydahan, please try to arrange a quick call with @fruch to make sure we're on the same page (and that we know how to conveniently run the reproducer locally, etc.).
(As I understand it, even though this is not a P0, it affects a lot of Enterprise CI jobs, so it could become urgent soon.)
@Bouncheck any progress on this one?
Nothing definitive yet. I'm currently considering two possibilities: something is not being closed properly, and/or waiting for all writes to finish (with failures) after initiating the graceful shutdown simply takes too long.
In the logs of the first example I've seen that one of the cassandra-stress runs never outputs an "END" or "FAILURE" string at any point, even though the logged times go past the set duration of the stress command. In the same run there is also a single NPE that happens when the driver tries to close one connection pool at the very start. It looks like a race condition, because the code it points to checks for null before calling the method. I don't know yet whether it causes anything to hang or whether it's benign.
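For illustration only, a hypothetical sketch of that kind of check-then-act race (the ConnectionPool stand-in and method names are invented here, not the driver's actual code), together with the usual local-capture fix:

```java
// Hypothetical illustration of a check-then-act race on a field that another
// thread can null out concurrently (e.g. during shutdown).
class PoolHolder {
    interface ConnectionPool { void closeAsync(); }   // stand-in for a per-host pool

    private volatile ConnectionPool pool;             // may be set to null by another shutdown path

    void closePoolRacy() {
        if (pool != null) {        // check passes...
            pool.closeAsync();     // ...but the field may become null in between -> NPE
        }
    }

    void closePoolSafely() {
        ConnectionPool p = pool;   // read the field once into a local
        if (p != null) {
            p.closeAsync();        // operates on the captured reference, no NPE possible here
        }
    }
}
```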
In the logs of the second example I can see that all cassandra-stress runs technically end, but there is a lot of driver spam even after that. This is probably because all writes started before the shutdown need to finish. In that specific case it's hard to tell whether it happens within a reasonable timeframe, due to the lack of timestamps. In some logs I've seen the shutdown take several (e.g. 5) minutes.
In the last run, I can see in the Events tab in Argus that the test failed due to a timeout:
2024-01-08 19:47:17.500: (TestTimeoutEvent Severity.CRITICAL) period_type=not-set event_id=6b125f8c-60e1-45dc-ad15-6c21b0d15b84, Test started at 2024-01-08 13:47:17, reached it's timeout (360 minute)
and, it seems, even a c-s-related error:
2024-01-08 19:48:09.758: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=c15c384e-18fb-4a72-acde-492d399ec0d1, source=UpgradeTest.test_generic_cluster_upgrade (upgrade_test.UpgradeTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 985, in test_generic_cluster_upgrade
self._run_stress_workload("stress_after_cluster_upgrade", wait_for_finish=True)
File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 911, in _run_stress_workload
self.verify_stress_thread(thread_pool)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 2091, in verify_stress_thread
results, errors = cs_thread_pool.verify_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 379, in verify_results
results = super().get_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress/base.py", line 88, in get_results
for future in concurrent.futures.as_completed(self.results_futures, timeout=timeout):
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 245, in as_completed
waiter.event.wait(wait_timeout)
File "/usr/local/lib/python3.10/threading.py", line 607, in wait
signaled = self._cond.wait(timeout)
File "/usr/local/lib/python3.10/threading.py", line 324, in wait
gotit = waiter.acquire(True, timeout)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 263, in critical_failure_handler
raise CriticalTestFailure("Critical Error has failed the test") # pylint: disable=raise-missing-from
sdcm.tester.CriticalTestFailure: Critical Error has failed the test
But all c-s logs from the loader set do have the "END" message, and the timestamps do not match the timeout. There is one suspicious c-s log (cassandra-stress-read-l1-c0-k1-bffede14-8924-4da1-9644-3876601c74e9.log) which does not end with the "freed threads" lines. It looks like it ends abruptly, but its last timestamped line is at 16:10:57,050, while the timeout reported in Argus happens at 19:47:17.500. I am assuming the clocks are not off by more than an hour, because the earliest stress logs I've found were within minutes of the start of the whole test. The times also match the read and write charts.
I am not sure how to explain this one.
To expedite fixing the problem, @Bouncheck will try:
- Rerunning the job with higher timeout (rather than the current +5%)
- Maybe adding even more logging (with custom driver)
- Maybe running tcpdump (in one run we saw one connection still having a lot of in-flight requests, and we don't know where they originated from)
- Maybe running many runs in parallel to make sure our iteration time is faster and we hit the problem more often
@roydahan fyi: @Bouncheck was/is in contact with @fruch to figure out the problem and ask him about SCT/this issue
@fruch / @Bouncheck I think that longevity-10gb-3h-gce-test is one of the jobs that reproduces it consistently. @fruch maybe we can try to reproduce it with a shorter duration.
A tcpdump of the entire test is not feasible in terms of size, so we need to find the correct spot to start and end it.
> To expedite fixing the problem, @Bouncheck will try:
> - Rerunning the job with higher timeout (rather than the current +5%)

This can be changed in: https://github.com/scylladb/scylla-cluster-tests/blob/master/sdcm/stress_thread.py#L365

> - Maybe adding even more logging (with custom driver)

Supply us a fork/branch with those, and we can build a Docker image that we can use for reproducers of this issue.

> - Maybe running tcpdump (in one run we saw one connection still having a lot of in-flight requests, and we don't know where they originated from)
> - Maybe running many runs in parallel to make sure our iteration time is faster and we hit the problem more often

The cases where it happens the most are longevity-10gb-3h-azure-test and longevity-10gb-3h-gce-test. I don't mind changing the c-s image in those two until we can figure this out.

> @roydahan fyi: @Bouncheck was/is in contact with @fruch to figure out the problem and ask him about SCT/this issue
I have trouble understanding some of the logs.
For example, in this run: https://jenkins.scylladb.com/job/scylla-5.4/job/longevity/job/longevity-10gb-3h-gce-test/9/
in loader-set-3e9f3337/cassandra-stress-write-l0-c0-k1-d967395a-8701-4848-9d4c-5225146c1a96.log,
I can see the following lines:
19:51:31,810 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@76a3e297 - Registering current configuration as safe fallback point
******************** Stress Settings ********************
Command:
Type: write
Count: -1
Duration: 180 MINUTES
No Warmup: true
Consistency Level: QUORUM
[...]
WARN 19:53:44,446 Error creating netty channel to longevity-10gb-3h-5-4-db-node-3e9f3337-0-6.c.sct-project-1.internal/10.142.0.108:9042
[...]
Total operation time : 03:00:00
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
END
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
WARN 20:42:09,958 Error creating netty channel to longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042
[...] stands in for all the in-between logs. What confuses me about those logs is that a stress command which seems to be configured for 180 minutes, and reports a 3h total running time at the end, is enclosed between timestamps that are less than 1 hour apart, assuming the timestamps are correct. Is the total running time something different from how long the command actually runs? Is there an easy way to confirm at what times the cluster was under load?
> Is the total running time something different from how long the command actually runs? Is there an easy way to confirm at what times the cluster was under load?
Yes, downloading the monitoring data and looking at the metrics would help with that:
$ hydra investigate show-monitor f4ac5c0d-4243-4afe-9c28-b9e5fe051b92
You just need to find the relevant test_id for the run you want to check out.
@Bouncheck have you made any progress on that? Next week could be a good opportunity to sit together with someone from @fruch's team (e.g. @soyacz) and try debugging it together.
Not really, but I'm not out of options yet. Sounds like a good idea, I'll reach out.
@dkropachev any idea how to push this forward? It's happening on an almost daily basis on the SCT short longevities.
@fruch, we will take care of it next week.