java-driver
[c-s][3.x] driver shutdown stuck
Issue description
- [ ] This issue is a regression.
- [x] It is unknown if this issue is a regression.
Near the end of the stress command one node (10.0.1.17) is gone, and the driver shutdown appears to be stuck:
total, 4425734, 19327, 19327, 19327, 4.1, 3.9, 6.9, 9.9, 24.0, 58.8, 325.0, 0.03100, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
total, 4508016, 16456, 16456, 16456, 4.8, 4.0, 9.6, 19.7, 91.6, 218.0, 330.0, 0.03065, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Channel has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Channel has been closed
total, 4583006, 14998, 14998, 14998, 5.3, 4.5, 11.2, 18.1, 31.7, 62.0, 335.0, 0.03018, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Channel has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total, 4677544, 18908, 18908, 18908, 4.2, 4.0, 6.4, 9.0, 14.4, 20.2, 340.0, 0.03007, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total, 4780430, 20577, 20577, 20577, 3.9, 3.7, 6.6, 8.5, 13.4, 16.5, 345.0, 0.03022, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total, 4882713, 20457, 20457, 20457, 3.9, 3.7, 6.6, 9.1, 14.4, 26.8, 350.0, 0.03032, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total, 4985711, 20600, 20600, 20600, 3.9, 3.7, 6.6, 9.0, 13.9, 23.9, 355.0, 0.03040, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total, 5048570, 20374, 20374, 20374, 3.9, 3.7, 6.7, 9.2, 19.2, 24.3, 358.1, 0.03041, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
Results:
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Op rate : 14,099 op/s [WRITE: 14,099 op/s]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Partition rate : 14,099 pk/s [WRITE: 14,099 pk/s]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Row rate : 14,099 row/s [WRITE: 14,099 row/s]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency mean : 5.6 ms [WRITE: 5.6 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency median : 4.8 ms [WRITE: 4.8 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency 95th percentile : 11.6 ms [WRITE: 11.6 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency 99th percentile : 20.7 ms [WRITE: 20.7 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency 99.9th percentile : 41.6 ms [WRITE: 41.6 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency max : 365.7 ms [WRITE: 365.7 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Total partitions : 5,048,570 [WRITE: 5,048,570]
WARN 00:44:50,156 Error creating netty channel to ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042
Total errors : 0 [WRITE: 0]
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042
Total GC count : 0
Caused by: java.net.ConnectException: Connection refused
Total GC memory : 0.000 KiB
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
Total GC time : 0.0 seconds
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
Avg GC time : NaN ms
at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
StdDev GC time : 0.0 ms
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
Total operation time : 00:05:58
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
END
Impact
The stress command gets stuck.
How frequently does it reproduce?
Seen once so far on the nightly runs.
Installation details
Kernel Version: 5.15.0-1049-aws
Scylla version (or git commit hash): 5.5.0~dev-20231113.7b08886e8dd8 with build-id 7548c48606c5b58f282fc2b596019226de0df6ed
Cluster size: 3 nodes (i4i.large)
Scylla Nodes used in this run:
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-9 (13.49.80.230 | 10.0.1.175) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-8 (51.20.109.226 | 10.0.3.74) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-7 (16.16.201.232 | 10.0.1.202) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-6 (13.51.48.183 | 10.0.1.141) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-5 (51.20.123.34 | 10.0.2.227) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-4 (13.51.178.61 | 10.0.1.211) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-3 (16.171.236.237 | 10.0.1.17) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-2 (51.20.144.28 | 10.0.1.61) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-17 (16.171.239.81 | 10.0.2.188) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-16 (16.171.54.15 | 10.0.1.41) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-15 (51.20.143.39 | 10.0.3.169) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-14 (16.16.64.126 | 10.0.0.194) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-13 (16.170.157.28 | 10.0.2.96) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-12 (51.20.66.232 | 10.0.2.150) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-11 (13.48.26.181 | 10.0.1.121) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-10 (51.20.106.233 | 10.0.1.211) (shards: 2)
- longevity-5gb-1h-NodeTerminateAndRe-db-node-dfb19859-1 (16.171.110.198 | 10.0.2.74) (shards: 2)
OS / Image: ami-02ea62bee686ef03b (aws: undefined_region)
Test: longevity-5gb-1h-NodeTerminateAndReplace-aws-test
Test id: dfb19859-9fbb-4408-a2be-a42c7400f7cb
Test name: scylla-master/nemesis/longevity-5gb-1h-NodeTerminateAndReplace-aws-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor dfb19859-9fbb-4408-a2be-a42c7400f7cb - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs dfb19859-9fbb-4408-a2be-a42c7400f7cb
Logs:
- db-cluster-dfb19859.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/db-cluster-dfb19859.tar.gz
- sct-runner-events-dfb19859.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/sct-runner-events-dfb19859.tar.gz
- sct-dfb19859.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/sct-dfb19859.log.tar.gz
- loader-set-dfb19859.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/loader-set-dfb19859.tar.gz
- monitor-set-dfb19859.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/monitor-set-dfb19859.tar.gz
Got a similar problem after the DecommissionStreamingErr nemesis during longevity-10gb-3h-azure-test (10.0.0.5 is the target node for this nemesis):
TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1
08:37:19,314 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback-tools.xml] at [file:/etc/scylla/cassandra/logback-tools.xml]
08:37:19,315 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback-tools.xml] occurs multiple times on the classpath.
08:37:19,315 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback-tools.xml] occurs at [file:/logback-tools.xml]
08:37:19,315 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback-tools.xml] occurs at [file:/etc/scylla/cassandra/logback-tools.xml]
08:37:19,412 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
08:37:19,413 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
08:37:19,420 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDERR]
08:37:19,428 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
08:37:19,489 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to WARN
08:37:19,489 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDERR] to Logger[ROOT]
08:37:19,489 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
08:37:19,490 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@2781e022 - Registering current configuration as safe fallback point
******************** Stress Settings ********************
Command:
Type: write
Count: -1
Duration: 180 MINUTES
...
===== Using optimized driver!!! =====
Connected to cluster: longevity-10gb-3h-master-db-cluster-6c30fd46, max pending requests per connection null, max connections per host 8
Datatacenter: eastus; Host: /10.0.0.5; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.6; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.7; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.8; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.9; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.10; Rack: eastus
Created keyspaces. Sleeping 6s for propagation.
Sleeping 2s...
Running WRITE with 1000 threads 180 minutes
type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb
Failed to connect over JMX; not collecting these stats
total, 132151, 26430, 26430, 26430, 33.3, 19.7, 109.8, 185.7, 335.8, 459.8, 5.0, 0.00000, 0, 0, 0, 0, 0, 0
total, 366429, 46856, 46856, 46856, 21.3, 17.7, 51.7, 73.6, 110.0, 154.9, 10.0, 0.17570, 0, 0, 0, 0, 0, 0
...
total, 662360477, 78539, 78539, 78539, 12.7, 6.7, 45.6, 75.8, 114.6, 161.6,10800.0, 0.00426, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
total, 662377070, 63469, 63469, 63469, 14.8, 9.3, 46.0, 68.8, 89.3, 100.1,10800.3, 0.00427, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Results:
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Op rate : 61,330 op/s [WRITE: 61,330 op/s]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Partition rate : 61,330 pk/s [WRITE: 61,330 pk/s]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Row rate : 61,330 row/s [WRITE: 61,330 row/s]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency mean : 16.3 ms [WRITE: 16.3 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency median : 6.8 ms [WRITE: 6.8 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency 95th percentile : 61.4 ms [WRITE: 61.4 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency 99th percentile : 149.6 ms [WRITE: 149.6 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency 99.9th percentile : 488.1 ms [WRITE: 488.1 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency max : 42312.1 ms [WRITE: 42,312.1 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total partitions : 662,377,070 [WRITE: 662,377,070]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total errors : 0 [WRITE: 0]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total GC count : 0
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total GC memory : 0.000 KiB
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total GC time : 0.0 seconds
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Avg GC time : NaN ms
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
StdDev GC time : 0.0 ms
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total operation time : 03:00:00
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
END
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
...
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Connection has been closed
...
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Connection has been closed
Installation details
Kernel Version: 5.15.0-1051-azure
Scylla version (or git commit hash): 5.5.0~dev-20231118.6bcf3ac86c64 with build-id 61307c96d481b470546e0fd8d94a69a544c4f05c
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-7 (172.191.228.136 | 10.0.0.9) (shards: 7)
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-6 (172.173.140.123 | 10.0.0.10) (shards: 7)
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-5 (172.173.140.36 | 10.0.0.9) (shards: 7)
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-4 (172.173.140.33 | 10.0.0.8) (shards: 7)
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-3 (172.173.140.15 | 10.0.0.7) (shards: 7)
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-2 (172.173.139.243 | 10.0.0.6) (shards: 7)
- longevity-10gb-3h-master-db-node-6c30fd46-eastus-1 (172.173.139.184 | 10.0.0.5) (shards: 7)
OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.5.0-dev-x86_64-2023-11-18T06-26-15 (azure: undefined_region)
Test: longevity-10gb-3h-azure-test
Test id: 6c30fd46-3432-426f-824f-362eb6107b4b
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 6c30fd46-3432-426f-824f-362eb6107b4b - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 6c30fd46-3432-426f-824f-362eb6107b4b
Logs:
- db-cluster-6c30fd46.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/db-cluster-6c30fd46.tar.gz
- sct-runner-events-6c30fd46.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/sct-runner-events-6c30fd46.tar.gz
- sct-6c30fd46.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/sct-6c30fd46.log.tar.gz
- loader-set-6c30fd46.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/loader-set-6c30fd46.tar.gz
- monitor-set-6c30fd46.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/monitor-set-6c30fd46.tar.gz
- parallel-timelines-report-6c30fd46.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/parallel-timelines-report-6c30fd46.tar.gz
The job has been failing like this at least since build 244.
@avelanarius - is anyone looking at this?

> @avelanarius - is anyone looking at this?

Currently no one, with other tasks taking priority.
cassandra-stress hangs after finishing its stress work:
total, 527518456, 55284, 55284, 55284, 18.3, 12.6, 55.1, 83.6, 105.1, 116.9,10800.9, 0.11788, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Results:
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Op rate : 48,840 op/s [WRITE: 48,840 op/s]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Partition rate : 48,840 pk/s [WRITE: 48,840 pk/s]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Row rate : 48,840 row/s [WRITE: 48,840 row/s]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency mean : 20.3 ms [WRITE: 20.3 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency median : 12.7 ms [WRITE: 12.7 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency 95th percentile : 55.2 ms [WRITE: 55.2 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency 99th percentile : 87.8 ms [WRITE: 87.8 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency 99.9th percentile : 169.2 ms [WRITE: 169.2 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency max : 90664.1 ms [WRITE: 90,664.1 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total partitions : 527,518,456 [WRITE: 527,518,456]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total errors : 0 [WRITE: 0]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total GC count : 0
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total GC memory : 0.000 KiB
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total GC time : 0.0 seconds
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Avg GC time : NaN ms
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
StdDev GC time : 0.0 ms
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total operation time : 03:00:00
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
END
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
...
Installation details
Kernel Version: 5.15.0-1048-gcp
Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3 with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c
Cluster size: 6 nodes (n2-highmem-16)
Scylla Nodes used in this run:
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-8 (34.74.6.225 | 10.142.0.5) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-7 (35.237.83.141 | 10.142.0.22) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-6 (35.196.144.201 | 10.142.0.118) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-5 (34.74.203.113 | 10.142.0.121) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-4 (34.139.221.96 | 10.142.0.120) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-3 (35.231.239.128 | 10.142.0.119) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-2 (34.23.227.220 | 10.142.0.117) (shards: 14)
- longevity-10gb-3h-2024-1-db-node-f4ac5c0d-0-1 (34.23.49.11 | 10.142.0.68) (shards: 14)
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/2925864426790911027 (gce: undefined_region)
Test: longevity-10gb-3h-gce-test
Test id: f4ac5c0d-4243-4afe-9c28-b9e5fe051b92
Test name: enterprise-2024.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor f4ac5c0d-4243-4afe-9c28-b9e5fe051b92 - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs f4ac5c0d-4243-4afe-9c28-b9e5fe051b92
Logs:
- db-cluster-f4ac5c0d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/db-cluster-f4ac5c0d.tar.gz
- sct-runner-events-f4ac5c0d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/sct-runner-events-f4ac5c0d.tar.gz
- sct-f4ac5c0d.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/sct-f4ac5c0d.log.tar.gz
- loader-set-f4ac5c0d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/loader-set-f4ac5c0d.tar.gz
- monitor-set-f4ac5c0d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/monitor-set-f4ac5c0d.tar.gz
- parallel-timelines-report-f4ac5c0d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/parallel-timelines-report-f4ac5c0d.tar.gz
One more repro here:
Installation details
Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 2023.1.2-20231001.646df23cc4b3 with build-id 367fcf1672d44f5cbddc88f946cf272e2551b85a
Target Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3 with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c
Cluster size: 6 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
- multi-dc-rolling-upgrade--ubuntu-fo-db-node-66b220bf-6 (3.9.144.36 | 10.3.1.251) (shards: 7)
- multi-dc-rolling-upgrade--ubuntu-fo-db-node-66b220bf-5 (18.170.227.216 | 10.3.3.49) (shards: 7)
- multi-dc-rolling-upgrade--ubuntu-fo-db-node-66b220bf-4 (3.8.120.31 | 10.3.0.150) (shards: 7)
- multi-dc-rolling-upgrade--ubuntu-fo-db-node-66b220bf-3 (34.245.175.43 | 10.4.1.178) (shards: 7)
- multi-dc-rolling-upgrade--ubuntu-fo-db-node-66b220bf-2 (52.19.224.202 | 10.4.3.109) (shards: 7)
- multi-dc-rolling-upgrade--ubuntu-fo-db-node-66b220bf-1 (63.35.164.254 | 10.4.1.139) (shards: 7)
OS / Image: ami-0bca3ed0b1ef2f2fa ami-001b0989b7cf54ded (aws: undefined_region)
Test: rolling-upgrade-multidc-test
Test id: 66b220bf-623f-460a-bd36-d6fa1dc496ba
Test name: enterprise-2024.1/rolling-upgrade/rolling-upgrade-multidc-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 66b220bf-623f-460a-bd36-d6fa1dc496ba - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 66b220bf-623f-460a-bd36-d6fa1dc496ba
Logs:
- db-cluster-66b220bf.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/db-cluster-66b220bf.tar.gz
- sct-runner-events-66b220bf.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/sct-runner-events-66b220bf.tar.gz
- sct-66b220bf.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/sct-66b220bf.log.tar.gz
- loader-set-66b220bf.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/loader-set-66b220bf.tar.gz
- monitor-set-66b220bf.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/monitor-set-66b220bf.tar.gz
The last report from @vponomaryov is reproducible; running it now with c-s debug enabled:
https://argus.scylladb.com/test/7748483c-dd44-4306-9e69-97c97214f482/runs?additionalRuns[]=a242eed1-bb86-48e4-b27b-aec40cbb23db
@avelanarius / @Bouncheck let's try to at least triage it ASAP. We have a fairly easy way to reproduce it; let's try to understand where the issue is and whether it's a recent regression or not.
I haven't understood the root cause yet; however, I have a modification in mind that may help ensure that shutdown happens. I'll consult @avelanarius first.
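For illustration, a minimal sketch of one possible direction (assuming direct use of the 3.x Cluster/CloseFuture API; this is not necessarily the actual modification being considered): bound the graceful close with a timeout and fall back to force(), so the client never waits indefinitely on in-flight requests to a dead node.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.datastax.driver.core.CloseFuture;
import com.datastax.driver.core.Cluster;

final class BoundedShutdown {
    // Attempt a graceful close first; if it does not complete within the timeout,
    // force-close so the process is not stuck waiting for in-flight requests.
    static void closeWithTimeout(Cluster cluster, long timeout, TimeUnit unit) {
        CloseFuture future = cluster.closeAsync();
        try {
            future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.force();                       // give up on pending requests and close now
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt status
            future.force();
        } catch (ExecutionException e) {
            future.force();                       // graceful close failed; release resources anyway
        }
    }
}
```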
I relaunched the Jenkins reproducer with DEBUG logs: https://jenkins.scylladb.com/job/enterprise-2024.1/job/reproducers/job/rolling-upgrade-multidc-test/5/. The run mentioned here:

> the last report from @vponomaryov is something that is reproducible running it now with c-s debug enabled https://argus.scylladb.com/test/7748483c-dd44-4306-9e69-97c97214f482/runs?additionalRuns[]=a242eed1-bb86-48e4-b27b-aec40cbb23db

didn't yield useful logs (we're not even sure that the actual problem reproduced...).
@Bouncheck I see that the latest run reproduced the problem correctly, and now there are good DEBUG logs to help understand the situation; see file cassandra-stress-read-l0-c0-k1-2494aaab-4173-4ff4-81e0-1b805462a533.log.
And as discussed with @roydahan, please try to arrange a quick call with @fruch to make sure we're on the same page (and that we know how to conveniently run the reproducer locally, etc.).
(As I understand it, even though this is not a P0, it affects a lot of Enterprise CI jobs, so it could become urgent soon.)
@Bouncheck any progress on this one?
Nothing definitive yet. I'm currently considering two possibilities: something is not being closed properly, and/or waiting for all writes to finish (with failures) after initiating the graceful shutdown simply takes too long.
In the logs of the first example I've seen that one of the cassandra-stress runs never outputs an "END" or "FAILURE" string at any point, even though the logged times go past the set duration of the stress command. In the same run there is also a single NPE that happens when the driver tries to close one connection pool at the very start. It looks like a race condition, because the code it points to checks for null before calling the method. I don't know yet whether it causes anything to hang or whether it's benign.
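For illustration only, a hypothetical sketch of that kind of check-then-act race (the ConnectionPool stand-in and method names are invented here, not the driver's actual code), together with the usual local-capture fix:

```java
// Hypothetical illustration of a check-then-act race on a field that another
// thread can null out concurrently (e.g. during shutdown).
class PoolHolder {
    interface ConnectionPool { void closeAsync(); }   // stand-in for a per-host pool

    private volatile ConnectionPool pool;             // may be set to null by another shutdown path

    void closePoolRacy() {
        if (pool != null) {        // check passes...
            pool.closeAsync();     // ...but the field may become null in between -> NPE
        }
    }

    void closePoolSafely() {
        ConnectionPool p = pool;   // read the field once into a local
        if (p != null) {
            p.closeAsync();        // operates on the captured reference, no NPE possible here
        }
    }
}
```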
In the logs of the second example I can see that all cassandra-stress runs technically end, but there is a lot of driver spam even after that. This is probably because all writes started before the shutdown need to finish. In that specific case it's hard to tell whether it happens within a reasonable timeframe, due to the lack of timestamps. In some logs I've seen the shutdown take several (e.g. 5) minutes.
In the last run, I can see in the Events tab in Argus that the test failed due to a timeout:
2024-01-08 19:47:17.500: (TestTimeoutEvent Severity.CRITICAL) period_type=not-set event_id=6b125f8c-60e1-45dc-ad15-6c21b0d15b84, Test started at 2024-01-08 13:47:17, reached it's timeout (360 minute)
and, it seems, even a c-s-related error:
2024-01-08 19:48:09.758: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=c15c384e-18fb-4a72-acde-492d399ec0d1, source=UpgradeTest.test_generic_cluster_upgrade (upgrade_test.UpgradeTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 985, in test_generic_cluster_upgrade
self._run_stress_workload("stress_after_cluster_upgrade", wait_for_finish=True)
File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 911, in _run_stress_workload
self.verify_stress_thread(thread_pool)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 2091, in verify_stress_thread
results, errors = cs_thread_pool.verify_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 379, in verify_results
results = super().get_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress/base.py", line 88, in get_results
for future in concurrent.futures.as_completed(self.results_futures, timeout=timeout):
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 245, in as_completed
waiter.event.wait(wait_timeout)
File "/usr/local/lib/python3.10/threading.py", line 607, in wait
signaled = self._cond.wait(timeout)
File "/usr/local/lib/python3.10/threading.py", line 324, in wait
gotit = waiter.acquire(True, timeout)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 263, in critical_failure_handler
raise CriticalTestFailure("Critical Error has failed the test") # pylint: disable=raise-missing-from
sdcm.tester.CriticalTestFailure: Critical Error has failed the test
But all c-s logs from the loader set do have the "END" message, and the timestamps do not match the timeout. There is one suspicious c-s log (cassandra-stress-read-l1-c0-k1-bffede14-8924-4da1-9644-3876601c74e9.log) which does not end with the "freed threads" lines. It looks like it ends abruptly, but its last timestamped line is at 16:10:57,050, while the timeout reported in Argus happens at 19:47:17.500. I am assuming the clocks are not off by more than an hour, because the earliest stress logs I've found were within minutes of the start of the whole test. The times also match the read and write charts.
I am not sure how to explain this one.
To expedite fixing the problem, @Bouncheck will try:
- Rerunning the job with higher timeout (rather than the current +5%)
- Maybe adding even more logging (with custom driver)
- Maybe running tcpdump (in one run we saw one connection still having a lot of in-flight requests, and we don't know where they originated from)
- Maybe running many runs in parallel to make sure our iteration time is faster and we hit the problem more often
@roydahan fyi: @Bouncheck was/is in contact with @fruch to figure out the problem and ask him about SCT/this issue
@fruch / @Bouncheck I think that longevity-10gb-3h-gce-test is one of the jobs that reproduces it consistently. @fruch maybe we can try to reproduce it with a shorter duration.
A tcpdump of the entire test is not feasible in terms of size, so we need to find the correct spot to start and end it.
> To expedite fixing the problem, @Bouncheck will try:
> - Rerunning the job with higher timeout (rather than the current +5%)

This can be changed in: https://github.com/scylladb/scylla-cluster-tests/blob/master/sdcm/stress_thread.py#L365

> - Maybe adding even more logging (with custom driver)

Supply us a fork/branch with those, and we can build a Docker image that we can use for reproducers of this issue.

> - Maybe running tcpdump (in one run we saw one connection still having a lot of in-flight requests, and we don't know where they originated from)
> - Maybe running many runs in parallel to make sure our iteration time is faster and we hit the problem more often

The cases where it happens the most are longevity-10gb-3h-azure-test and longevity-10gb-3h-gce-test. I don't mind changing the c-s image in those two until we can figure this out.

> @roydahan fyi: @Bouncheck was/is in contact with @fruch to figure out the problem and ask him about SCT/this issue
I have trouble understanding some of the logs.
For example, in this run: https://jenkins.scylladb.com/job/scylla-5.4/job/longevity/job/longevity-10gb-3h-gce-test/9/
in loader-set-3e9f3337/cassandra-stress-write-l0-c0-k1-d967395a-8701-4848-9d4c-5225146c1a96.log,
I can see the following lines:
19:51:31,810 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@76a3e297 - Registering current configuration as safe fallback point
******************** Stress Settings ********************
Command:
Type: write
Count: -1
Duration: 180 MINUTES
No Warmup: true
Consistency Level: QUORUM
[...]
WARN 19:53:44,446 Error creating netty channel to longevity-10gb-3h-5-4-db-node-3e9f3337-0-6.c.sct-project-1.internal/10.142.0.108:9042
[...]
Total operation time : 03:00:00
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
END
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
WARN 20:42:09,958 Error creating netty channel to longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042
[...] stands in for all the in-between logs. What confuses me about those logs is that a stress command which seems to be configured for 180 minutes, and reports a 3h total running time at the end, is enclosed between timestamps that are less than 1 hour apart, assuming the timestamps are correct. Is the total running time something different from how long the command actually runs? Is there an easy way to confirm at what times the cluster was under load?
> Is the total running time something different from how long the command actually runs? Is there an easy way to confirm at what times the cluster was under load?
Yes, downloading the monitoring data and looking at the metrics would help with that:
$ hydra investigate show-monitor f4ac5c0d-4243-4afe-9c28-b9e5fe051b92
You just need to find the relevant test_id for the run you want to check out.
@Bouncheck have you made any progress on that? Next week could be a good opportunity to sit together with someone from @fruch's team (e.g. @soyacz) and try debugging it together.
Not really, but I'm not out of options yet. Sounds like a good idea, I'll reach out.
@dkropachev any idea how to push this forward? It's happening on an almost daily basis on the SCT short longevities.
@fruch, we will take care of it next week.