java-driver
signalConnectionClosed() failing on `assert remaining >= 0`
Issue description
- [ ] This issue is a regression.
- [x] It is unknown if this issue is a regression.
At the end of the c-s (cassandra-stress) run we are seeing the following error:
java.lang.AssertionError
at com.datastax.driver.core.ConvictionPolicy$DefaultConvictionPolicy.signalConnectionClosed(ConvictionPolicy.java:90)
at com.datastax.driver.core.Connection.closeAsync(Connection.java:1095)
at com.datastax.driver.core.HostConnectionPool.discardAvailableConnections(HostConnectionPool.java:1011)
at com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:972)
at com.datastax.driver.core.SessionManager.closeAsync(SessionManager.java:196)
at com.datastax.driver.core.Cluster$Manager.close(Cluster.java:2067)
at com.datastax.driver.core.Cluster$Manager.access$200(Cluster.java:1636)
at com.datastax.driver.core.Cluster.closeAsync(Cluster.java:626)
at com.datastax.driver.core.Cluster.close(Cluster.java:637)
at org.apache.cassandra.stress.util.JavaDriverClient.disconnect(JavaDriverClient.java:262)
at org.apache.cassandra.stress.settings.StressSettings.disconnect(StressSettings.java:394)
at org.apache.cassandra.stress.StressAction.run(StressAction.java:98)
at org.apache.cassandra.stress.Stress.run(Stress.java:143)
at org.apache.cassandra.stress.Stress.main(Stress.java:62)
It seems to be coming from: https://github.com/scylladb/java-driver/blame/b3f3ebaf161b21e5c4840ec294595d4e4b39d9bf/driver-core/src/main/java/com/datastax/driver/core/ConvictionPolicy.java#L90
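For context, the assertion at that line appears to guard a count of the host's open connections (judging by the `remaining >= 0` check). Below is a purely illustrative sketch of the failure mode; the class and field names are hypothetical and do not claim to match the driver's actual internals. The invariant breaks whenever a connection's close is signaled more than once, or signaled for a connection that was never counted as opened; with assertions enabled in the loader JVM (as they evidently are, given the traces above), the counter goes negative and the assertion throws:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a simplified conviction-policy-like counter showing how
// an `assert remaining >= 0` invariant can break. Names are hypothetical and
// do not mirror the driver's ConvictionPolicy implementation.
class OpenConnectionCounter {
    private final AtomicInteger open = new AtomicInteger();

    void signalConnectionOpened() {
        open.incrementAndGet();
    }

    void signalConnectionClosed() {
        int remaining = open.decrementAndGet();
        // If close is signaled twice for the same connection (for example, a
        // race between Cluster.close() tearing down pools and another code
        // path having already reported the connection closed), remaining
        // dips below zero and this assertion throws when run with -ea.
        assert remaining >= 0;
    }

    public static void main(String[] args) {
        OpenConnectionCounter counter = new OpenConnectionCounter();
        counter.signalConnectionOpened();
        counter.signalConnectionClosed();
        counter.signalConnectionClosed(); // double signal -> remaining == -1 -> AssertionError with -ea
    }
}
```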
Impact
It confuses SCT and makes it fail to read the c-s summary.
How frequently does it reproduce?
We don't have a specific way to reproduce it.
Installation details
Kernel Version: 5.15.0-1035-azure
Scylla version (or git commit hash): 5.2.0~rc4-20230402.d70751fee3f9
with build-id 80951fe7ff3c6e2c268211c71a9236071ac18a35
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
- longevity-10gb-3h-5-2-db-node-446a8791-eastus-8 (172.173.226.133 | 10.0.0.10) (shards: 7)
- longevity-10gb-3h-5-2-db-node-446a8791-eastus-7 (20.115.35.75 | 10.0.0.9) (shards: 7)
- longevity-10gb-3h-5-2-db-node-446a8791-eastus-6 (20.169.164.176 | 10.0.0.10) (shards: 7)
- longevity-10gb-3h-5-2-db-node-446a8791-eastus-5 (20.169.164.162 | 10.0.0.9) (shards: 7)
- longevity-10gb-3h-5-2-db-node-446a8791-eastus-4 (172.174.45.124 | 10.0.0.8) (shards: 7)
- longevity-10gb-3h-5-2-db-node-446a8791-eastus-3 (172.174.44.252 | 10.0.0.7) (shards: 7)
- longevity-10gb-3h-5-2-db-node-446a8791-eastus-2 (172.174.44.130 | 10.0.0.6) (shards: 7)
- longevity-10gb-3h-5-2-db-node-446a8791-eastus-1 (172.174.45.65 | 10.0.0.5) (shards: 7)
OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.2.0-rc4-x86_64-2023-04-03T01-32-26
(azure: eastus)
Test: longevity-10gb-3h-azure-test
Test id: 446a8791-c9b3-4b83-b287-c39203f80216
Test name: scylla-5.2/longevity/longevity-10gb-3h-azure-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 446a8791-c9b3-4b83-b287-c39203f80216
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 446a8791-c9b3-4b83-b287-c39203f80216
Logs:
- db-cluster-446a8791.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/446a8791-c9b3-4b83-b287-c39203f80216/20230403_042309/db-cluster-446a8791.tar.gz
- sct-runner-446a8791.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/446a8791-c9b3-4b83-b287-c39203f80216/20230403_042309/sct-runner-446a8791.tar.gz
- monitor-set-446a8791.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/446a8791-c9b3-4b83-b287-c39203f80216/20230403_042309/monitor-set-446a8791.tar.gz
- loader-set-446a8791.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/446a8791-c9b3-4b83-b287-c39203f80216/20230403_042309/loader-set-446a8791.tar.gz
- parallel-timelines-report-446a8791.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/446a8791-c9b3-4b83-b287-c39203f80216/20230403_042309/parallel-timelines-report-446a8791.tar.gz
@avelanarius, any idea what could cause this assertion to fail?
I don't have any at the moment; @Lorak-mmk will look into this issue.
@avelanarius @Lorak-mmk
We ran into it again:
Installation details
Kernel Version: 5.15.0-1040-azure
Scylla version (or git commit hash): 5.2.3-20230608.ea08d409f155
with build-id ec8d1c19fc354f34c19e07e35880e0f40cc7d8cd
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
- longevity-10gb-3h-5-2-db-node-9c9bf09e-eastus-9 (74.235.168.59 | 10.0.0.8) (shards: 7)
- longevity-10gb-3h-5-2-db-node-9c9bf09e-eastus-8 (23.101.133.151 | 10.0.0.9) (shards: 7)
- longevity-10gb-3h-5-2-db-node-9c9bf09e-eastus-7 (20.121.192.18 | 10.0.0.14) (shards: 7)
- longevity-10gb-3h-5-2-db-node-9c9bf09e-eastus-6 (20.172.134.170 | 10.0.0.10) (shards: 7)
- longevity-10gb-3h-5-2-db-node-9c9bf09e-eastus-5 (172.171.220.100 | 10.0.0.9) (shards: 7)
- longevity-10gb-3h-5-2-db-node-9c9bf09e-eastus-4 (20.231.56.163 | 10.0.0.8) (shards: 7)
- longevity-10gb-3h-5-2-db-node-9c9bf09e-eastus-3 (20.124.244.146 | 10.0.0.7) (shards: 7)
- longevity-10gb-3h-5-2-db-node-9c9bf09e-eastus-2 (20.124.243.147 | 10.0.0.6) (shards: 7)
- longevity-10gb-3h-5-2-db-node-9c9bf09e-eastus-11 (20.169.162.58 | 10.0.0.8) (shards: 7)
- longevity-10gb-3h-5-2-db-node-9c9bf09e-eastus-10 (13.68.236.250 | 10.0.0.7) (shards: 7)
- longevity-10gb-3h-5-2-db-node-9c9bf09e-eastus-1 (20.124.243.124 | 10.0.0.5) (shards: 7)
OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.2.3-x86_64-2023-06-19T09-07-22
(azure: eastus)
Test: longevity-10gb-3h-azure-test
Test id: 9c9bf09e-e825-4d63-a75f-0ae2b27345b4
Test name: scylla-5.2/longevity/longevity-10gb-3h-azure-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 9c9bf09e-e825-4d63-a75f-0ae2b27345b4
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 9c9bf09e-e825-4d63-a75f-0ae2b27345b4
Logs:
- db-cluster-9c9bf09e.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/9c9bf09e-e825-4d63-a75f-0ae2b27345b4/20230619_112045/db-cluster-9c9bf09e.tar.gz
- sct-runner-events-9c9bf09e.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/9c9bf09e-e825-4d63-a75f-0ae2b27345b4/20230619_112045/sct-runner-events-9c9bf09e.tar.gz
- sct-9c9bf09e.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/9c9bf09e-e825-4d63-a75f-0ae2b27345b4/20230619_112045/sct-9c9bf09e.log.tar.gz
- monitor-set-9c9bf09e.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/9c9bf09e-e825-4d63-a75f-0ae2b27345b4/20230619_112045/monitor-set-9c9bf09e.tar.gz
- loader-set-9c9bf09e.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/9c9bf09e-e825-4d63-a75f-0ae2b27345b4/20230619_112045/loader-set-9c9bf09e.tar.gz
- parallel-timelines-report-9c9bf09e.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/9c9bf09e-e825-4d63-a75f-0ae2b27345b4/20230619_112045/parallel-timelines-report-9c9bf09e.tar.gz
@avelanarius any progress here?
Issue description
While both stress threads were healthy throughout the run, they failed right at the end of the test:
total, 650033460, 60121, 60121, 60121, 16.6, 11.3, 48.8, 83.1, 150.1, 289.1,10790.0, 0.00143, 0, 0, 0, 0, 0, 0
total, 650338896, 61087, 61087, 61087, 16.4, 9.0, 58.3, 98.4, 155.8, 234.1,10795.0, 0.00143, 0, 0, 0, 0, 0, 0
total, 650643558, 60932, 60932, 60932, 16.4, 8.2, 59.0, 99.3, 148.0, 197.4,10800.0, 0.00142, 0, 0, 0, 0, 0, 0
total, 650667067, 56268, 56268, 56268, 17.3, 14.2, 43.0, 58.6, 86.7, 110.6,10800.4, 0.00143, 0, 0, 0, 0, 0, 0
Results:
Op rate : 60,245 op/s [WRITE: 60,245 op/s]
Partition rate : 60,245 pk/s [WRITE: 60,245 pk/s]
Row rate : 60,245 row/s [WRITE: 60,245 row/s]
Latency mean : 16.6 ms [WRITE: 16.6 ms]
Latency median : 10.0 ms [WRITE: 10.0 ms]
Latency 95th percentile : 54.0 ms [WRITE: 54.0 ms]
Latency 99th percentile : 92.1 ms [WRITE: 92.1 ms]
Latency 99.9th percentile : 149.0 ms [WRITE: 149.0 ms]
Latency max : 8535.4 ms [WRITE: 8,535.4 ms]
Total partitions : 650,667,067 [WRITE: 650,667,067]
Total errors : 0 [WRITE: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 03:00:00
END
java.lang.AssertionError
at com.datastax.driver.core.ConvictionPolicy$DefaultConvictionPolicy.signalConnectionClosed(ConvictionPolicy.java:90)
at com.datastax.driver.core.Connection.closeAsync(Connection.java:1095)
at com.datastax.driver.core.HostConnectionPool.discardAvailableConnections(HostConnectionPool.java:1011)
at com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:972)
at com.datastax.driver.core.SessionManager.closeAsync(SessionManager.java:196)
at com.datastax.driver.core.Cluster$Manager.close(Cluster.java:2067)
at com.datastax.driver.core.Cluster$Manager.access$200(Cluster.java:1636)
at com.datastax.driver.core.Cluster.closeAsync(Cluster.java:626)
at com.datastax.driver.core.Cluster.close(Cluster.java:637)
at org.apache.cassandra.stress.util.JavaDriverClient.disconnect(JavaDriverClient.java:262)
at org.apache.cassandra.stress.settings.StressSettings.disconnect(StressSettings.java:394)
at org.apache.cassandra.stress.StressAction.run(StressAction.java:98)
at org.apache.cassandra.stress.Stress.run(Stress.java:143)
at org.apache.cassandra.stress.Stress.main(Stress.java:62)
total, 704367012, 65463, 65463, 65463, 15.3, 10.2, 46.7, 78.5, 125.4, 185.7,10795.0, 0.00139, 0, 0, 0, 0, 0, 0
total, 704704017, 67401, 67401, 67401, 14.8, 9.7, 45.8, 79.8, 118.4, 194.6,10800.0, 0.00139, 0, 0, 0, 0, 0, 0
total, 704768936, 63638, 63638, 63638, 15.6, 10.1, 50.2, 79.4, 116.3, 167.0,10801.0, 0.00139, 0, 0, 0, 0, 0, 0
Results:
Op rate : 65,250 op/s [WRITE: 65,250 op/s]
Partition rate : 65,250 pk/s [WRITE: 65,250 pk/s]
Row rate : 65,250 row/s [WRITE: 65,250 row/s]
Latency mean : 15.3 ms [WRITE: 15.3 ms]
Latency median : 9.2 ms [WRITE: 9.2 ms]
Latency 95th percentile : 49.7 ms [WRITE: 49.7 ms]
Latency 99th percentile : 86.6 ms [WRITE: 86.6 ms]
Latency 99.9th percentile : 142.1 ms [WRITE: 142.1 ms]
Latency max : 8548.0 ms [WRITE: 8,548.0 ms]
Total partitions : 704,768,936 [WRITE: 704,768,936]
Total errors : 0 [WRITE: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 03:00:01
END
java.lang.AssertionError
at com.datastax.driver.core.ConvictionPolicy$DefaultConvictionPolicy.signalConnectionClosed(ConvictionPolicy.java:90)
at com.datastax.driver.core.Connection.closeAsync(Connection.java:1095)
at com.datastax.driver.core.HostConnectionPool.discardAvailableConnections(HostConnectionPool.java:1011)
at com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:972)
at com.datastax.driver.core.SessionManager.closeAsync(SessionManager.java:196)
at com.datastax.driver.core.Cluster$Manager.close(Cluster.java:2067)
at com.datastax.driver.core.Cluster$Manager.access$200(Cluster.java:1636)
at com.datastax.driver.core.Cluster.closeAsync(Cluster.java:626)
at com.datastax.driver.core.Cluster.close(Cluster.java:637)
at org.apache.cassandra.stress.util.JavaDriverClient.disconnect(JavaDriverClient.java:262)
at org.apache.cassandra.stress.settings.StressSettings.disconnect(StressSettings.java:394)
at org.apache.cassandra.stress.StressAction.run(StressAction.java:98)
at org.apache.cassandra.stress.Stress.run(Stress.java:143)
at org.apache.cassandra.stress.Stress.main(Stress.java:62)
How frequently does it reproduce?
It happened in both long-running stress threads in the run.
Installation details
Kernel Version: 5.15.0-1042-azure
Scylla version (or git commit hash): 2023.1.0~rc8-20230731.b6f7c5a6910c
with build-id f6e718548e76ccf3564ed2387b6582ba8d37793c
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
- longevity-10gb-3h-2023-1-db-node-bcc0441d-eastus-9 (20.121.32.63 | 10.0.0.8) (shards: 7)
- longevity-10gb-3h-2023-1-db-node-bcc0441d-eastus-8 (20.127.14.57 | 10.0.0.9) (shards: 7)
- longevity-10gb-3h-2023-1-db-node-bcc0441d-eastus-7 (20.172.150.249 | 10.0.0.14) (shards: 7)
- longevity-10gb-3h-2023-1-db-node-bcc0441d-eastus-6 (172.178.17.8 | 10.0.0.10) (shards: 7)
- longevity-10gb-3h-2023-1-db-node-bcc0441d-eastus-5 (172.178.16.70 | 10.0.0.9) (shards: 7)
- longevity-10gb-3h-2023-1-db-node-bcc0441d-eastus-4 (74.235.172.235 | 10.0.0.8) (shards: 7)
- longevity-10gb-3h-2023-1-db-node-bcc0441d-eastus-3 (74.235.172.33 | 10.0.0.7) (shards: 7)
- longevity-10gb-3h-2023-1-db-node-bcc0441d-eastus-2 (74.235.77.246 | 10.0.0.6) (shards: 7)
- longevity-10gb-3h-2023-1-db-node-bcc0441d-eastus-11 (20.185.226.14 | 10.0.0.7) (shards: 7)
- longevity-10gb-3h-2023-1-db-node-bcc0441d-eastus-10 (172.190.170.97 | 10.0.0.10) (shards: 7)
- longevity-10gb-3h-2023-1-db-node-bcc0441d-eastus-1 (20.172.145.155 | 10.0.0.5) (shards: 7)
OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/SCYLLA-IMAGES/providers/Microsoft.Compute/images/scylla-2023.1.0-rc8-x86_64-2023-07-31T21-30-24
(azure: eastus)
Test: longevity-10gb-3h-azure-test
Test id: bcc0441d-5d7a-42d5-bb79-9e2870975688
Test name: enterprise-2023.1/longevity/longevity-10gb-3h-azure-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor bcc0441d-5d7a-42d5-bb79-9e2870975688
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs bcc0441d-5d7a-42d5-bb79-9e2870975688
Logs:
- db-cluster-bcc0441d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bcc0441d-5d7a-42d5-bb79-9e2870975688/20230807_020713/db-cluster-bcc0441d.tar.gz
- sct-runner-events-bcc0441d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bcc0441d-5d7a-42d5-bb79-9e2870975688/20230807_020713/sct-runner-events-bcc0441d.tar.gz
- sct-bcc0441d.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bcc0441d-5d7a-42d5-bb79-9e2870975688/20230807_020713/sct-bcc0441d.log.tar.gz
- monitor-set-bcc0441d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bcc0441d-5d7a-42d5-bb79-9e2870975688/20230807_020713/monitor-set-bcc0441d.tar.gz
- loader-set-bcc0441d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bcc0441d-5d7a-42d5-bb79-9e2870975688/20230807_020713/loader-set-bcc0441d.tar.gz
- parallel-timelines-report-bcc0441d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/bcc0441d-5d7a-42d5-bb79-9e2870975688/20230807_020713/parallel-timelines-report-bcc0441d.tar.gz
@ShlomiBalalis - which version of the Java driver are you using? And if they fail at the end of the test, what's the user impact? And what do we see on the nodes' logs?
As written in the original report:
> It confuses SCT and makes it fail to read the c-s summary.

It forces us to take a close look at why the stress command failed, which wastes our time, and it will certainly confuse any user of this driver. It has no effect on Scylla itself.
The version in the 2023.1 branch is 3.11.2.4.
> The version in the 2023.1 branch is 3.11.2.4.

Why not upgrade to 3.11.2.5? The list of changes (https://github.com/scylladb/java-driver/compare/3.11.2.4...3.11.2.5) is significant. Nothing that I see there would solve this issue, though.
> > The version in the 2023.1 branch is 3.11.2.4.
>
> Why not upgrade to 3.11.2.5? The list of changes (3.11.2.4...3.11.2.5) is significant. Nothing that I see there would solve this issue, though.

Well, as with any other change to a release, we need a reason to backport anything, and so far there isn't one, as you noticed.
If there were a fix for this issue in the next driver release, that would be a good reason to backport it to older releases.
@fruch - I don't fully understand: by not upgrading, we are not testing at least the following: https://github.com/scylladb/java-driver/commit/3e2d8a1766150d78bd806264ecf1e1870e0f14cf https://github.com/scylladb/java-driver/commit/bb2fcdc22384b40194becdb994906fa3a6eb0940 https://github.com/scylladb/java-driver/commit/376f03252bbee7c220aeb4f5460a55a92944b00a (and this one being the most important of them), not to mention other things.
We are going to test those first on master, like any other feature, and if it's deemed that they need more testing on top of an older ongoing release, we'll then backport both the new driver and the relevant tests for it.
Backporting it right now doesn't mean the new feature would be tested as part of the release. Anyhow, when it comes to drivers, the horses are already out once they get released.
> > The version in the 2023.1 branch is 3.11.2.4.
>
> Why not upgrade to 3.11.2.5? The list of changes (3.11.2.4...3.11.2.5) is significant. Nothing that I see there would solve this issue, though.

I think this discussion is beside the point of the issue, if upgrading won't solve it.
The way forward to debug this issue is probably to enable more logging, especially log prints like this: https://github.com/scylladb/java-driver/blob/d291df6b35f7903c0b2d935754aebcb5b35bcd81/driver-core/src/main/java/com/datastax/driver/core/ConvictionPolicy.java#L82-L83
In the follow-up message (or PR?), I'll write how to configure the logging framework to log this specific message (we don't want to enable all DEBUG logs, as this would spam the logs too much).
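Until that follow-up lands, here is a minimal sketch of how such targeted logging could be switched on, assuming the loader logs through SLF4J with a Logback backend (as stock cassandra-stress builds do) and that the driver uses the conventional class-name logger, `com.datastax.driver.core.ConvictionPolicy` (an assumption, not verified here):

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

// Sketch: raise only the ConvictionPolicy logger to DEBUG so the targeted
// log prints show up without enabling DEBUG for the whole driver.
// Assumes Logback is the SLF4J backend; the logger name is assumed to follow
// the usual class-name convention.
public final class ConvictionPolicyDebug {
    private ConvictionPolicyDebug() {}

    public static void enable() {
        Logger logger =
            (Logger) LoggerFactory.getLogger("com.datastax.driver.core.ConvictionPolicy");
        logger.setLevel(Level.DEBUG);
    }
}
```

The declarative equivalent would be a single `<logger name="com.datastax.driver.core.ConvictionPolicy" level="DEBUG"/>` entry in the loader's Logback configuration file.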
It still reproduces on master runs on Azure, not 100% of the time, but at least twice a week...
Installation details
Kernel Version: 5.15.0-1044-azure
Scylla version (or git commit hash): 5.4.0~dev-20230824.93be4c0cb0f0
with build-id 9e29ac9d5d351d94023b4d80a71e21172f311f9d
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
- longevity-10gb-3h-master-db-node-3d0a838c-eastus-7 (23.96.110.26 | 10.0.0.5) (shards: 7)
- longevity-10gb-3h-master-db-node-3d0a838c-eastus-6 (172.173.138.168 | 10.0.0.10) (shards: 7)
- longevity-10gb-3h-master-db-node-3d0a838c-eastus-5 (172.173.138.74 | 10.0.0.9) (shards: 7)
- longevity-10gb-3h-master-db-node-3d0a838c-eastus-4 (172.173.138.21 | 10.0.0.8) (shards: 7)
- longevity-10gb-3h-master-db-node-3d0a838c-eastus-3 (172.173.138.15 | 10.0.0.7) (shards: 7)
- longevity-10gb-3h-master-db-node-3d0a838c-eastus-2 (172.173.136.93 | 10.0.0.6) (shards: 7)
- longevity-10gb-3h-master-db-node-3d0a838c-eastus-1 (172.173.136.2 | 10.0.0.5) (shards: 7)
OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.4.0-dev-x86_64-2023-08-28T12-54-56
(azure: undefined_region)
Test: longevity-10gb-3h-azure-test
Test id: 3d0a838c-869d-4b73-a0b5-0be75eae9559
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 3d0a838c-869d-4b73-a0b5-0be75eae9559
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 3d0a838c-869d-4b73-a0b5-0be75eae9559
Logs:
- db-cluster-3d0a838c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/3d0a838c-869d-4b73-a0b5-0be75eae9559/20230828_150910/db-cluster-3d0a838c.tar.gz
- sct-runner-events-3d0a838c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/3d0a838c-869d-4b73-a0b5-0be75eae9559/20230828_150910/sct-runner-events-3d0a838c.tar.gz
- sct-3d0a838c.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/3d0a838c-869d-4b73-a0b5-0be75eae9559/20230828_150910/sct-3d0a838c.log.tar.gz
- loader-set-3d0a838c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/3d0a838c-869d-4b73-a0b5-0be75eae9559/20230828_150910/loader-set-3d0a838c.tar.gz
- monitor-set-3d0a838c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/3d0a838c-869d-4b73-a0b5-0be75eae9559/20230828_150910/monitor-set-3d0a838c.tar.gz
- parallel-timelines-report-3d0a838c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/3d0a838c-869d-4b73-a0b5-0be75eae9559/20230828_150910/parallel-timelines-report-3d0a838c.tar.gz
@fruch, as far as I know we keep hitting it on Azure for master. IIUC, this time it's with scylla-driver-core-3.11.2.5-shaded.jar.
Running the K8S MultiDC CI job with 6 Scylla pods/nodes (3 in each of the 2 regions), we hit this bug in about 50% of the cases.
It doesn't get hit when running a single DC with 3 nodes.
The Scylla docker image used for running cassandra-stress is 5.2.7.
Impact
False error events.
How frequently does it reproduce?
~50%
Installation details
Kernel Version: 5.10.198-187.748.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.2-20231001.646df23cc4b3
with build-id 367fcf1672d44f5cbddc88f946cf272e2551b85a
Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.12.0-alpha.0-123-g24389ae
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge) | 3 Scylla pods
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: (k8s-eks: eu-north-1, eu-west-1)
Test: longevity-scylla-operator-multidc-12h-eks
Test id: 61d76257-c01e-4e92-8908-682a75d4e7fb
Test name: scylla-operator/operator-master/eks/longevity-scylla-operator-multidc-12h-eks
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 61d76257-c01e-4e92-8908-682a75d4e7fb
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 61d76257-c01e-4e92-8908-682a75d4e7fb
Logs:
- kubernetes-61d76257.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/kubernetes-61d76257.tar.gz
- kubernetes-must-gather-61d76257.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/kubernetes-must-gather-61d76257.tar.gz
- db-cluster-61d76257.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/db-cluster-61d76257.tar.gz
- sct-runner-events-61d76257.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/sct-runner-events-61d76257.tar.gz
- sct-61d76257.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/sct-61d76257.log.tar.gz
- loader-set-61d76257.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/loader-set-61d76257.tar.gz
- monitor-set-61d76257.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/monitor-set-61d76257.tar.gz
Happening again on the multi-dc k8s run
Installation details
Kernel Version: 5.10.199-190.747.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.2-20231001.646df23cc4b3
with build-id 367fcf1672d44f5cbddc88f946cf272e2551b85a
Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.12.0-alpha.0-144-g60f7824
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: `` (k8s-eks: undefined_region)
Test: longevity-scylla-operator-multidc-12h-eks
Test id: 6c7d144e-bab4-4ad7-b3f2-8cd2c96422cb
Test name: scylla-operator/operator-master/eks/longevity-scylla-operator-multidc-12h-eks
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 6c7d144e-bab4-4ad7-b3f2-8cd2c96422cb
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 6c7d144e-bab4-4ad7-b3f2-8cd2c96422cb
Logs:
- kubernetes-6c7d144e.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c7d144e-bab4-4ad7-b3f2-8cd2c96422cb/20231217_135212/kubernetes-6c7d144e.tar.gz
- kubernetes-must-gather-6c7d144e.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c7d144e-bab4-4ad7-b3f2-8cd2c96422cb/20231217_135212/kubernetes-must-gather-6c7d144e.tar.gz
- db-cluster-6c7d144e.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c7d144e-bab4-4ad7-b3f2-8cd2c96422cb/20231217_135212/db-cluster-6c7d144e.tar.gz
- sct-runner-events-6c7d144e.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c7d144e-bab4-4ad7-b3f2-8cd2c96422cb/20231217_135212/sct-runner-events-6c7d144e.tar.gz
- sct-6c7d144e.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c7d144e-bab4-4ad7-b3f2-8cd2c96422cb/20231217_135212/sct-6c7d144e.log.tar.gz
- loader-set-6c7d144e.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c7d144e-bab4-4ad7-b3f2-8cd2c96422cb/20231217_135212/loader-set-6c7d144e.tar.gz
- monitor-set-6c7d144e.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/6c7d144e-bab4-4ad7-b3f2-8cd2c96422cb/20231217_135212/monitor-set-6c7d144e.tar.gz
And again.
@avelanarius
> In the follow-up message (or PR?), I'll write how to configure the logging framework to log this specific message (we don't want to enable all DEBUG logs, as this would spam the logs too much).

Can someone take a look at this one, supply what's needed to debug this issue, and get it solved? It has been open since April, when we first reported it.
Installation details
Kernel Version: 5.10.199-190.747.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.2-20231001.646df23cc4b3
with build-id 367fcf1672d44f5cbddc88f946cf272e2551b85a
Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.12.0-alpha.0-144-g60f7824
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: `` (k8s-eks: undefined_region)
Test: longevity-scylla-operator-3h-multitenant-eks
Test id: c9358794-630c-4607-9f59-ef831e22eb7d
Test name: scylla-operator/operator-master/eks/longevity-scylla-operator-3h-multitenant-eks
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor c9358794-630c-4607-9f59-ef831e22eb7d
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs c9358794-630c-4607-9f59-ef831e22eb7d
Logs:
- kubernetes-c9358794.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c9358794-630c-4607-9f59-ef831e22eb7d/20231218_184649/kubernetes-c9358794.tar.gz
- kubernetes-must-gather-c9358794.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c9358794-630c-4607-9f59-ef831e22eb7d/20231218_184649/kubernetes-must-gather-c9358794.tar.gz
- db-cluster-c9358794.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c9358794-630c-4607-9f59-ef831e22eb7d/20231218_184649/db-cluster-c9358794.tar.gz
- sct-runner-events-c9358794.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c9358794-630c-4607-9f59-ef831e22eb7d/20231218_184649/sct-runner-events-c9358794.tar.gz
- sct-c9358794.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c9358794-630c-4607-9f59-ef831e22eb7d/20231218_184649/sct-c9358794.log.tar.gz
- loader-set-c9358794.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c9358794-630c-4607-9f59-ef831e22eb7d/20231218_184649/loader-set-c9358794.tar.gz
- monitor-set-c9358794.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c9358794-630c-4607-9f59-ef831e22eb7d/20231218_184649/monitor-set-c9358794.tar.gz
- parallel-timelines-report-c9358794.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c9358794-630c-4607-9f59-ef831e22eb7d/20231218_184649/parallel-timelines-report-c9358794.tar.gz
Happened again on the weekly k8s run:
Installation details
Kernel Version: 5.10.201-191.748.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.2-20231001.646df23cc4b3
with build-id 367fcf1672d44f5cbddc88f946cf272e2551b85a
Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.12.0-alpha.0-144-g60f7824
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: `` (k8s-eks: undefined_region)
Test: longevity-scylla-operator-3h-multitenant-eks
Test id: cdf68a9d-3688-4538-816c-8edc1641b191
Test name: scylla-operator/operator-master/eks/longevity-scylla-operator-3h-multitenant-eks
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor cdf68a9d-3688-4538-816c-8edc1641b191
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs cdf68a9d-3688-4538-816c-8edc1641b191
Logs:
- kubernetes-cdf68a9d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/cdf68a9d-3688-4538-816c-8edc1641b191/20240107_035646/kubernetes-cdf68a9d.tar.gz
- kubernetes-must-gather-cdf68a9d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/cdf68a9d-3688-4538-816c-8edc1641b191/20240107_035646/kubernetes-must-gather-cdf68a9d.tar.gz
- db-cluster-cdf68a9d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/cdf68a9d-3688-4538-816c-8edc1641b191/20240107_035646/db-cluster-cdf68a9d.tar.gz
- sct-runner-events-cdf68a9d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/cdf68a9d-3688-4538-816c-8edc1641b191/20240107_035646/sct-runner-events-cdf68a9d.tar.gz
- sct-cdf68a9d.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/cdf68a9d-3688-4538-816c-8edc1641b191/20240107_035646/sct-cdf68a9d.log.tar.gz
- loader-set-cdf68a9d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/cdf68a9d-3688-4538-816c-8edc1641b191/20240107_035646/loader-set-cdf68a9d.tar.gz
- monitor-set-cdf68a9d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/cdf68a9d-3688-4538-816c-8edc1641b191/20240107_035646/monitor-set-cdf68a9d.tar.gz
- parallel-timelines-report-cdf68a9d.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/cdf68a9d-3688-4538-816c-8edc1641b191/20240107_035646/parallel-timelines-report-cdf68a9d.tar.gz
Happened again during longevity-schema-topology-changes-12h-test
Issue description
- [ ] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
Impact
How frequently does it reproduce?
Installation details
Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2023.1.4-20240112.12c616e7f0cf
with build-id e7263a4aa92cf866b98cf680bd68d7198c9690c0
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
- longevity-parallel-topology-schema--db-node-d0e85230-9 (18.212.238.102 | 10.12.11.200) (shards: -1)
- longevity-parallel-topology-schema--db-node-d0e85230-8 (54.163.56.74 | 10.12.11.44) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-7 (44.222.212.99 | 10.12.10.216) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-6 (34.227.221.146 | 10.12.10.171) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-5 (52.91.217.170 | 10.12.9.24) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-4 (54.197.100.149 | 10.12.9.88) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-3 (54.90.67.168 | 10.12.8.85) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-26 (54.227.158.23 | 10.12.8.143) (shards: -1)
- longevity-parallel-topology-schema--db-node-d0e85230-25 (52.204.184.227 | 10.12.8.71) (shards: -1)
- longevity-parallel-topology-schema--db-node-d0e85230-24 (54.234.237.6 | 10.12.9.61) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-23 (100.24.68.249 | 10.12.8.59) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-22 (3.90.252.67 | 10.12.11.192) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-21 (54.91.47.243 | 10.12.10.139) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-20 (54.90.254.134 | 10.12.10.184) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-2 (18.207.113.162 | 10.12.9.20) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-19 (3.91.190.56 | 10.12.11.8) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-18 (54.209.164.7 | 10.12.9.218) (shards: -1)
- longevity-parallel-topology-schema--db-node-d0e85230-17 (54.196.165.104 | 10.12.8.255) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-16 (54.160.194.119 | 10.12.10.7) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-15 (54.198.48.129 | 10.12.11.241) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-14 (54.173.31.94 | 10.12.9.251) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-13 (23.22.226.155 | 10.12.9.55) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-12 (54.85.55.103 | 10.12.11.196) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-11 (184.72.172.32 | 10.12.8.76) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-10 (34.229.81.25 | 10.12.8.113) (shards: 7)
- longevity-parallel-topology-schema--db-node-d0e85230-1 (52.23.238.92 | 10.12.11.66) (shards: 7)
OS / Image: ami-08b5f8ff1565ab9f0
(aws: undefined_region)
Test: longevity-schema-topology-changes-12h-test
Test id: d0e85230-b857-4b85-af24-1de2d886a541
Test name: enterprise-2023.1/longevity/longevity-schema-topology-changes-12h-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor d0e85230-b857-4b85-af24-1de2d886a541
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs d0e85230-b857-4b85-af24-1de2d886a541
Logs:
- db-cluster-d0e85230.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/db-cluster-d0e85230.tar.gz
- email_data-d0e85230.json.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/email_data-d0e85230.json.tar.gz
- output-d0e85230.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/output-d0e85230.log.tar.gz
- debug-d0e85230.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/debug-d0e85230.log.tar.gz
- events-d0e85230.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/events-d0e85230.log.tar.gz
- normal-d0e85230.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/normal-d0e85230.log.tar.gz
- argus-d0e85230.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/argus-d0e85230.log.tar.gz
- raw_events-d0e85230.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/raw_events-d0e85230.log.tar.gz
- critical-d0e85230.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/critical-d0e85230.log.tar.gz
- warning-d0e85230.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/warning-d0e85230.log.tar.gz
- summary-d0e85230.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/summary-d0e85230.log.tar.gz
- left_processes-d0e85230.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/left_processes-d0e85230.log.tar.gz
- error-d0e85230.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/error-d0e85230.log.tar.gz
- sct-d0e85230.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/sct-d0e85230.log.tar.gz
- loader-set-d0e85230.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/loader-set-d0e85230.tar.gz
- monitor-set-d0e85230.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d0e85230-b857-4b85-af24-1de2d886a541/20240113_102817/monitor-set-d0e85230.tar.gz
@avelanarius - any updates?
@avelanarius, maybe this can be moved to @Bouncheck, who deals with other java-driver issues and can still make it into Sprint 2?
It happened again on a 2023.1.5 run:
Feb 14, 2024 4:37:32 PM com.google.common.util.concurrent.AggregateFuture log
SEVERE: Input Future failed with Error
java.lang.AssertionError
at com.datastax.driver.core.ConvictionPolicy$DefaultConvictionPolicy.signalConnectionClosed(ConvictionPolicy.java:90)
at com.datastax.driver.core.Connection.closeAsync(Connection.java:1095)
at com.datastax.driver.core.ControlConnection.onHostGone(ControlConnection.java:1122)
at com.datastax.driver.core.ControlConnection.onRemove(ControlConnection.java:1112)
at com.datastax.driver.core.Cluster$Manager.onRemove(Cluster.java:2503)
at com.datastax.driver.core.Cluster$Manager.access$1400(Cluster.java:1560)
at com.datastax.driver.core.Cluster$Manager$NodeRefreshRequestDeliveryCallback$4.runMayThrow(Cluster.java:3235)
at com.datastax.driver.core.ExceptionCatchingRunnable.run(ExceptionCatchingRunnable.java:32)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
total, 254265150, 14095, 14095, 14095, 2.8, 1.4, 6.3, 28.6, 116.0, 195.8,17080.0, 0.04657, 0, 0, 0, 0, 0, 0
Packages
Scylla version: 2023.1.5-20240213.08fd6aec7a43
with build-id 448979e99e198eeab4a3b0e1b929397d337d2724
Kernel Version: 5.15.0-1053-aws
Installation details
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
- longevity-parallel-topology-schema--db-node-d6a1f069-9 (3.208.19.118 | 10.12.10.155) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-8 (3.88.99.101 | 10.12.8.219) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-7 (34.229.149.141 | 10.12.9.35) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-6 (54.174.66.96 | 10.12.11.24) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-5 (34.228.68.118 | 10.12.8.18) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-4 (34.207.148.169 | 10.12.10.43) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-3 (107.22.67.56 | 10.12.10.185) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-23 (34.230.91.117 | 10.12.11.38) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-22 (34.224.38.91 | 10.12.10.96) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-21 (54.226.154.223 | 10.12.8.155) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-20 (3.89.207.5 | 10.12.8.130) (shards: -1)
- longevity-parallel-topology-schema--db-node-d6a1f069-2 (54.163.68.110 | 10.12.9.238) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-19 (34.204.9.138 | 10.12.8.69) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-18 (34.230.21.14 | 10.12.10.223) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-17 (54.157.211.8 | 10.12.11.188) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-16 (3.80.23.111 | 10.12.9.16) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-15 (54.237.198.134 | 10.12.9.30) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-14 (3.95.212.234 | 10.12.10.4) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-13 (34.229.85.228 | 10.12.10.243) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-12 (18.207.156.115 | 10.12.8.24) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-11 (54.226.191.147 | 10.12.11.170) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-10 (54.173.57.133 | 10.12.10.116) (shards: 7)
- longevity-parallel-topology-schema--db-node-d6a1f069-1 (54.160.228.203 | 10.12.8.251) (shards: 7)
OS / Image: ami-07dcd58abd440d69d
(aws: undefined_region)
Test: longevity-schema-topology-changes-12h-test
Test id: d6a1f069-382b-4bec-9299-0d0dd507101e
Test name: enterprise-2023.1/longevity/longevity-schema-topology-changes-12h-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor d6a1f069-382b-4bec-9299-0d0dd507101e
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs d6a1f069-382b-4bec-9299-0d0dd507101e
Logs:
- db-cluster-d6a1f069.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d6a1f069-382b-4bec-9299-0d0dd507101e/20240215_001508/db-cluster-d6a1f069.tar.gz
- sct-runner-events-d6a1f069.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d6a1f069-382b-4bec-9299-0d0dd507101e/20240215_001508/sct-runner-events-d6a1f069.tar.gz
- sct-d6a1f069.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d6a1f069-382b-4bec-9299-0d0dd507101e/20240215_001508/sct-d6a1f069.log.tar.gz
- loader-set-d6a1f069.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d6a1f069-382b-4bec-9299-0d0dd507101e/20240215_001508/loader-set-d6a1f069.tar.gz
- monitor-set-d6a1f069.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d6a1f069-382b-4bec-9299-0d0dd507101e/20240215_001508/monitor-set-d6a1f069.tar.gz
Reproduced
WARN [cluster1-nio-worker-5] 2024-08-17 16:52:18,060 DefaultPromise.java:593 - An exception was thrown by com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete()
java.lang.AssertionError: null
at com.datastax.driver.core.ConvictionPolicy$DefaultConvictionPolicy.signalConnectionFailure(ConvictionPolicy.java:101)
at com.datastax.driver.core.Connection.defunct(Connection.java:812)
at com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete(Connection.java:1667)
at com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete(Connection.java:1657)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:625)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:105)
at com.datastax.shaded.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
at com.datastax.shaded.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:1164)
at com.datastax.shaded.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:755)
at com.datastax.shaded.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:731)
at com.datastax.shaded.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:620)
at com.datastax.shaded.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.closeOnRead(AbstractNioByteChannel.java:105)
at com.datastax.shaded.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:174)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
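Worth noting: unlike the earlier traces, this one fails in signalConnectionFailure (ConvictionPolicy.java:101) from a Netty channel-close listener rather than from Cluster.close(), so the same connection is apparently being reported to the conviction policy through more than one teardown path. Purely as an illustration of that kind of race (the names below are hypothetical and this is not the driver's code, just one possible direction), an idempotency guard per connection would keep concurrent close/failure signals from decrementing a shared counter twice:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: make the close/failure signal idempotent per connection,
// so pool shutdown, host removal, and the channel-close listener cannot all
// decrement the same per-host counter. Hypothetical names, not driver code.
class TrackedConnection {
    private final AtomicBoolean signaled = new AtomicBoolean(false);
    private final AtomicInteger hostOpenCount;

    TrackedConnection(AtomicInteger hostOpenCount) {
        this.hostOpenCount = hostOpenCount;
        hostOpenCount.incrementAndGet();
    }

    void signalClosedOrFailed() {
        // Only the first teardown path wins; later signals become no-ops.
        if (signaled.compareAndSet(false, true)) {
            int remaining = hostOpenCount.decrementAndGet();
            assert remaining >= 0;
        }
    }
}
```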
Packages
Scylla version: 6.2.0~dev-20240816.afee3924b3dc
with build-id c01d2a55a9631178e3fbad3869c20ef3c8dcf293
Kernel Version: 6.8.0-1013-aws
Issue description
- [ ] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
Impact
How frequently does it reproduce?
Installation details
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
- longevity-parallel-topology-schema--db-node-38b8ee85-9 (34.243.172.21 | 10.4.10.40) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-8 (34.241.87.177 | 10.4.9.153) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-7 (3.248.146.218 | 10.4.11.3) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-65 (52.30.194.155 | 10.4.10.20) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-64 (54.75.203.188 | 10.4.10.126) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-63 (52.50.49.24 | 10.4.8.120) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-62 (18.203.32.147 | 10.4.9.93) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-61 (52.31.140.46 | 10.4.10.138) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-60 (54.73.163.60 | 10.4.11.187) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-6 (176.34.188.246 | 10.4.9.179) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-59 (63.33.142.5 | 10.4.11.147) (shards: -1)
- longevity-parallel-topology-schema--db-node-38b8ee85-58 (99.80.53.93 | 10.4.9.97) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-57 (54.246.102.229 | 10.4.11.18) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-56 (54.246.114.254 | 10.4.9.231) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-55 (54.228.229.218 | 10.4.11.135) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-54 (54.228.128.54 | 10.4.8.54) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-53 (54.220.76.170 | 10.4.8.112) (shards: -1)
- longevity-parallel-topology-schema--db-node-38b8ee85-52 (54.228.155.196 | 10.4.10.62) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-51 (54.155.42.214 | 10.4.10.130) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-50 (18.202.172.255 | 10.4.9.84) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-5 (54.194.167.67 | 10.4.9.98) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-49 (34.253.77.130 | 10.4.8.112) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-48 (52.50.119.131 | 10.4.10.10) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-47 (54.247.107.226 | 10.4.10.233) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-46 (63.34.140.249 | 10.4.10.89) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-45 (52.208.115.57 | 10.4.8.139) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-44 (108.129.36.151 | 10.4.9.85) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-43 (108.129.54.0 | 10.4.9.8) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-42 (46.51.181.179 | 10.4.11.99) (shards: -1)
- longevity-parallel-topology-schema--db-node-38b8ee85-41 (52.49.169.211 | 10.4.10.106) (shards: -1)
- longevity-parallel-topology-schema--db-node-38b8ee85-40 (52.51.11.85 | 10.4.8.32) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-4 (79.125.120.229 | 10.4.10.190) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-39 (54.73.81.149 | 10.4.10.154) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-38 (52.212.235.107 | 10.4.8.34) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-37 (54.78.206.207 | 10.4.10.23) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-36 (34.249.82.167 | 10.4.10.253) (shards: -1)
- longevity-parallel-topology-schema--db-node-38b8ee85-35 (54.74.152.7 | 10.4.9.250) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-34 (52.16.207.166 | 10.4.8.26) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-33 (108.128.171.193 | 10.4.8.255) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-32 (54.217.75.3 | 10.4.9.97) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-31 (52.51.9.38 | 10.4.10.234) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-30 (54.74.38.195 | 10.4.8.201) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-3 (176.34.236.69 | 10.4.11.232) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-29 (34.243.65.171 | 10.4.9.35) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-28 (18.200.83.15 | 10.4.8.114) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-27 (54.217.206.13 | 10.4.8.32) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-26 (54.194.3.157 | 10.4.9.254) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-25 (46.137.121.193 | 10.4.8.33) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-24 (63.34.81.137 | 10.4.10.200) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-23 (52.48.243.97 | 10.4.10.153) (shards: -1)
- longevity-parallel-topology-schema--db-node-38b8ee85-22 (79.125.122.56 | 10.4.8.30) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-21 (54.220.98.27 | 10.4.10.169) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-20 (52.19.173.79 | 10.4.10.146) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-2 (34.240.255.213 | 10.4.11.26) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-19 (34.240.190.131 | 10.4.9.22) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-18 (54.247.95.169 | 10.4.9.82) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-17 (34.251.173.31 | 10.4.9.17) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-16 (34.252.201.75 | 10.4.10.157) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-15 (52.211.65.141 | 10.4.9.175) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-14 (34.241.10.58 | 10.4.8.146) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-13 (52.30.1.91 | 10.4.11.144) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-12 (52.16.190.194 | 10.4.9.176) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-11 (52.211.210.182 | 10.4.11.78) (shards: -1)
- longevity-parallel-topology-schema--db-node-38b8ee85-10 (52.211.160.16 | 10.4.10.201) (shards: 7)
- longevity-parallel-topology-schema--db-node-38b8ee85-1 (46.137.180.83 | 10.4.9.26) (shards: 7)
OS / Image: ami-0f440d7175113787f
(aws: undefined_region)
Test: longevity-schema-topology-changes-12h-test
Test id: 38b8ee85-76c8-46ba-8f22-7964bec999fa
Test name: scylla-master/tier1/longevity-schema-topology-changes-12h-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 38b8ee85-76c8-46ba-8f22-7964bec999fa
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 38b8ee85-76c8-46ba-8f22-7964bec999fa
Logs:
- core.scylla-longevity-parallel-topology-schema--db-node-38b8ee85-57-2024-08-18_01-01-29.gz - https://storage.cloud.google.com/upload.scylladb.com/core.scylla.106.ff9e01ff74f843c8a0e1e92d112ebe05.6646.1723942560000000./core.scylla.106.ff9e01ff74f843c8a0e1e92d112ebe05.6646.1723942560000000.zst
- db-cluster-38b8ee85.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/38b8ee85-76c8-46ba-8f22-7964bec999fa/20240818_033321/db-cluster-38b8ee85.tar.gz
- sct-runner-events-38b8ee85.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/38b8ee85-76c8-46ba-8f22-7964bec999fa/20240818_033321/sct-runner-events-38b8ee85.tar.gz
- sct-38b8ee85.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/38b8ee85-76c8-46ba-8f22-7964bec999fa/20240818_033321/sct-38b8ee85.log.tar.gz
- loader-set-38b8ee85.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/38b8ee85-76c8-46ba-8f22-7964bec999fa/20240818_033321/loader-set-38b8ee85.tar.gz
- monitor-set-38b8ee85.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/38b8ee85-76c8-46ba-8f22-7964bec999fa/20240818_033321/monitor-set-38b8ee85.tar.gz
I was also able to reproduce it. Since @juliayakovlev wrote an extensive comment with all the details about this issue, I won't repeat it.