scylla-tools-java
cassandra-stress can keep running even after a thread has failed
Steps to reproduce:
- Run c-s with 40 threads:
cassandra-stress read cl=QUORUM duration=240m -schema keyspace=keyspace1 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=40 -pop seq=1..20971520 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5 -node 10.0.2.221 -errors skip-unsupported-columns
- Make one thread fail; in this test a thread failed due to a CQL QUORUM consistency error
Result:
c-s hung for 1 hour until it produced:
FAILURE
java.lang.RuntimeException: Failed to execute stress action
at org.apache.cassandra.stress.StressAction.run(StressAction.java:101)
at org.apache.cassandra.stress.Stress.run(Stress.java:143)
at org.apache.cassandra.stress.Stress.main(Stress.java:62)
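The hang-then-fail behavior above is consistent with a coordinator that waits out the full duration (or a long timeout) before noticing a dead worker. Below is a minimal, hypothetical Java sketch (not the actual StressAction code; class and method names are invented for illustration) of the difference: a shared failure flag checked while waiting lets the coordinator abort as soon as any worker fails, instead of blocking until the run ends.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical fail-fast coordinator sketch. A coordinator that only
// join()s each worker in turn can sit blocked for the whole configured
// duration when one worker dies; recording the failure in a shared flag
// and polling it while waiting on a latch surfaces the error promptly.
public class FailFastCoordinator {
    public static boolean runWorkers(int n) throws InterruptedException {
        AtomicBoolean failed = new AtomicBoolean(false);
        CountDownLatch done = new CountDownLatch(n);
        ExecutorService pool = Executors.newFixedThreadPool(n);
        for (int i = 0; i < n; i++) {
            final int id = i;
            pool.execute(() -> {
                try {
                    // Simulate one worker hitting a fatal error (e.g. a CQL
                    // consistency failure); the others complete normally.
                    if (id == 3) throw new RuntimeException("simulated CQL error");
                } catch (RuntimeException e) {
                    failed.set(true); // record the failure for the coordinator
                } finally {
                    done.countDown();
                }
            });
        }
        // Poll the failure flag while waiting, so a failed worker is
        // noticed within ~100ms rather than after the full run duration.
        while (!done.await(100, TimeUnit.MILLISECONDS)) {
            if (failed.get()) break;
        }
        pool.shutdownNow();
        return !failed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runWorkers(8) ? "OK" : "FAILURE");
    }
}
```

With the simulated error in worker 3, `runWorkers(8)` returns false and main prints `FAILURE` almost immediately, which is the behavior the report argues for instead of a one-hour hang.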
Test-id: 6bb58cd8-dd28-4afd-8a0d-dbc73e2489a4
Another occurrence, with debug output: cassandra-stress-l0-c0-k1-01665285-0ef1-408f-9325-484098e432a4.log
This happened during testing of 2023.1.
Installation details
Kernel Version: 5.15.0-1036-aws
Scylla version (or git commit hash): 2023.1.0~rc6-20230517.ca8d6a0d4fa7 with build-id 3c3e22ad787d01bbfda9da05aa4a62beb1004157
Cluster size: 3 nodes (i3en.large)
Scylla Nodes used in this run:
- longevity-schemachanges-3h-2023-1-db-node-7db11cad-3 (34.242.98.148 | 10.4.1.139) (shards: 2)
- longevity-schemachanges-3h-2023-1-db-node-7db11cad-2 (52.16.26.237 | 10.4.3.43) (shards: 2)
- longevity-schemachanges-3h-2023-1-db-node-7db11cad-1 (54.247.60.78 | 10.4.1.101) (shards: 2)
OS / Image: ami-094190108e73c7d8e (aws: eu-west-1)
Test: longevity-schema-changes-3h-test
Test id: 7db11cad-2048-48e0-8e19-c416184fa6d2
Test name: enterprise-2023.1/SCT_Enterprise_Features/audit/longevity-schema-changes-3h-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 7db11cad-2048-48e0-8e19-c416184fa6d2 - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 7db11cad-2048-48e0-8e19-c416184fa6d2
Logs:
- db-cluster-7db11cad.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/db-cluster-7db11cad.tar.gz
- sct-runner-events-7db11cad.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/sct-runner-events-7db11cad.tar.gz
- sct-7db11cad.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/sct-7db11cad.log.tar.gz
- monitor-set-7db11cad.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/monitor-set-7db11cad.tar.gz
- loader-set-7db11cad.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/loader-set-7db11cad.tar.gz
- parallel-timelines-report-7db11cad.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/parallel-timelines-report-7db11cad.tar.gz
This also happened in a multi-DC case: https://github.com/scylladb/scylladb/issues/13667
It seems to happen when many errors occur during the run.
@mykaul can you please help us assign this issue? It makes our longevity runs hard to investigate.
@roydahan, @mykaul, I will take a look at it.
@dkropachev any chance you looked at this one?