scylla-tools-java
cassandra-stress can keep running even after a thread has failed
Steps to reproduce:
- Run c-s with 40 threads:
cassandra-stress read cl=QUORUM duration=240m -schema keyspace=keyspace1 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=40 -pop seq=1..20971520 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5 -node 10.0.2.221 -errors skip-unsupported-columns
- Make one thread fail; in this test a thread failed due to a CQL QUORUM consistency error
Result:
c-s hung for 1 hour until it produced:
FAILURE
java.lang.RuntimeException: Failed to execute stress action
at org.apache.cassandra.stress.StressAction.run(StressAction.java:101)
at org.apache.cassandra.stress.Stress.run(Stress.java:143)
at org.apache.cassandra.stress.Stress.main(Stress.java:62)
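The hang-then-fail behavior above is consistent with a coordinator that waits out the full duration (or a long timeout) before noticing a dead worker. Below is a minimal, hypothetical Java sketch (not the actual StressAction code; class and method names are invented for illustration) of the difference: a shared failure flag checked while waiting lets the coordinator abort as soon as any worker fails, instead of blocking until the run ends.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical fail-fast coordinator sketch. A coordinator that only
// join()s each worker in turn can sit blocked for the whole configured
// duration when one worker dies; recording the failure in a shared flag
// and polling it while waiting on a latch surfaces the error promptly.
public class FailFastCoordinator {
    public static boolean runWorkers(int n) throws InterruptedException {
        AtomicBoolean failed = new AtomicBoolean(false);
        CountDownLatch done = new CountDownLatch(n);
        ExecutorService pool = Executors.newFixedThreadPool(n);
        for (int i = 0; i < n; i++) {
            final int id = i;
            pool.execute(() -> {
                try {
                    // Simulate one worker hitting a fatal error (e.g. a CQL
                    // consistency failure); the others complete normally.
                    if (id == 3) throw new RuntimeException("simulated CQL error");
                } catch (RuntimeException e) {
                    failed.set(true); // record the failure for the coordinator
                } finally {
                    done.countDown();
                }
            });
        }
        // Poll the failure flag while waiting, so a failed worker is
        // noticed within ~100ms rather than after the full run duration.
        while (!done.await(100, TimeUnit.MILLISECONDS)) {
            if (failed.get()) break;
        }
        pool.shutdownNow();
        return !failed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runWorkers(8) ? "OK" : "FAILURE");
    }
}
```

With the simulated error in worker 3, `runWorkers(8)` returns false and main prints `FAILURE` almost immediately, which is the behavior the report argues for instead of a one-hour hang.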
Test-id: 6bb58cd8-dd28-4afd-8a0d-dbc73e2489a4
Another occurrence, with debug output: cassandra-stress-l0-c0-k1-01665285-0ef1-408f-9325-484098e432a4.log
This happened during testing of 2023.1.
Installation details
Kernel Version: 5.15.0-1036-aws
Scylla version (or git commit hash): 2023.1.0~rc6-20230517.ca8d6a0d4fa7 with build-id 3c3e22ad787d01bbfda9da05aa4a62beb1004157
Cluster size: 3 nodes (i3en.large)
Scylla Nodes used in this run:
- longevity-schemachanges-3h-2023-1-db-node-7db11cad-3 (34.242.98.148 | 10.4.1.139) (shards: 2)
- longevity-schemachanges-3h-2023-1-db-node-7db11cad-2 (52.16.26.237 | 10.4.3.43) (shards: 2)
- longevity-schemachanges-3h-2023-1-db-node-7db11cad-1 (54.247.60.78 | 10.4.1.101) (shards: 2)
OS / Image: ami-094190108e73c7d8e (aws: eu-west-1)
Test: longevity-schema-changes-3h-test
Test id: 7db11cad-2048-48e0-8e19-c416184fa6d2
Test name: enterprise-2023.1/SCT_Enterprise_Features/audit/longevity-schema-changes-3h-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 7db11cad-2048-48e0-8e19-c416184fa6d2 - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 7db11cad-2048-48e0-8e19-c416184fa6d2
Logs:
- db-cluster-7db11cad.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/db-cluster-7db11cad.tar.gz
- sct-runner-events-7db11cad.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/sct-runner-events-7db11cad.tar.gz
- sct-7db11cad.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/sct-7db11cad.log.tar.gz
- monitor-set-7db11cad.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/monitor-set-7db11cad.tar.gz
- loader-set-7db11cad.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/loader-set-7db11cad.tar.gz
- parallel-timelines-report-7db11cad.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/parallel-timelines-report-7db11cad.tar.gz
This also happened in a multi-DC case: https://github.com/scylladb/scylladb/issues/13667
It seems to happen when many errors occur during the run.
@mykaul can you please help us assign this issue? It makes our longevity runs hard to investigate.
@roydahan, @mykaul, I will take a look at it.
@dkropachev any chance you looked at this one?