scylla-tools-java RuntimeException: Timed out waiting for a timer thread

$ rpm -qa |grep scylla
scylla-tools-2.3.1-20181021.823346d3b0.el7.noarch
scylla-conf-2.3.1-0.20181021.336c77166.el7.x86_64
scylla-tools-core-2.3.1-20181021.823346d3b0.el7.noarch

Steps:

1. Create a cluster with 4 nodes.
2. Fill data: cassandra-stress user no-warmup profile=/tmp/sst3_schema.yaml ops'(insert=1)' cl=QUORUM n=10000000 -rate threads=1000 -pop seq=1..10000000
3. Verify data with read1: cassandra-stress user no-warmup profile=/tmp/sst3_schema.yaml ops'(read1=1)' cl=ALL n=10000000 -rate threads=1000 -pop seq=1..10000000
4. Verify data with a mixed workload: cassandra-stress user no-warmup profile=/tmp/sst3_schema.yaml ops'(read1=1,read2=1,update_static=1,update_ttl=1,update_diff1_ts=1,update_diff2_ts=1,update_same1_ts=1,update_same2_ts=1)' cl=ALL n=10000000 -rate threads=200 -pop seq=1..10000000

The problem can be reproduced with either of these ops settings in the mixed-workload command:

ops'(read1=1,read2=1,update_static=1,update_ttl=1,update_diff1_ts=1,update_diff2_ts=1,update_same1_ts=1,update_same2_ts=1,alter_table=0)'
ops'(read1=1,read2=1,update_static=1,update_ttl=1,update_diff1_ts=1,update_diff2_ts=1,update_same1_ts=1,update_same2_ts=1,alter_table=1)'

If I remove alter_table=* from the ops parameter, the problem can't be reproduced:

ops'(read1=1,read2=1,update_static=1,update_ttl=1,update_diff1_ts=1,update_diff2_ts=1,update_same1_ts=1,update_same2_ts=1)'
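The three ops variants above differ only in the alter_table entry, so they can be generated from one helper and run back to back. This is a minimal sketch: `build_stress_cmd` and `BASE_OPS` are hypothetical names, and the flags and values are copied from the commands in this report rather than taken from any other configuration.

```python
import shlex

# Operations common to every variant in this report.
BASE_OPS = ["read1", "read2", "update_static", "update_ttl",
            "update_diff1_ts", "update_diff2_ts",
            "update_same1_ts", "update_same2_ts"]


def build_stress_cmd(extra_ops=None, node="10.240.0.125"):
    """Return the cassandra-stress argv for the mixed workload.

    extra_ops: optional dict of op name -> ratio appended to the base ops,
    for example {"alter_table": 0}.
    """
    ratios = {name: 1 for name in BASE_OPS}
    ratios.update(extra_ops or {})
    ops = "ops(" + ",".join(f"{k}={v}" for k, v in ratios.items()) + ")"
    return ["cassandra-stress", "user", "no-warmup",
            "profile=/tmp/sst3_schema.yaml", ops, "cl=ALL", "n=10000000",
            "-rate", "threads=200", "-pop", "seq=1..10000000",
            "-node", node]


# Labels reflect the outcomes reported in this issue.
variants = {
    "alter_table=0 (fails)": build_stress_cmd({"alter_table": 0}),
    "alter_table=1 (fails)": build_stress_cmd({"alter_table": 1}),
    "no alter_table (passes)": build_stress_cmd(),
}
for label, argv in variants.items():
    # shlex.join quotes the ops(...) argument so the line can be pasted into a shell.
    print(label, "->", shlex.join(argv))
```

Each argv list can be passed to `subprocess.run` unchanged, which avoids the shell quoting around `ops(...)` altogether.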

Attached the complete schema file: sst3_schema.yaml.txt

$ cassandra-stress user no-warmup profile=/tmp/sst3_schema.yaml  ops'(read1=1,read2=1,update_static=1,update_ttl=1,update_diff1_ts=1,update_diff2_ts=1,update_same1_ts=1,update_same2_ts=1,alter_table=0)' cl=ALL n=10000000 -rate threads=200 -pop seq=1..10000000 -node 10.240.0.125
WARN  13:09:06 Only 49322 MB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots
Connected to cluster: rolling-upgrade-l-ak-30-db-cluster-aecb964b, max pending requests per connection 128, max connections per host 8
Datatacenter: datacenter1; Host: /10.240.0.143; Rack: rack1
Datatacenter: datacenter1; Host: /10.240.0.139; Rack: rack1
Datatacenter: datacenter1; Host: /10.240.0.140; Rack: rack1
Datatacenter: datacenter1; Host: /10.240.0.125; Rack: rack1
Created schema. Sleeping 1s for propagation.
Created extra schema. Sleeping 1s for propagation.
Sleeping 2s...
Running [read1, read2, update_static, update_ttl, update_diff1_ts, update_diff2_ts, update_same1_ts, update_same2_ts, alter_table] with 200 threads for 10000000 iteration
Failed to connect over JMX; not collecting these stats
type,      total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
java.lang.RuntimeException: Timed out waiting for a timer thread - seems one got stuck. Check GC/Heap size
        at org.apache.cassandra.stress.util.Timing.snap(Timing.java:98)
        at org.apache.cassandra.stress.StressMetrics.update(StressMetrics.java:156)
        at org.apache.cassandra.stress.StressMetrics.access$300(StressMetrics.java:37)
        at org.apache.cassandra.stress.StressMetrics$2.run(StressMetrics.java:104)
        at java.lang.Thread.run(Thread.java:748)
FAILURE
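
The trace shows the exception raised from Timing.snap, called from the StressMetrics update path, and the message says it timed out waiting for a timer thread. A common pattern behind such a message is a reporter that asks every worker to check in and gives up after a deadline. The sketch below illustrates that pattern in Python; it is an assumption about the general mechanism, not the code in Timing.java, and `CountDownLatch`, `snap`, and `worker_delays` are hypothetical names.

```python
import threading
import time


class CountDownLatch:
    """Minimal latch: wait() returns True once count_down() has run `count` times."""

    def __init__(self, count):
        self._count = count
        self._cond = threading.Condition()

    def count_down(self):
        with self._cond:
            self._count -= 1
            if self._count <= 0:
                self._cond.notify_all()

    def wait(self, timeout):
        with self._cond:
            # wait_for returns the predicate's final value: False on timeout.
            return self._cond.wait_for(lambda: self._count <= 0, timeout=timeout)


def snap(worker_delays, timeout):
    """Ask every worker to report; raise if any is still busy after `timeout` seconds.

    worker_delays: seconds each simulated worker needs before it can report.
    """
    latch = CountDownLatch(len(worker_delays))

    def worker(delay):
        time.sleep(delay)   # stands in for whatever operation keeps the worker busy
        latch.count_down()  # the worker checks in with its timing data

    for delay in worker_delays:
        threading.Thread(target=worker, args=(delay,), daemon=True).start()

    if not latch.wait(timeout):
        raise RuntimeError("Timed out waiting for a timer thread")
    return len(worker_delays)
```

In this model a single worker that stays busy past the deadline is enough to fail the whole snapshot, whatever the other workers are doing.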

insert:
  partitions: fixed(1)
  select:    fixed(1)/1000
  batchtype: UNLOGGED

queries:
  read1:
    cql: select * from user_with_ck where key = ? LIMIT 1
    fields: samerow
  read2:
    cql: select * from user_with_ck where key = ? and md5 = ? LIMIT 1
    fields: samerow
  update_static:
    cql: update user_with_ck USING TTL 5 set static_int = ? where key = ?
    fields: samerow
  update_ttl:
    cql: update user_with_ck USING TTL 5 set check_date = ? where key = ? and md5 = ?
    fields: samerow
  update_diff1_ts:
    cql: update user_with_ck USING TIMESTAMP 10 set check_date = ? where key = ? and md5 = ?
    fields: samerow
  update_diff2_ts:
    cql: update user_with_ck USING TIMESTAMP 5 set pass_date = ? where key = ? and md5 = ?
    fields: samerow
  update_same1_ts:
    cql: update user_with_ck USING TIMESTAMP 10 set check_date = '2018-01-01T11:21:59.001+0000' where key = ? and md5 = ?
    fields: samerow
  update_same2_ts:
    cql: update user_with_ck USING TIMESTAMP 10 set pass_date = '2018-01-01T11:21:59.001+0000' where key = ? and md5 = ?
    fields: samerow
  alter_table:
    cql: ALTER TABLE user_with_ck WITH comment = 'updated'
    fields: samerow
  delete_row:
    cql: delete from user_with_ck where key = ? and md5 = ?
    fields: samerow

Jan 15 '19 14:01 amoskong

The Grafana snapshot (from when c-s started until it failed): fireshot capture 112 - grafana - scylla per server metrics n_ - http___35 231 20 103_3000_dashboar

Jan 15 '19 14:01 amoskong

I think this is a cassandra-stress (c-s) issue. I assume it will also fail on Cassandra; can we test that?

I don't think this is a regression, i.e. there was no earlier release in which this worked.

Jan 22 '19 13:01 slivne
