
cassandra-stress: counter_write workload gets stuck after writing the whole population

Open amoskong opened this issue 5 years ago • 11 comments

$ rpm -qa |grep scylla
scylla-conf-3.0.4-0.20190313.5e3a52024.el7.x86_64
scylla-tools-core-666.development-20190429.6bdb654.noarch
scylla-tools-666.development-20190429.6bdb654.noarch

The counter_write workload gets stuck if the whole population is written before the duration ends; instead of exiting when the duration expires, it hangs. It doesn't wrap around and rewrite keys the way a normal write workload does.
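For comparison, a hypothetical equivalent run with the plain write workload (same flags, only the workload swapped; not taken from this report) is expected to wrap around, keep rewriting keys, and exit cleanly when the duration ends:

$ cassandra-stress write no-warmup cl=QUORUM duration=10s -port jmx=6868 -mode cql3 native -rate threads=1 -pop seq=1..10000 -node 10.0.226.255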

[centos@ip-10-0-106-202 ~]$ cassandra-stress counter_write no-warmup cl=QUORUM duration=10s -port jmx=6868 -mode cql3 native -rate threads=1 -pop seq=1..10000  -node 10.0.226.255
...
Running COUNTER_WRITE with 1 threads 10 seconds
Failed to connect over JMX; not collecting these stats
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
total,           298,     298,     298,     298,     1.0,     1.0,     1.3,     1.7,     8.7,     8.7,    1.0,  0.00000,      0,      0,       0,       0,       0,       0
total,          1554,    1256,    1256,    1256,     0.8,     0.8,     1.0,     1.1,     1.9,     2.9,    2.0,  0.43465,      0,      0,       0,       0,       0,       0
total,          2929,    1375,    1375,    1375,     0.7,     0.7,     0.9,     1.0,     1.3,     2.4,    3.0,  0.28440,      0,      0,       0,       0,       0,       0
total,          4343,    1414,    1414,    1414,     0.7,     0.7,     0.9,     1.0,     1.7,     2.5,    4.0,  0.21069,      0,      0,       0,       0,       0,       0
total,          5733,    1390,    1390,    1390,     0.7,     0.7,     0.9,     1.0,     1.3,     1.5,    5.0,  0.16648,      0,      0,       0,       0,       0,       0
total,          7159,    1426,    1426,    1426,     0.7,     0.7,     0.9,     1.0,     1.3,     1.5,    6.0,  0.13798,      0,      0,       0,       0,       0,       0
total,          8614,    1455,    1455,    1455,     0.7,     0.7,     0.9,     1.0,     1.1,     1.2,    7.0,  0.11805,      0,      0,       0,       0,       0,       0
total,         10000,    1386,    1386,    1386,     0.7,     0.7,     0.9,     1.0,     1.5,     3.4,    8.0,  0.10273,      0,      0,       0,       0,       0,       0
<stuck>

The counter_write workload exits successfully if the duration ends before the whole population has been written.

[centos@ip-10-0-106-202 ~]$ cassandra-stress counter_write no-warmup cl=QUORUM duration=6s -port jmx=6868 -mode cql3 native -rate threads=1 -pop seq=1..10000  -node 10.0.226.255
....
Running COUNTER_WRITE with 1 threads 6 seconds
Failed to connect over JMX; not collecting these stats
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
total,           245,     245,     245,     245,     1.0,     1.0,     1.4,     2.0,     8.5,     8.5,    1.0,  0.00000,      0,      0,       0,       0,       0,       0
total,          1480,    1235,    1235,    1235,     0.8,     0.8,     1.0,     1.2,     2.6,     2.6,    2.0,  0.47184,      0,      0,       0,       0,       0,       0
total,          2809,    1329,    1329,    1329,     0.7,     0.8,     0.9,     1.1,     2.6,     2.9,    3.0,  0.30180,      0,      0,       0,       0,       0,       0
total,          4185,    1376,    1376,    1376,     0.7,     0.7,     0.9,     1.0,     1.3,     1.5,    4.0,  0.22199,      0,      0,       0,       0,       0,       0
total,          5559,    1374,    1374,    1374,     0.7,     0.7,     0.9,     1.0,     1.3,     1.6,    5.0,  0.17522,      0,      0,       0,       0,       0,       0
total,          6948,    1389,    1389,    1389,     0.7,     0.7,     0.9,     1.0,     1.5,     2.1,    6.0,  0.14484,      0,      0,       0,       0,       0,       0
total,          7982,    1438,    1438,    1438,     0.7,     0.7,     0.9,     1.0,     1.2,     1.4,    6.7,  0.12392,      0,      0,       0,       0,       0,       0


Results:
Op rate                   :    1,188 op/s  [COUNTER_WRITE: 1,188 op/s]
Partition rate            :    1,188 pk/s  [COUNTER_WRITE: 1,188 pk/s]
Row rate                  :    1,188 row/s [COUNTER_WRITE: 1,188 row/s]
Latency mean              :    0.7 ms [COUNTER_WRITE: 0.7 ms]
Latency median            :    0.7 ms [COUNTER_WRITE: 0.7 ms]
Latency 95th percentile   :    1.0 ms [COUNTER_WRITE: 1.0 ms]
Latency 99th percentile   :    1.2 ms [COUNTER_WRITE: 1.2 ms]
Latency 99.9th percentile :    2.1 ms [COUNTER_WRITE: 2.1 ms]
Latency max               :    8.5 ms [COUNTER_WRITE: 8.5 ms]
Total partitions          :      7,982 [COUNTER_WRITE: 7,982]
Total errors              :          0 [COUNTER_WRITE: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:00:06

END

@roy

amoskong avatar Jul 03 '19 12:07 amoskong

@fgelcer as you bumped into it first, please help to investigate. Can you please run it with 3.0 and compare it with 3.1?

roydahan avatar Jul 03 '19 12:07 roydahan

The stuck problem also exists with 3.0.7 (ami-012abc8d72fd276b0).

$ rpm -qa |grep scylla
scylla-libgcc73-7.3.1-1.2.el7.centos.x86_64
scylla-conf-3.0.7-0.20190624.b6fa715f7.el7.x86_64
scylla-libatomic73-7.3.1-1.2.el7.centos.x86_64
scylla-tools-core-3.0.7-20190624.24bd7f3aad.el7.noarch
scylla-jmx-3.0.7-20190624.c9dd098.el7.noarch
scylla-env-1.1-1.el7.noarch
scylla-kernel-conf-3.0.7-0.20190624.b6fa715f7.el7.x86_64
scylla-ixgbevf-4.3.6-1dkms.noarch
scylla-server-3.0.7-0.20190624.b6fa715f7.el7.x86_64
scylla-debuginfo-3.0.7-0.20190624.b6fa715f7.el7.x86_64
scylla-ena-2.0.2-2dkms.noarch
scylla-ami-3.0.7-20190624.adbc493.el7.noarch
scylla-libstdc++73-7.3.1-1.2.el7.centos.x86_64
scylla-tools-3.0.7-20190624.24bd7f3aad.el7.noarch
scylla-3.0.7-0.20190624.b6fa715f7.el7.x86_64

amoskong avatar Jul 03 '19 13:07 amoskong

@amoskong so basically, if we enlarge -n or --seq to a very high number, it will finish OK? Why not just do that?

i.e., it could be that Scylla became faster, or that we changed the cluster sizes, causing these situations; hence we didn't see this much before...

fruch avatar Jul 03 '19 13:07 fruch

@fruch, my job has seq=1..2097152 and it still gets stuck (even worse, I don't always get back the result summary of this stress job).

fgelcer avatar Jul 03 '19 13:07 fgelcer

Yes, in theory. But we have very long durations for the 4-day and 7-day longevities, so we would need a very, very big population.
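A rough back-of-the-envelope estimate using the ~1,400 op/s from the single-threaded run above (real longevity runs use more threads and loaders, so the real number would be even larger):

1,400 op/s × 86,400 s/day × 7 days ≈ 846,720,000 distinct keys

so a 7-day run would need a population on the order of seq=1..850000000 just to avoid exhausting the sequence.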

On Wed, Jul 3, 2019 at 9:31 PM Israel Fruchter wrote:

@amoskong so basically, if we enlarge -n or --seq to a very high number, it will finish OK? Why not just do that?


amoskong avatar Jul 03 '19 13:07 amoskong

Not only that, I think we may want to rewrite some of the values (TBH I don't have a clue how the counter-write works).
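For reference, a counter write in CQL is an increment rather than an overwrite (standard counter semantics, illustrated here against the stress tool's counter1 table; this statement is not taken from the thread):

cqlsh> UPDATE keyspace1.counter1 SET "C0" = "C0" + 1, "C1" = "C1" + 1 WHERE key = 0x30303031;

So in principle the seq population could simply wrap and keep incrementing the same keys, much like a normal write workload keeps overwriting them.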

roydahan avatar Jul 03 '19 13:07 roydahan

maybe this is related to #54

bentsi avatar Jul 03 '19 14:07 bentsi

@slivne can you involve or assign someone? This issue causes some of the longevities to always fail; it's a TEST blocker.

amoskong avatar Jul 10 '19 02:07 amoskong

@amoskong to make it clear that this is an issue with c-s, it will happen even with Cassandra.

Can you check that?

slivne avatar Jul 10 '19 06:07 slivne

@amoskong to make it clear that this is an issue with c-s, it will happen even with Cassandra.

Can you check that?

The stuck problem still exists with the latest Cassandra.

https://github.com/apache/cassandra.git

commit 86812fa5024d957e28f195b2c4db3813439fb2c5
Merge: d0a207b 7206ff5
Author: Blake Eggleston <[email protected]>
Date:   Mon Jul 8 15:26:22 2019 -0700

    Merge branch 'cassandra-3.11' into trunk

cqlsh> CREATE KEYSPACE IF NOT EXISTS keyspace1
   ... WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
cqlsh> 
cqlsh> CREATE TABLE IF NOT EXISTS keyspace1.counter1 (
   ...     key blob PRIMARY KEY,
   ...     "C0" counter,   
   ...     "C1" counter,   
   ...     "C2" counter,   
   ...     "C3" counter,   
   ...     "C4" counter    
   ... ) WITH bloom_filter_fp_chance = 0.01
   ...     AND comment = ''
   ...     AND compaction = {'class': 'SizeTieredCompactionStrategy'}
   ...     AND compression = {}
   ...     AND default_time_to_live = 0
   ...     AND gc_grace_seconds = 864000
   ...     AND max_index_interval = 2048
   ...     AND memtable_flush_period_in_ms = 0
   ...     AND min_index_interval = 128
   ...     AND speculative_retry = '99.0PERCENTILE';
cqlsh>


[amos@amos-centos7 apache-cassandra]$ ./tools/bin/cassandra-stress counter_write no-warmup cl=QUORUM duration=10s -port jmx=6868 -mode cql3 native -rate threads=1 -pop seq=1..1000
...
total,                                                  1000,     361,     361,     361,     2.7,     2.6,     4.9,     6.6,    10.5,    10.5,    5.0,  0.24362,      0,      0,       0,       0,       0,       0

<stuck> 

amoskong avatar Jul 10 '19 06:07 amoskong

@amoskong to make it clear that this is an issue with c-s, it will happen even with Cassandra. Can you check that?

The stuck problem still exists with the latest Cassandra.

The result only means that the problem exists with Cassandra + c-s; I'm still not sure whether it's a problem in the c-s tool, in the server, or both. In any case, it's truly an upstream issue.
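One way to narrow that down (just a suggestion, not something that was done in this thread) would be to take a thread dump of the stuck cassandra-stress JVM and check whether its worker threads are blocked waiting on a server response or parked inside the tool itself:

$ jstack <pid-of-cassandra-stress> > cs-threads.txt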

amoskong avatar Jul 10 '19 06:07 amoskong