cassandra-stress: counter_write workload gets stuck after writing the whole population
$ rpm -qa |grep scylla
scylla-conf-3.0.4-0.20190313.5e3a52024.el7.x86_64
scylla-tools-core-666.development-20190429.6bdb654.noarch
scylla-tools-666.development-20190429.6bdb654.noarch
The counter_write workload gets stuck after the duration ends if the whole population has already been written. It does not wrap around and rewrite the population the way the normal write workload does.
[centos@ip-10-0-106-202 ~]$ cassandra-stress counter_write no-warmup cl=QUORUM duration=10s -port jmx=6868 -mode cql3 native -rate threads=1 -pop seq=1..10000 -node 10.0.226.255
...
Running COUNTER_WRITE with 1 threads 10 seconds
Failed to connect over JMX; not collecting these stats
type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb
total, 298, 298, 298, 298, 1.0, 1.0, 1.3, 1.7, 8.7, 8.7, 1.0, 0.00000, 0, 0, 0, 0, 0, 0
total, 1554, 1256, 1256, 1256, 0.8, 0.8, 1.0, 1.1, 1.9, 2.9, 2.0, 0.43465, 0, 0, 0, 0, 0, 0
total, 2929, 1375, 1375, 1375, 0.7, 0.7, 0.9, 1.0, 1.3, 2.4, 3.0, 0.28440, 0, 0, 0, 0, 0, 0
total, 4343, 1414, 1414, 1414, 0.7, 0.7, 0.9, 1.0, 1.7, 2.5, 4.0, 0.21069, 0, 0, 0, 0, 0, 0
total, 5733, 1390, 1390, 1390, 0.7, 0.7, 0.9, 1.0, 1.3, 1.5, 5.0, 0.16648, 0, 0, 0, 0, 0, 0
total, 7159, 1426, 1426, 1426, 0.7, 0.7, 0.9, 1.0, 1.3, 1.5, 6.0, 0.13798, 0, 0, 0, 0, 0, 0
total, 8614, 1455, 1455, 1455, 0.7, 0.7, 0.9, 1.0, 1.1, 1.2, 7.0, 0.11805, 0, 0, 0, 0, 0, 0
total, 10000, 1386, 1386, 1386, 0.7, 0.7, 0.9, 1.0, 1.5, 3.4, 8.0, 0.10273, 0, 0, 0, 0, 0, 0
<stuck>
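For illustration, here is a minimal Java sketch of the suspected behavior (all names are hypothetical, not the actual cassandra-stress classes): a sequential seed source that wraps around for the normal write workload, but for counter_write has nothing left to hand out once the range is exhausted, so the worker waits forever and the run hangs past its duration.

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the population generator driven by -pop seq=1..N.
final class SeqSeedSource {
    private final long first, last;     // the seq=first..last range
    private final boolean wrap;         // write: true; counter_write: suspected false
    private final AtomicLong next;

    SeqSeedSource(long first, long last, boolean wrap) {
        this.first = first;
        this.last = last;
        this.wrap = wrap;
        this.next = new AtomicLong(first);
    }

    // Hands out seeds in order; once the range is exhausted, either wraps
    // around (normal write) or leaves the caller waiting forever (<stuck>).
    Long nextSeed() throws InterruptedException {
        long n = next.getAndIncrement();
        if (n <= last)
            return n;
        if (wrap)
            return first + ((n - first) % (last - first + 1));
        while (true)                    // exhausted and no wrap: never returns
            Thread.sleep(100);
    }
}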
The counter_write workload exits successfully if the duration ends before the whole population has been written.
[centos@ip-10-0-106-202 ~]$ cassandra-stress counter_write no-warmup cl=QUORUM duration=6s -port jmx=6868 -mode cql3 native -rate threads=1 -pop seq=1..10000 -node 10.0.226.255
....
Running COUNTER_WRITE with 1 threads 6 seconds
Failed to connect over JMX; not collecting these stats
type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb
total, 245, 245, 245, 245, 1.0, 1.0, 1.4, 2.0, 8.5, 8.5, 1.0, 0.00000, 0, 0, 0, 0, 0, 0
total, 1480, 1235, 1235, 1235, 0.8, 0.8, 1.0, 1.2, 2.6, 2.6, 2.0, 0.47184, 0, 0, 0, 0, 0, 0
total, 2809, 1329, 1329, 1329, 0.7, 0.8, 0.9, 1.1, 2.6, 2.9, 3.0, 0.30180, 0, 0, 0, 0, 0, 0
total, 4185, 1376, 1376, 1376, 0.7, 0.7, 0.9, 1.0, 1.3, 1.5, 4.0, 0.22199, 0, 0, 0, 0, 0, 0
total, 5559, 1374, 1374, 1374, 0.7, 0.7, 0.9, 1.0, 1.3, 1.6, 5.0, 0.17522, 0, 0, 0, 0, 0, 0
total, 6948, 1389, 1389, 1389, 0.7, 0.7, 0.9, 1.0, 1.5, 2.1, 6.0, 0.14484, 0, 0, 0, 0, 0, 0
total, 7982, 1438, 1438, 1438, 0.7, 0.7, 0.9, 1.0, 1.2, 1.4, 6.7, 0.12392, 0, 0, 0, 0, 0, 0
Results:
Op rate : 1,188 op/s [COUNTER_WRITE: 1,188 op/s]
Partition rate : 1,188 pk/s [COUNTER_WRITE: 1,188 pk/s]
Row rate : 1,188 row/s [COUNTER_WRITE: 1,188 row/s]
Latency mean : 0.7 ms [COUNTER_WRITE: 0.7 ms]
Latency median : 0.7 ms [COUNTER_WRITE: 0.7 ms]
Latency 95th percentile : 1.0 ms [COUNTER_WRITE: 1.0 ms]
Latency 99th percentile : 1.2 ms [COUNTER_WRITE: 1.2 ms]
Latency 99.9th percentile : 2.1 ms [COUNTER_WRITE: 2.1 ms]
Latency max : 8.5 ms [COUNTER_WRITE: 8.5 ms]
Total partitions : 7,982 [COUNTER_WRITE: 7,982]
Total errors : 0 [COUNTER_WRITE: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:00:06
END
@roy
@fgelcer as you bumped into it first, please help investigate. Can you run it with 3.0 and compare it with 3.1?
The hang also exists with 3.0.7 (ami-012abc8d72fd276b0).
$ rpm -qa |grep scylla
scylla-libgcc73-7.3.1-1.2.el7.centos.x86_64
scylla-conf-3.0.7-0.20190624.b6fa715f7.el7.x86_64
scylla-libatomic73-7.3.1-1.2.el7.centos.x86_64
scylla-tools-core-3.0.7-20190624.24bd7f3aad.el7.noarch
scylla-jmx-3.0.7-20190624.c9dd098.el7.noarch
scylla-env-1.1-1.el7.noarch
scylla-kernel-conf-3.0.7-0.20190624.b6fa715f7.el7.x86_64
scylla-ixgbevf-4.3.6-1dkms.noarch
scylla-server-3.0.7-0.20190624.b6fa715f7.el7.x86_64
scylla-debuginfo-3.0.7-0.20190624.b6fa715f7.el7.x86_64
scylla-ena-2.0.2-2dkms.noarch
scylla-ami-3.0.7-20190624.adbc493.el7.noarch
scylla-libstdc++73-7.3.1-1.2.el7.centos.x86_64
scylla-tools-3.0.7-20190624.24bd7f3aad.el7.noarch
scylla-3.0.7-0.20190624.b6fa715f7.el7.x86_64
@amoskong so basically, if we enlarge -n or the seq range to a very high number, it will finish OK? Why not just do that?
i.e. it could be that Scylla became faster, or that we changed the cluster sizes to create these situations, hence we didn't see it much before...
@fruch, my job has seq=1..2097152 and it still gets stuck (even worse, since I don't always get back the result summary of the stress job).
Yes, in theory. But we run very long durations for the 4-day and 7-day longevity tests, so we would need a very, very big population.
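To put a number on that: for a duration-bound run to never exhaust its population, the seq range has to exceed op rate times duration. A rough back-of-the-envelope check (the rate is an assumption taken from the single-thread runs above; real longevity jobs run many threads, so the bound only grows):

public class PopulationSize {
    public static void main(String[] args) {
        long opRate = 1_400;                 // op/s, roughly what one thread did above
        long fourDays = 4L * 24 * 3600;      // 345,600 s
        long sevenDays = 7L * 24 * 3600;     // 604,800 s
        System.out.println("4-day run needs N > " + opRate * fourDays);   // 483,840,000
        System.out.println("7-day run needs N > " + opRate * sevenDays);  // 846,720,000
    }
}

So enlarging seq is a workaround, but the required range scales with cluster speed and test length.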
Not only that, I think we may want to rewrite some of the values (TBH, I don't have a clue how counter_write works).
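For reference, the counter part is standard CQL semantics rather than anything c-s specific: counter columns have no INSERT and can only be changed by increments, so revisiting a key simply adds to its current value, and wrapping around the population would be perfectly valid for counter_write. A minimal sketch using the DataStax 3.x Java driver against the keyspace1.counter1 schema shown further down (contact point and key are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CounterIncrement {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Each UPDATE adds to the stored value; running this twice
            // for the same key leaves "C0" at 2, not overwritten to 1.
            session.execute(
                "UPDATE keyspace1.counter1 SET \"C0\" = \"C0\" + 1 WHERE key = 0x01");
            session.execute(
                "UPDATE keyspace1.counter1 SET \"C0\" = \"C0\" + 1 WHERE key = 0x01");
        }
    }
}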
maybe this is related to #54
@slivne Can you involve or assign someone? The issue causes some of the longevity tests to always fail; it's a TEST blocker.
@amoskong, to make it clear that this is an issue with c-s: it will happen even with Cassandra. Can you check that?
The hang still exists with the latest Cassandra:
https://github.com/apache/cassandra.git
commit 86812fa5024d957e28f195b2c4db3813439fb2c5
Merge: d0a207b 7206ff5
Author: Blake Eggleston <[email protected]>
Date: Mon Jul 8 15:26:22 2019 -0700
Merge branch 'cassandra-3.11' into trunk
cqlsh> CREATE KEYSPACE IF NOT EXISTS keyspace1
... WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
cqlsh>
cqlsh> CREATE TABLE IF NOT EXISTS keyspace1.counter1 (
... key blob PRIMARY KEY,
... "C0" counter,
... "C1" counter,
... "C2" counter,
... "C3" counter,
... "C4" counter
... ) WITH bloom_filter_fp_chance = 0.01
... AND comment = ''
... AND compaction = {'class': 'SizeTieredCompactionStrategy'}
... AND compression = {}
... AND default_time_to_live = 0
... AND gc_grace_seconds = 864000
... AND max_index_interval = 2048
... AND memtable_flush_period_in_ms = 0
... AND min_index_interval = 128
... AND speculative_retry = '99.0PERCENTILE';
cqlsh>
[amos@amos-centos7 apache-cassandra]$ ./tools/bin/cassandra-stress counter_write no-warmup cl=QUORUM duration=10s -port jmx=6868 -mode cql3 native -rate threads=1 -pop seq=1..1000
...
total, 1000, 361, 361, 361, 2.7, 2.6, 4.9, 6.6, 10.5, 10.5, 5.0, 0.24362, 0, 0, 0, 0, 0, 0
<stuck>
The result only shows that the problem exists with Cassandra + c-s; it's still not clear whether it's a problem in the c-s tool, the server, or both. In any case, it is genuinely an upstream issue.