scylla-bench
"unable to create session: unable to discover protocol version: gocql: no response received from cassandra within timeout period"
Installation details
Cluster size: 4 nodes (i3en.3xlarge)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: ami-07835983d717b1ea3
(aws: eu-west-1)
Test: longevity-large-partition-4days-test
Test id: e8db664c-fe93-4ab9-9121-a2b9ed32d109
Test name: scylla-5.0/longevity/longevity-large-partition-4days-test
Test config file(s):
Issue description
scylla-bench v0.1.3 fails on start with an error of:
< t:2022-04-28 13:29:26,056 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 2022/04/28 13:29:25 gocql: unable to create session: unable to discover protocol version: gocql: no response received from cassandra within timeout period
< t:2022-04-28 13:29:26,056 f:base.py l:146 c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "/$HOME/go/bin/scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=6 -clustering-row-count=10000000 -partition-offset=54 -clustering-row-size=2048 -concurrency=250 -rows-per-request=2000 -timeout=180s -connection-count 150 -error-at-row-limit 1000 -nodes 10.0.0.64"; Exit status: 1
< t:2022-04-28 13:29:26,057 f:base.py l:150 c:RemoteLibSSH2CmdRunner p:DEBUG > STDERR: 2022/04/28 13:29:25 gocql: unable to create session: unable to discover protocol version: gocql: no response received from cassandra within timeout period
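For context on the error itself: when the gocql `ClusterConfig` leaves `ProtoVersion` at 0, the driver first probes the node to discover the protocol version, and it is that discovery round-trip which is timing out here. A minimal sketch (contact point and durations are illustrative, not taken from the scylla-bench code) of pinning the protocol version and raising the connect timeout on the gocql side:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact point; the failing run above targeted 10.0.0.64.
	cluster := gocql.NewCluster("10.0.0.64")

	// With ProtoVersion left at 0, gocql probes the node to discover the
	// protocol version first; pinning it skips that discovery round-trip.
	cluster.ProtoVersion = 4

	// Illustrative timeouts: give slow or busy nodes more time to answer
	// the initial connection and the first requests.
	cluster.ConnectTimeout = 10 * time.Second
	cluster.Timeout = 10 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("unable to create session: %v", err)
	}
	defer session.Close()
}
```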
loader info:
< t:2022-04-28 12:50:25,758 f:cluster.py l:3055 c:sdcm.cluster p:INFO > Cluster longevity-large-partitions-4d-5-0-loader-set-1ec8c3b8 (AMI: ['ami-0f1099eae729b80f1'] Type: c5n.2xlarge): Init nodes
< t:2022-04-28 12:50:26,344 f:cluster_aws.py l:352 c:sdcm.cluster p:INFO > Cluster longevity-large-partitions-4d-5-0-loader-set-1ec8c3b8 (AMI: ['ami-0f1099eae729b80f1'] Type: c5n.2xlarge): Found provisioned instances = [ec2.Instance(id='i-0ed9c22d68ff40f84'), ec2.Instance(id='i-00f4982101d9f4cfe'), ec2.Instance(id='i-01b2012c36471bb1b')]
< t:2022-04-28 13:08:48,112 f:cluster.py l:3482 c:sdcm.cluster p:DEBUG > Class instance: Cluster longevity-large-partitions-4d-5-0-loader-set-1ec8c3b8 (AMI: ['ami-0f1099eae729b80f1'] Type: c5n.2xlarge)
It seems to run well in another longevity test (200k-pk) that uses a different AMI and a different instance type:
< t:2022-04-13 07:53:31,070 f:cluster.py l:3055 c:sdcm.cluster p:INFO > Cluster longevity-large-partitions-200k-pks-loader-set-e9f28ce7 (AMI: ['ami-0e8fcdd01f9f0389a'] Type: c5.xlarge): Init nodes
< t:2022-04-13 07:53:31,780 f:cluster_aws.py l:352 c:sdcm.cluster p:INFO > Cluster longevity-large-partitions-200k-pks-loader-set-e9f28ce7 (AMI: ['ami-0e8fcdd01f9f0389a'] Type: c5.xlarge): Found provisioned instances = [ec2.Instance(id='i-0d5eb42e89189fdb6'), ec2.Instance(id='i-09109e2d0e23e20e8'), ec2.Instance(id='i-03b14390dde643d80'), ec2.Instance(id='i-075bdcc768c199194')]
< t:2022-04-13 08:16:21,521 f:cluster.py l:3482 c:sdcm.cluster p:DEBUG > Class instance: Cluster longevity-large-partitions-200k-pks-loader-set-e9f28ce7 (AMI: ['ami-0e8fcdd01f9f0389a'] Type: c5.xlarge)
- Restore Monitor Stack command:
$ hydra investigate show-monitor e8db664c-fe93-4ab9-9121-a2b9ed32d109
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs e8db664c-fe93-4ab9-9121-a2b9ed32d109
Logs:
@fgelcer , @roydahan , looks like it constantly fails in the large-partitions-4d test. It doesn't sound like adding some retries will be useful here. Perhaps only playing with the AMI/instance type would make a difference (it may also be the same AMI in different regions), please advise.
longevity-twcs-3h-test, 2022.1.rc5 http://13.48.103.68/test_run/4f585541-af89-4da9-bd54-edf413637a9b
Two scylla-bench threads reported a lot of timeout errors while compactions were running:
2022/05/22 16:59:18 gocql: no response received from cassandra within timeout period
2022/05/22 16:59:20 Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.
and failed to send load. The threads got stuck and didn't finish on time. One thread managed to finish.
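For reference, the QUORUM error above means that with RF=3 the coordinator needed 2 replica acknowledgements but received only 1 before its timeout. On the driver side, a hedged sketch (host and retry values are illustrative only; scylla-bench itself only grew retry configuration later, as noted at the end of this thread) of how consistency and retries are typically configured in gocql:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.0.1.30") // placeholder contact point
	cluster.Consistency = gocql.Quorum       // RF=3, so QUORUM needs 2 replica responses
	cluster.Timeout = 120 * time.Second      // same order of magnitude as the s-b --timeout 120s

	// Illustrative retry policy: retry a timed-out request a few times
	// before surfacing the error to the caller.
	cluster.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 3}

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("unable to create session: %v", err)
	}
	defer session.Close()

	// Table name taken from the log above; the column name is an assumption.
	if err := session.Query(`SELECT pk FROM scylla_bench.test LIMIT 1`).Exec(); err != nil {
		log.Printf("read failed: %v", err)
	}
}
```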
Installation details
Kernel Version: 5.13.0-1022-aws
Scylla version (or git commit hash): 2022.1~rc5-20220515.6a1e89fbb
with build-id 5cecadda59974548befb4305363bf374631fc3e1
Cluster size: 5 nodes (i3.2xlarge)
Scylla Nodes used in this run:
- longevity-twcs-3h-2022-1-db-node-4f585541-6 (16.16.67.55 | 10.0.1.30) (shards: 8)
- longevity-twcs-3h-2022-1-db-node-4f585541-5 (16.170.227.192 | 10.0.2.72) (shards: 8)
- longevity-twcs-3h-2022-1-db-node-4f585541-4 (13.49.226.206 | 10.0.1.50) (shards: 8)
- longevity-twcs-3h-2022-1-db-node-4f585541-3 (16.16.27.177 | 10.0.1.97) (shards: 8)
- longevity-twcs-3h-2022-1-db-node-4f585541-2 (16.16.65.224 | 10.0.0.222) (shards: 8)
- longevity-twcs-3h-2022-1-db-node-4f585541-1 (16.16.63.138 | 10.0.1.138) (shards: 8)
OS / Image: ami-0838dc54c055ad05a
(aws: eu-north-1)
Test: longevity-twcs-3h-test
Test id: 4f585541-af89-4da9-bd54-edf413637a9b
Test name: enterprise-2022.1/longevity/longevity-twcs-3h-test
Test config file(s):
- Restore Monitor Stack command:
$ hydra investigate show-monitor 4f585541-af89-4da9-bd54-edf413637a9b
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 4f585541-af89-4da9-bd54-edf413637a9b
Logs:
- db-cluster-4f585541.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4f585541-af89-4da9-bd54-edf413637a9b/20220522_195548/db-cluster-4f585541.tar.gz
- monitor-set-4f585541.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4f585541-af89-4da9-bd54-edf413637a9b/20220522_195548/monitor-set-4f585541.tar.gz
- loader-set-4f585541.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4f585541-af89-4da9-bd54-edf413637a9b/20220522_195548/loader-set-4f585541.tar.gz
- sct-runner-4f585541.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4f585541-af89-4da9-bd54-edf413637a9b/20220522_195548/sct-runner-4f585541.tar.gz
- parallel-timelines-report-4f585541.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4f585541-af89-4da9-bd54-edf413637a9b/20220522_195548/parallel-timelines-report-4f585541.tar.gz
The issue is reproduced with @aleksbykov's fix https://github.com/scylladb/scylla-cluster-tests/pull/4834
It happened after the major_compaction nemesis:
Writes almost stopped and reads got more resources. Same error:
2022/05/31 12:33:58 gocql: no response received from cassandra within timeout period
The test reached its timeout because the write s-b load got stuck and didn't finish.
Installation details
Kernel Version: 5.13.0-1022-aws
Scylla version (or git commit hash): 2022.1~rc5-20220515.6a1e89fbb
with build-id 5cecadda59974548befb4305363bf374631fc3e1
Cluster size: 5 nodes (i3.2xlarge)
Scylla Nodes used in this run:
- longevity-twcs-3h-2022-1-db-node-425600b6-6 (13.53.37.47 | 10.0.0.89) (shards: 8)
- longevity-twcs-3h-2022-1-db-node-425600b6-5 (13.51.107.183 | 10.0.0.80) (shards: 8)
- longevity-twcs-3h-2022-1-db-node-425600b6-4 (16.170.237.251 | 10.0.0.20) (shards: 8)
- longevity-twcs-3h-2022-1-db-node-425600b6-3 (13.48.45.149 | 10.0.3.167) (shards: 8)
- longevity-twcs-3h-2022-1-db-node-425600b6-2 (13.49.23.19 | 10.0.2.56) (shards: 8)
- longevity-twcs-3h-2022-1-db-node-425600b6-1 (16.16.25.250 | 10.0.0.230) (shards: 8)
OS / Image: ami-0838dc54c055ad05a
(aws: eu-north-1)
Test: longevity-twcs-3h-test
Test id: 425600b6-f51b-4cf8-acc8-b2de85ce3118
Test name: enterprise-2022.1/longevity/longevity-twcs-3h-test
Test config file(s):
- Restore Monitor Stack command:
$ hydra investigate show-monitor 425600b6-f51b-4cf8-acc8-b2de85ce3118
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 425600b6-f51b-4cf8-acc8-b2de85ce3118
Logs:
- db-cluster-425600b6.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/425600b6-f51b-4cf8-acc8-b2de85ce3118/20220531_150806/db-cluster-425600b6.tar.gz
- monitor-set-425600b6.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/425600b6-f51b-4cf8-acc8-b2de85ce3118/20220531_150806/monitor-set-425600b6.tar.gz
- loader-set-425600b6.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/425600b6-f51b-4cf8-acc8-b2de85ce3118/20220531_150806/loader-set-425600b6.tar.gz
- sct-runner-425600b6.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/425600b6-f51b-4cf8-acc8-b2de85ce3118/20220531_150806/sct-runner-425600b6.tar.gz
- parallel-timelines-report-425600b6.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/425600b6-f51b-4cf8-acc8-b2de85ce3118/20220531_150806/parallel-timelines-report-425600b6.tar.gz
@aleksbykov / @juliayakovlev maybe major compaction isn't efficient for TWCS? Please search existing issues for that or ask someone.
Major compaction has to compact sstables per time window if, for some reason, an sstable contains several time windows. Maybe that is why s-b failed: major compaction took too many resources?
@roydahan @aleksbykov I suggest opening the issue in Scylla. In both runs the write load failed AFTER the major compaction nemesis. I didn't find a similar issue.
@roydahan @aleksbykov I found now that the write load actually finished with status 0 at this time (which we thought was the problem):
< t:2022-05-22 18:51:57,373 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Time (avg): 2h49m59.979177235s
< t:2022-05-22 18:51:57,373 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Total ops: 433555346
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Total rows: 433555346
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Total errors: 7202
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Operations/s: +Inf
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Rows/s: +Inf
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > raw latency :
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > max: 45.197819903s
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 99.9th: 21.397503ms
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 99th: 5.636095ms
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 95th: 4.227071ms
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 90th: 3.637247ms
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > median: 1.474559ms
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > mean: 2.176019ms
< t:2022-05-22 18:51:57,374 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > c-o fixed latency :
< t:2022-05-22 18:51:57,375 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > max: 6m0.240381951s
< t:2022-05-22 18:51:57,375 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 99.9th: 6m0.240381951s
< t:2022-05-22 18:51:57,375 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 99th: 6m0.240381951s
< t:2022-05-22 18:51:57,375 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 95th: 6m0.240381951s
< t:2022-05-22 18:51:57,375 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 90th: 6m0.240381951s
< t:2022-05-22 18:51:57,375 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > median: 6m0.240381951s
< t:2022-05-22 18:51:57,375 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > mean: 18.359473709s
< t:2022-05-22 18:51:57,879 f:base.py l:140 c:RemoteLibSSH2CmdRunner p:DEBUG > 2022/05/22 17:37:47 gocql: connection closed waiting for response
< t:2022-05-22 18:51:57,879 f:base.py l:140 c:RemoteLibSSH2CmdRunner p:DEBUG > 2022/05/22 17:37:47 gocql: connection closed waiting for response
< t:2022-05-22 18:51:57,903 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/$HOME/go/bin/scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000000 -clustering-row-size=200 -concurrency=100 -rows-per-request=100 -start-timestamp=1653235307664157164 -connection-count 100 -max-rate 50000 --timeout 120s -duration=170m -error-at-row-limit 1000 -nodes 10.0.1.138" finished with status 0
< t:2022-05-22 18:51:57,905 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2022-05-22 18:51:57.903: (ScyllaBenchEvent Severity.NORMAL) period_type=end event_id=569bc757-1db6-4c0b-b0fc-61529e109cce duration=2h50m2s: node=Node longevity-twcs-3h-2022-1-loader-node-4f585541-1 [13.53.37.0 | 10.0.0.154] (seed: False)
< t:2022-05-22 18:51:57,905 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > stress_cmd=scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000000 -clustering-row-size=200 -concurrency=100 -rows-per-request=100 -start-timestamp=SET_WRITE_TIMESTAMP -connection-count 100 -max-rate 50000 --timeout 120s -duration=170m -error-at-row-limit 1000
< t:2022-05-22 18:51:57,910 f:common.py l:2241 c:utils p:ERROR > Value write can't be converted to float. Exception: could not convert string to float: 'write'
< t:2022-05-22 18:51:57,916 f:common.py l:2241 c:utils p:ERROR > Value timeseries can't be converted to float. Exception: could not convert string to float: 'timeseries'
< t:2022-05-22 18:51:57,916 f:common.py l:2241 c:utils p:ERROR > Value quorum can't be converted to float. Exception: could not convert string to float: 'quorum'
< t:2022-05-22 18:51:57,917 f:common.py l:2241 c:utils p:ERROR > Value Fixed(200) can't be converted to float. Exception: could not convert string to float: 'Fixed(200)'
< t:2022-05-22 18:51:57,917 f:common.py l:2241 c:utils p:ERROR > Value true can't be converted to float. Exception: could not convert string to float: 'true'
< t:2022-05-22 18:51:57,918 f:scylla_bench_thread.py l:269 c:sdcm.scylla_bench_thread p:DEBUG > unknown result key found: `Start timestamp` with value `1653235307664157164`
< t:2022-05-22 18:51:57,918 f:scylla_bench_thread.py l:269 c:sdcm.scylla_bench_thread p:DEBUG > unknown result key found: `Write rate` with value `125`
< t:2022-05-22 18:51:57,918 f:scylla_bench_thread.py l:269 c:sdcm.scylla_bench_thread p:DEBUG > unknown result key found: `Hdr memory consumption` with value `49179200 bytes`
< t:2022-05-22 18:51:57,927 f:scylla_bench_thread.py l:269 c:sdcm.scylla_bench_thread p:DEBUG > unknown result key found: `Total errors` with value `7202`
< t:2022-05-22 18:51:57,927 f:scylla_bench_thread.py l:269 c:sdcm.scylla_bench_thread p:DEBUG > unknown result key found: `raw latency` with value ``
< t:2022-05-22 18:51:57,927 f:scylla_bench_thread.py l:269 c:sdcm.scylla_bench_thread p:DEBUG > unknown result key found: `mean` with value `2.176019ms`
< t:2022-05-22 18:51:57,928 f:scylla_bench_thread.py l:137 c:sdcm.scylla_bench_thread p:DEBUG > Wait for stress threads results
Continuing to check.
I think we may have some missing information or an unclear picture of the expected performance of scylla-bench: is the timeout value correct? Is the go-driver timeout value correct?
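To make the question concrete: the s-b -timeout flag and the go-driver's own timeouts are separate knobs, and "no response received ... within timeout period" is the driver-side one. A hedged sketch (contact point and durations are placeholders; how scylla-bench wires its flags to these fields is not verified here) of the two gocql-level settings plus a per-request deadline:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.0.1.138") // placeholder contact point

	// Driver-side per-request timeout: this is the limit behind
	// "no response received from cassandra within timeout period".
	cluster.Timeout = 120 * time.Second

	// Connection-establishment timeout, used for new connections and the
	// control connection; it is independent of the per-request timeout.
	cluster.ConnectTimeout = 15 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// A context deadline bounds a single request independently of
	// cluster.Timeout, which helps tell "driver gave up" apart from
	// "server gave up" while debugging.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := session.Query(`SELECT now() FROM system.local`).WithContext(ctx).Exec(); err != nil {
		log.Printf("request failed: %v", err)
	}
}
```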
@aleksbykov, you worked on the TWCS longevity and its command; please work with @juliayakovlev to check if there is an issue with our s-b commands.
Reproducer of the TWCS 3h test with @aleksbykov's recommendations:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/ad6a5407-459d-4141-904c-ddbaf51f70f1/20220607_113342/db-cluster-ad6a5407.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/ad6a5407-459d-4141-904c-ddbaf51f70f1/20220607_113342/sct-runner-ad6a5407.tar.gz
Job: https://jenkins.scylladb.com/job/scylla-staging/job/yulia/job/repr-longevity-twcs-3h-test/3/
I see that after the write load finished, 2 read threads continued to run, but there was no read load.
no read load
a lot of read failures
@juliayakovlev I don't understand this part:
I see that after the write load finished, 2 read threads continued to run, but there was no read load.
How do you see that the 2 read threads continue to run?
@roydahan What I see is that gocql timeouts are reported and the final stress events were published later:
Not sure if this is still relevant and if it's a tool issue or driver issue.
It's from before we had retry configuration in s-b.
There is nothing more to do with this issue.
Closing it for now.