scylla-bench
scylla-bench fails to reconnect after altering table
Installation details
Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 5.2.0~dev-20221209.6075e01312a5 with build-id 0e5d044b8f9e5bdf7f53cc3c1e959fab95bf027c
Cluster size: 9 nodes (i3.2xlarge)
Scylla Nodes used in this run:
- longevity-counters-multidc-master-db-node-7785df01-9 (54.157.115.162 | 10.12.2.62) (shards: 7)
- longevity-counters-multidc-master-db-node-7785df01-8 (3.238.92.3 | 10.12.2.95) (shards: 7)
- longevity-counters-multidc-master-db-node-7785df01-7 (3.236.190.51 | 10.12.0.119) (shards: 7)
- longevity-counters-multidc-master-db-node-7785df01-6 (54.212.64.38 | 10.15.0.77) (shards: 7)
- longevity-counters-multidc-master-db-node-7785df01-5 (35.92.94.31 | 10.15.3.207) (shards: 7)
- longevity-counters-multidc-master-db-node-7785df01-4 (34.219.193.110 | 10.15.3.94) (shards: 7)
- longevity-counters-multidc-master-db-node-7785df01-3 (52.213.121.166 | 10.4.0.42) (shards: 7)
- longevity-counters-multidc-master-db-node-7785df01-2 (54.229.18.181 | 10.4.2.143) (shards: 7)
- longevity-counters-multidc-master-db-node-7785df01-1 (34.245.75.18 | 10.4.0.195) (shards: 7)
OS / Image: ami-0b85d6f35bddaff65 ami-0a1ff01b931943772 ami-08e5c2ae0089cade3 (aws: eu-west-1)
Test: longevity-counters-6h-multidc-test
Test id: 7785df01-a1fe-483a-beb7-2f63b9044b87
Test name: scylla-master/raft/longevity-counters-6h-multidc-test
Test config file(s):
Issue description
The counters test in the multi-DC scenario fails persistently after altering a table.
For example, the failure follows ALTER TABLE scylla_bench.test_counters WITH bloom_filter_fp_chance = 0.45374057709882093, ALTER TABLE scylla_bench.test_counters WITH read_repair_chance = 0.9;, or even ALTER TABLE scylla_bench.test_counters WITH comment = 'IHQS6RAYS5VQ6CQZYBYEX1GP';
After such a change, scylla-bench fails the test with this error:
2022/12/09 15:26:29 error: failed to connect to "[HostInfo hostname=\"10.12.0.119\" connectAddress=\"10.12.0.119\" peer=\"<nil>\" rpc_address=\"10.12.0.119\" broadcast_address=\"10.12.0.119\" preferred_ip=\"<nil>\" connect_addr=\"10.12.0.119\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"us-eastscylla_node_east\" rack=\"1a\" host_id=\"ec773dfb-ef87-4ab8-abbf-190e3e082e4c\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response to connection startup within timeout
Later the connection appears to recover, so the connection issues are not permanent, but they are enough to raise a critical error that ends the test.
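For context, the error above is reported by the driver side of scylla-bench (gocql). Below is a minimal sketch, not scylla-bench's actual code, of the two gocql knobs that govern this path: ConnectTimeout bounds the connection startup handshake (where "no response to connection startup within timeout" is raised), and Timeout bounds how long the client waits for a single response. Hosts and values are placeholders.

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact points; in the test these would be the -nodes list.
	cluster := gocql.NewCluster("10.12.0.119", "10.15.0.77", "10.4.0.195")
	cluster.Consistency = gocql.Quorum

	// Bounds the CQL STARTUP handshake on each new connection; exceeding it
	// is reported as "no response to connection startup within timeout".
	cluster.ConnectTimeout = 15 * time.Second

	// Bounds how long the client waits for a response to a single request.
	cluster.Timeout = 15 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("unable to create session: %v", err)
	}
	defer session.Close()
	log.Println("session established")
}
```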
- Restore Monitor Stack command:
$ hydra investigate show-monitor 7785df01-a1fe-483a-beb7-2f63b9044b87 - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 7785df01-a1fe-483a-beb7-2f63b9044b87
Logs:
| 20221209_161654 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_161654/grafana-screenshot-longevity-counters-6h-multidc-test-scylla-per-server-metrics-nemesis-20221209_161803-longevity-counters-multidc-master-monitor-node-7785df01-1.png |
| 20221209_161654 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_161654/grafana-screenshot-overview-20221209_161654-longevity-counters-multidc-master-monitor-node-7785df01-1.png |
| 20221209_162553 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/db-cluster-7785df01.tar.gz |
| 20221209_162553 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/loader-set-7785df01.tar.gz |
| 20221209_162553 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/monitor-set-7785df01.tar.gz |
| 20221209_162553 | sct | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/sct-runner-7785df01.tar.gz |
Maybe Timeout: 5s isn't enough for this test case?
I'm not sure; the disconnections sometimes persisted for up to 2 minutes. We would need to test it.
I tried timeout settings like -timeout 15s -retry-interval=80ms,5s -retry-number=20 and it still failed.
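For reference, here is a rough sketch of how those flags might map onto the gocql driver's timeout and retry knobs. The mapping is an assumption for illustration only and has not been verified against scylla-bench's source; configureRetries is a hypothetical helper.

```go
package main

import (
	"time"

	"github.com/gocql/gocql"
)

// configureRetries is a hypothetical helper approximating
// -timeout 15s -retry-interval=80ms,5s -retry-number=20 on the driver side.
func configureRetries(cluster *gocql.ClusterConfig) {
	cluster.Timeout = 15 * time.Second // -timeout 15s
	cluster.RetryPolicy = &gocql.ExponentialBackoffRetryPolicy{
		NumRetries: 20,                    // -retry-number=20
		Min:        80 * time.Millisecond, // -retry-interval lower bound
		Max:        5 * time.Second,       // -retry-interval upper bound
	}
}

func main() {
	cluster := gocql.NewCluster("10.12.0.119") // placeholder host
	configureRetries(cluster)
}
```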
While running a large-partitions test I encountered a similar problem. I'm not sure whether it is tied to this one, but it's a possibility. After the pre-write workload, when starting one of the stress workloads, we got:
2022-12-08 21:08:42.623: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=a5a01a5d-ef1f-4c96-9836-7b6b23c0d77e duration=10s: node=Node longevity-large-partitions-4d-maste-loader-node-a967ab57-2 [34.249.171.113 | 10.4.2.108] (seed: False)
stress_cmd=scylla-bench -workload=uniform -mode=read -replication-factor=3 -partition-count=60 -clustering-row-count=10000000 -clustering-row-size=2048 -rows-per-request=2000 -timeout=180s -concurrency=700 -max-rate=64000 -duration=5760m -connection-count 500 -error-at-row-limit 1000 -nodes 10.4.1.5,10.4.2.90,10.4.2.71,10.4.1.191
errors:
Stress command completed with bad status 1: 2022/12/08 21:08:42 gocql: unable to create session: unable to fetch peer host info: Operation timed
Running the same job with a pinned version of scylla-bench (0.1.14) did not reproduce this issue. Similarly, a run without Raft did not fail at this point, so there might be some flakiness involved here.
Installation details
Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 5.2.0~dev-20221208.a076ceef97d5 with build-id 020ec076898a692651fd48edfb1920fc190cd81e
Cluster size: 4 nodes (i3en.3xlarge)
Scylla Nodes used in this run:
- longevity-large-partitions-4d-maste-db-node-a967ab57-4 (3.252.203.198 | 10.4.1.191) (shards: 10)
- longevity-large-partitions-4d-maste-db-node-a967ab57-3 (18.203.69.233 | 10.4.2.71) (shards: 10)
- longevity-large-partitions-4d-maste-db-node-a967ab57-2 (52.212.226.132 | 10.4.2.90) (shards: 10)
- longevity-large-partitions-4d-maste-db-node-a967ab57-1 (54.194.73.19 | 10.4.1.5) (shards: 10)
OS / Image: ami-063cdd564cd2fbe46 (aws: eu-west-1)
Test: longevity-large-partition-4days-test
Test id: a967ab57-4860-4f31-8b0a-d940b857542e
Test name: scylla-master/raft/longevity-large-partition-4days-test
Test config file(s):
Issue description
- Restore Monitor Stack command:
$ hydra investigate show-monitor a967ab57-4860-4f31-8b0a-d940b857542e - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs a967ab57-4860-4f31-8b0a-d940b857542e
Logs:
- db-cluster-a967ab57.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/a967ab57-4860-4f31-8b0a-d940b857542e/20221208_212058/db-cluster-a967ab57.tar.gz
- monitor-set-a967ab57.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/a967ab57-4860-4f31-8b0a-d940b857542e/20221208_212058/monitor-set-a967ab57.tar.gz
- loader-set-a967ab57.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/a967ab57-4860-4f31-8b0a-d940b857542e/20221208_212058/loader-set-a967ab57.tar.gz
- sct-runner-a967ab57.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/a967ab57-4860-4f31-8b0a-d940b857542e/20221208_212058/sct-runner-a967ab57.tar.gz
@avelanarius, we suspect there is a regression, or at least a behavior change, in how s-b works for us with a later (latest?) gocql driver. We're somewhat lost on how to debug it or how to make progress. Can you please help us, or advise us on how to debug it further?
scylla-bench failed with unable to create session: unable to fetch peer host info even though all nodes were up and OK
< t:2024-07-25 16:14:19,760 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 2024/07/25 16:14:19 gocql: unable to create session: unable to fetch peer host info: Operation timed out for system.peers - received only 0 responses from 1 CL=ONE.
< t:2024-07-25 16:14:19,761 f:base.py l:146 c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "sudo docker exec 5e5d3d02c589373354dd8ad087985ca17a7db44f6cd5f9a9d115641b82f41fb0 /bin/sh -c 'scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=750 -partition-offset=1251 -clustering-row-count=200000 -clustering-row-size=uniform:100..8192 -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10 -timeout=90s -iterations=0 -duration=720m -error-at-row-limit 1000 -nodes 10.142.0.207,10.142.0.236,10.142.0.240,10.142.0.242,10.142.0.248'"; Exit status: 1
< t:2024-07-25 16:14:19,761 f:base.py l:150 c:RemoteLibSSH2CmdRunner p:DEBUG > STDERR: 2024/07/25 16:14:19 gocql: unable to create session: unable to fetch peer host info: Operation timed out for system.peers - received only 0 responses from 1 CL=ONE.
< t:2024-07-25 16:14:19,763 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2024-07-25 16:14:19.761: (ScyllaBenchEvent Severity.ERROR) period_type=end event_id=53d8c0a0-2870-4453-b7e3-7df585f03411 during_nemesis=RunUniqueSequence duration=18s: node=Node longevity-large-partitions-200k-pks-loader-node-53145d7f-0-1 [35.196.217.128 | 10.142.0.250]
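To help narrow this down, here is a hedged diagnostic sketch (not part of scylla-bench or SCT) that performs the same kind of system.peers read the driver does during session setup, with generous client-side timeouts, to check whether the metadata fetch alone times out under load. The host and timeout values are placeholders.

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact point; substitute the loader's -nodes list.
	cluster := gocql.NewCluster("10.142.0.207")
	cluster.Consistency = gocql.One
	cluster.ConnectTimeout = 30 * time.Second
	cluster.Timeout = 30 * time.Second

	// If session creation itself fails, the error already tells us the
	// metadata fetch timed out independently of the stress workload.
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("create session: %v", err)
	}
	defer session.Close()

	var peer, dc string
	iter := session.Query(`SELECT peer, data_center FROM system.peers`).Iter()
	for iter.Scan(&peer, &dc) {
		log.Printf("peer=%s data_center=%s", peer, dc)
	}
	if err := iter.Close(); err != nil {
		log.Fatalf("system.peers read failed: %v", err)
	}
}
```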
Packages
Scylla version: 2023.1.11-20240725.11a2022bd6ed with build-id a0cab71f78c44bb0b694d46800fbcaef02607251
Kernel Version: 5.15.0-1065-gcp
Issue description
- [ ] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
Installation details
Cluster size: 5 nodes (n2-highmem-16)
Scylla Nodes used in this run:
- longevity-large-partitions-200k-pks-db-node-53145d7f-0-8 (34.23.81.52 | 10.142.0.69) (shards: 14)
- longevity-large-partitions-200k-pks-db-node-53145d7f-0-7 (35.237.229.97 | 10.142.0.12) (shards: 14)
- longevity-large-partitions-200k-pks-db-node-53145d7f-0-6 (35.229.67.161 | 10.142.0.3) (shards: 14)
- longevity-large-partitions-200k-pks-db-node-53145d7f-0-5 (35.196.146.159 | 10.142.0.248) (shards: 14)
- longevity-large-partitions-200k-pks-db-node-53145d7f-0-4 (35.237.38.63 | 10.142.0.242) (shards: 14)
- longevity-large-partitions-200k-pks-db-node-53145d7f-0-3 (35.196.86.69 | 10.142.0.240) (shards: 14)
- longevity-large-partitions-200k-pks-db-node-53145d7f-0-2 (35.227.121.244 | 10.142.0.236) (shards: 14)
- longevity-large-partitions-200k-pks-db-node-53145d7f-0-1 (35.227.87.19 | 10.142.0.207) (shards: 14)
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/6980420640571389317 (gce: undefined_region)
Test: longevity-large-partition-200k-pks-4days-gce-test
Test id: 53145d7f-6918-4728-acc6-6236916d8d08
Test name: enterprise-2023.1/longevity/longevity-large-partition-200k-pks-4days-gce-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 53145d7f-6918-4728-acc6-6236916d8d08 - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 53145d7f-6918-4728-acc6-6236916d8d08
Logs:
- db-cluster-53145d7f.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/53145d7f-6918-4728-acc6-6236916d8d08/20240726_041924/db-cluster-53145d7f.tar.gz
- sct-runner-events-53145d7f.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/53145d7f-6918-4728-acc6-6236916d8d08/20240726_041924/sct-runner-events-53145d7f.tar.gz
- sct-53145d7f.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/53145d7f-6918-4728-acc6-6236916d8d08/20240726_041924/sct-53145d7f.log.tar.gz
- loader-set-53145d7f.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/53145d7f-6918-4728-acc6-6236916d8d08/20240726_041924/loader-set-53145d7f.tar.gz
- monitor-set-53145d7f.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/53145d7f-6918-4728-acc6-6236916d8d08/20240726_041924/monitor-set-53145d7f.tar.gz
I'm trying to understand if it's a scylla-bench issue; it looks like a gocql issue to me. @sylwiaszunejko / @dkropachev can you please take a look at this one?
It's probably caused by Scylla slowing down; the internal queries might not have large enough timeouts configured.
So, as always, it's a combination of a Scylla issue, how strict we want to be with timeouts, and how configurable those internal queries are.