scylla-bench
Loader node with scylla-bench v0.1.8 got a core dump
Installation details
Kernel version: 5.11.0-1022-aws
Scylla version (or git commit hash): 4.6.rc5-0.20220203.5694ec189 with build-id f5d85bf5abe6d2f9fd3487e2469ce1c34304cc14
Cluster size: 4 nodes (i3en.3xlarge)
Scylla running with shards number (live nodes):
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-1 (16.170.220.3 | 10.0.3.180): 12 shards
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-2 (13.48.106.98 | 10.0.1.75): 12 shards
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-4 (13.51.193.35 | 10.0.3.6): 12 shards
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-5 (16.171.64.136 | 10.0.0.210): 12 shards
Scylla running with shards number (terminated nodes):
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-3 (16.170.157.129 | 10.0.3.67): 12 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-099a011bd5f16a168
(aws: eu-north-1)
Test: longevity-large-partition-4days-test
Test name: longevity_large_partition_test.LargePartitionLongevityTest.test_large_partition_longevity
Test config file(s):
Issue description
====================================
Two loader nodes running scylla-bench v0.1.8 got 3 core dumps:
2022-02-04 19:12:38.443: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=808c9044-c851-4fe5-884a-c7217aa8d4c7 node=Node longevity-large-partitions-4d-4-6-loader-node-e2adc2e9-2 [16.170.143.136 | 10.0.3.155] (seed: False)
2022-02-04 19:30:02.330: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=39c8e89c-222a-4e89-bfef-a0ad19fe9903 node=Node longevity-large-partitions-4d-4-6-loader-node-e2adc2e9-1 [13.48.13.196 | 10.0.3.125] (seed: False)
2022-02-04 20:50:58.048: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=c3e3ce43-fa0c-4381-aa90-2bfd04b8eb7c node=Node longevity-large-partitions-4d-4-6-loader-node-e2adc2e9-1 [13.48.13.196 | 10.0.3.125] (seed: False)
It looks like SCT failed to upload the core dumps to S3:
< t:2022-02-04 19:30:02,331 f:file_logger.py l:89 c:sdcm.sct_events.file_logger p:INFO > 2022-02-04 19:30:02.330: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=39c8e89c-222a-4e89-bfef-a0ad19fe9903 node=Node longevity-large-partitions-4d-4-6-loader-node-e2adc2e9-1 [13.48.13.196 | 10.0.3.125] (seed: False)
< t:2022-02-04 19:30:33,724 f:coredump.py l:389 c:sdcm.cluster_aws p:ERROR > Node longevity-large-partitions-4d-4-6-loader-node-e2adc2e9-1 [13.48.13.196 | 10.0.3.125] (seed: False): CoredumpExportSystemdThread: Failed to convert date 'Timestamp: Fri 2022-02-04 19:13:57 UTC (16min ago)' (Fri 2022-02-04 19:13:57 UTC), due to error: time data 'Fri 2022-02-04 19:13:57 UTC' does not match format '%a %Y-%m-%d %H:%M:%S %z'
< t:2022-02-04 19:30:33,725 f:coredump.py l:220 c:sdcm.cluster_aws p:ERROR > Node longevity-large-partitions-4d-4-6-loader-node-e2adc2e9-1 [13.48.13.196 | 10.0.3.125] (seed: False): CoredumpExportSystemdThread: CoreDump[859] has inaccessible corefile, can't upload it
< t:2022-02-04 20:50:58,050 f:file_logger.py l:89 c:sdcm.sct_events.file_logger p:INFO > 2022-02-04 20:50:58.048: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=c3e3ce43-fa0c-4381-aa90-2bfd04b8eb7c node=Node longevity-large-partitions-4d-4-6-loader-node-e2adc2e9-1 [13.48.13.196 | 10.0.3.125] (seed: False)
< t:2022-02-04 20:51:58,811 f:coredump.py l:389 c:sdcm.cluster_aws p:ERROR > Node longevity-large-partitions-4d-4-6-loader-node-e2adc2e9-1 [13.48.13.196 | 10.0.3.125] (seed: False): CoredumpExportSystemdThread: Failed to convert date 'Timestamp: Fri 2022-02-04 20:35:03 UTC (16min ago)' (Fri 2022-02-04 20:35:03 UTC), due to error: time data 'Fri 2022-02-04 20:35:03 UTC' does not match format '%a %Y-%m-%d %H:%M:%S %z'
< t:2022-02-04 20:51:58,811 f:coredump.py l:220 c:sdcm.cluster_aws p:ERROR > Node longevity-large-partitions-4d-4-6-loader-node-e2adc2e9-1 [13.48.13.196 | 10.0.3.125] (seed: False): CoredumpExportSystemdThread: CoreDump[6632] has inaccessible corefile, can't upload it
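The date conversion error in the log can be reproduced outside SCT: Python's `%z` directive expects a numeric UTC offset such as `+0000`, so the literal zone name `UTC` does not match. The following is a minimal sketch of one possible workaround, not the actual SCT fix. The function name `parse_coredump_timestamp` is hypothetical, and the sketch assumes the timestamp string looks exactly like the one quoted in the error message.

```python
from datetime import datetime, timezone

# Format string taken from the error message in the log above.
SCT_FORMAT = "%a %Y-%m-%d %H:%M:%S %z"


def parse_coredump_timestamp(text: str) -> datetime:
    """Parse a timestamp such as 'Fri 2022-02-04 19:13:57 UTC'.

    Hypothetical helper: '%z' only accepts numeric offsets, so a trailing
    'UTC' zone name is rewritten to '+0000' before parsing.
    """
    text = text.strip()
    if text.endswith(" UTC"):
        text = text[: -len("UTC")] + "+0000"
    return datetime.strptime(text, SCT_FORMAT)


if __name__ == "__main__":
    raw = "Fri 2022-02-04 19:13:57 UTC"
    try:
        datetime.strptime(raw, SCT_FORMAT)
    except ValueError as exc:
        # Same failure as reported by coredump.py in the log.
        print(f"unpatched parse failed: {exc}")
    print(parse_coredump_timestamp(raw).isoformat())
```

Rewriting the suffix instead of switching to `%Z` keeps the result timezone-aware and avoids depending on which zone names the local platform recognizes.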
====================================
Restore Monitor Stack command: $ hydra investigate show-monitor e2adc2e9-28de-4aab-8dd3-5420deabc259
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs e2adc2e9-28de-4aab-8dd3-5420deabc259
Test id: e2adc2e9-28de-4aab-8dd3-5420deabc259
Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_211619/grafana-screenshot-longevity-large-partition-4days-test-scylla-per-server-metrics-nemesis-20220204_211840-longevity-large-partitions-4d-4-6-monitor-node-e2adc2e9-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_211619/grafana-screenshot-overview-20220204_211619-longevity-large-partitions-4d-4-6-monitor-node-e2adc2e9-1.png
critical - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/critical-e2adc2e9.log.tar.gz
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/db-cluster-e2adc2e9.tar.gz
debug - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/debug-e2adc2e9.log.tar.gz
email_data - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/email_data-e2adc2e9.json.tar.gz
error - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/error-e2adc2e9.log.tar.gz
event - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/events-e2adc2e9.log.tar.gz
left_processes - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/left_processes-e2adc2e9.log.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/loader-set-e2adc2e9.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/monitor-set-e2adc2e9.tar.gz
normal - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/normal-e2adc2e9.log.tar.gz
output - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/output-e2adc2e9.log.tar.gz
event - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/raw_events-e2adc2e9.log.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/sct-e2adc2e9.log.tar.gz
summary - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/summary-e2adc2e9.log.tar.gz
warning - https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/warning-e2adc2e9.log.tar.gz
The role backs up and snapshots everything without checking whether the user-provided version is a valid upgrade target or whether it is available at all. It does, however, check in upgrade/main.yml whether the user-provided input is semantically valid, which catches most user mistakes.
Therefore I'd consider this issue an enhancement rather than a bug per se.
I will submit a commit to address some aspects of the upgrade logic (for example, the fact that an upgrade to latest doesn't work) and to add a check for whether any upgrades are available, which will fit nicely when one specifies latest but is already on whatever latest is. ;-)
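The availability check described above could look roughly like the sketch below. It is an illustration in Python, not the role's implementation: the function names `parse_version` and `check_upgrade`, the plain dotted version strings, and the explicit list of available versions are all assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '4.6.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def check_upgrade(current: str, requested: str,
                  available: Sequence[str]) -> Optional[str]:
    """Return the version to upgrade to, or None if nothing needs doing.

    Hypothetical helper: 'requested' is either a dotted version or the
    literal 'latest'; 'available' lists the versions that can be installed.
    """
    if not available:
        raise ValueError("no versions are available")
    if requested == "latest":
        target = max(available, key=parse_version)
    else:
        target = requested
    if target not in available:
        raise ValueError(f"version {target} is not available")
    if parse_version(target) < parse_version(current):
        raise ValueError(f"{target} is older than the installed {current}")
    if parse_version(target) == parse_version(current):
        return None  # already on the requested (or latest) version
    return target


if __name__ == "__main__":
    versions = ["4.5.3", "4.6.0"]
    print(check_upgrade("4.5.3", "latest", versions))
    print(check_upgrade("4.6.0", "latest", versions))
```

Running such a check before the backup and snapshot steps would let the role exit early when there is nothing to upgrade to.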