
One of the nodes fails with `No space left on device` during manager-related nemeses in 24-hour Cloud longevity test

Open ilya-rarov opened this issue 1 year ago • 16 comments

Issue description

  • [ ] This issue is a regression.
  • [x] It is unknown if this issue is a regression.

I am not 100% certain this problem is caused by the manager, but in all the occurrences it happened while one of the manager's tasks was running (maybe this is just a coincidence). If you think it has nothing to do with the manager, please move this issue to a different repo (I guess to scylla-enterprise).

In the problematic longevity build we created a new Cloud cluster in Staging and triggered the following nemeses on it:

  1. 2023-08-04 08:17:20 - 2023-08-04 08:52:20 - disrupt_rolling_restart_cluster
  2. 2023-08-04 09:23:01 - 2023-08-04 23:34:46 - disrupt_mgmt_backup

In parallel, the load with the following config was running:

prepare_write_cmd:
  - "cassandra-stress write cl=QUORUM n=10485760 -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=30 -pop seq=1..10485760 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5"

stress_cmd:
  - "cassandra-stress write cl=QUORUM duration=1400m -schema 'replication(factor=3) compaction (strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=10 -pop seq=1..10485760 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5"
  - "cassandra-stress read  cl=QUORUM duration=1400m -schema 'replication(factor=3) compaction (strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=10 -pop seq=1..10485760 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5"

During the disrupt_mgmt_backup nemesis, the backup task was started via sctool

< t:2023-08-04 09:23:37,080 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager sctool --cluster-id 11232 --  backup  --location AWS_EU_WEST_1:s3:scylla-cloud-backup-11232-11016-v80c7h"...

While the task was running, the storage utilization on the node scylla-cloud-operations-24h-devel-db-node-f04b5124-1 (172.18.122.20) spiked and eventually hit 100%. The node threw the error `No space left on device`:

2023-08-04 17:44:31.526 <2023-08-04 17:44:31.162>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=c1b5b322-9df8-41e9-ab26-fbb18cd9d7f1: type=NO_SPACE_ERROR regex=No space left on device line_number=7273 node=scylla-cloud-operations-24h-devel-db-node-f04b5124-1
2023-08-04T17:44:31.162 ip-172-18-122-20 !ERROR | scylla[24388]:  [shard 0] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)

Here are the storage utilization graphs from the monitor during the whole build (the red line is the problematic node), along with some other graphs from the monitor [screenshots attached].

Impact

One of the cluster nodes runs out of storage space and stops responding.

How frequently does it reproduce?

It reproduced a few times in a row. The other reproducers are added in the following comments.

Installation details

Kernel Version: 5.15.0-1039-aws
Scylla version (or git commit hash): 2022.2.11-20230705.27d29485de90 with build-id f467a0ad8869d61384d8bbc8f20e4fb8fd281f4b

Cloud manager version: 3.1.0

Cluster size: 6 nodes (i4i.large)

Scylla Nodes used in this run:

  • scylla-cloud-operations-24h-devel-db-node-f04b5124-6 (null | 172.18.120.40) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-f04b5124-5 (null | 172.18.120.12) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-f04b5124-4 (null | 172.18.121.241) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-f04b5124-3 (null | 172.18.122.54) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-f04b5124-2 (null | 172.18.121.23) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-f04b5124-1 (null | 172.18.122.20) (shards: 2)

OS / Image: `` (aws: undefined_region)

Test: scylla-cloud-longevity-terraform-operations-24h-aws
Test id: f04b5124-5beb-4019-b3c1-3559c4726f7d
Test name: siren-tests/longevity-tests/staging/scylla-cloud-longevity-terraform-operations-24h-aws
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor f04b5124-5beb-4019-b3c1-3559c4726f7d
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs f04b5124-5beb-4019-b3c1-3559c4726f7d

Logs:

Jenkins job URL Argus

ilya-rarov avatar Aug 28 '23 17:08 ilya-rarov

Issue description

Next repro.

The nemeses:

  1. 2023-08-11 08:22:18 - 2023-08-11 14:44:27 - disrupt_cloud_replace_responsive_node
  2. 2023-08-11 15:08:55 - 2023-08-11 17:08:18 - disrupt_grow_shrink_cluster
  3. 2023-08-11 17:25:47 - 2023-08-12 08:08:34 - disrupt_mgmt_backup_specific_keyspaces

The load settings were the same as for the previous build.

At the beginning of the disrupt_mgmt_backup_specific_keyspaces nemesis, we started a new manager backup task for a specific keyspace - keyspace1

< t:2023-08-11 17:26:25,320 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager sctool --cluster-id 11370 --  backup  --keyspace keyspace1  --location AWS_EU_WEST_1:s3:scylla-cloud-backup-11370-11154-610zsa"...

While the task was running, the node scylla-cloud-operations-24h-devel-db-node-cd335ce7-6 (172.23.24.135) ran out of space:

2023-08-11 18:35:53.521 <2023-08-11 18:35:53.461>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=a0e30ea6-1d07-420d-9f94-778cb70d2a1e: type=NO_SPACE_ERROR regex=No space left on device line_number=4614 node=scylla-cloud-operations-24h-devel-db-node-cd335ce7-6
2023-08-11T18:35:53.461 ip-172-23-24-135 !ERROR | scylla[23420]:  [shard 0] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)

Screenshots from Grafana, in which the affected node is the yellow line [screenshots attached].

Installation details

Kernel Version: 5.15.0-1039-aws
Scylla version (or git commit hash): 2022.2.11-20230705.27d29485de90 with build-id f467a0ad8869d61384d8bbc8f20e4fb8fd281f4b

Cloud manager version: 3.1.0

Cluster size: 6 nodes (i4i.large)

Scylla Nodes used in this run:

  • scylla-cloud-operations-24h-devel-db-node-cd335ce7-8 (null | 172.23.24.87) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-cd335ce7-7 (null | 172.23.25.29) (shards: -1)
  • scylla-cloud-operations-24h-devel-db-node-cd335ce7-6 (null | 172.23.24.135) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-cd335ce7-5 (null | 172.23.24.108) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-cd335ce7-4 (null | 172.23.25.192) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-cd335ce7-3 (null | 172.23.26.231) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-cd335ce7-2 (null | 172.23.25.30) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-cd335ce7-1 (null | 172.23.26.228) (shards: 2)

OS / Image: `` (aws: undefined_region)

Test: scylla-cloud-longevity-terraform-operations-24h-aws
Test id: cd335ce7-7bb4-4bdb-8777-cb25249815b6
Test name: siren-tests/longevity-tests/staging/scylla-cloud-longevity-terraform-operations-24h-aws
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor cd335ce7-7bb4-4bdb-8777-cb25249815b6
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs cd335ce7-7bb4-4bdb-8777-cb25249815b6

Logs:

Jenkins job URL Argus

ilya-rarov avatar Aug 28 '23 18:08 ilya-rarov

Issue description

One more repro.

The nemeses list:

  1. 2023-08-18 08:20:36 - 2023-08-18 08:56:24 - disrupt_rolling_restart_cluster
  2. 2023-08-18 09:31:14 - 2023-08-19 00:32:32 - disrupt_mgmt_backup
  3. 2023-08-19 01:05:45 - 2023-08-19 08:14:45 - disrupt_cloud_replace_non_responsive_node

The load settings were the same as for the previous build.

At the beginning of the disrupt_mgmt_backup nemesis, we started a new backup task

< t:2023-08-18 09:31:51,270 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager sctool --cluster-id 11451 --  backup  --location AWS_EU_WEST_1:s3:scylla-cloud-backup-11451-11234-7ua31o"...

This task ran for longer than the expected 15 hours, and the nemesis failed.

< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR > nemesis_cloud.CloudOperationMonkey: Unhandled exception in method <bound method Nemesis.disrupt_mgmt_backup of <nemesis_cloud.CloudOperationMonkey object at 0x7fd6d4523bb0>>
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 69, in wait_for
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR >     res = retry(func, **kwargs)
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR >     do = self.iter(retry_state=retry_state)
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 360, in iter
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR >     raise retry_exc.reraise()
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 194, in reraise
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR >     raise self
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR > tenacity.RetryError: RetryError[<Future at 0x7fd6d4177610 state=finished returned bool>]
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR > 
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR > The above exception was the direct cause of the following exception:
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR > 
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 399, in wait_for_status
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR >     return wait.wait_for(func=self.is_status_in_list, step=step, throw_exc=True,
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 80, in wait_for
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR >     raise WaitForTimeoutError(err) from ex
< t:2023-08-19 00:32:32,479 f:nemesis.py      l:4837 c:sdcm.nemesis         p:ERROR > sdcm.wait.WaitForTimeoutError: Wait for: Waiting until task: backup/36584ea2-dd92-44f1-8b78-6fe0e519161e reaches status of: ['ERROR (4/4)', 'DONE']: timeout - 54000 seconds - expired
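For context on this failure mode: the error above is only the test harness giving up on waiting, not the backup task failing. A minimal sketch of the tenacity-based polling pattern visible in the traceback (not the actual sdcm.wait code), assuming a boolean condition polled every few seconds:

from tenacity import Retrying, RetryError, retry_if_result, stop_after_delay, wait_fixed

class WaitTimeout(Exception):
    """Raised when the polled condition never became true within the timeout."""

def wait_for(condition, timeout: float, step: float):
    # Re-check `condition` every `step` seconds; once `timeout` seconds pass while it
    # is still False, tenacity raises RetryError, which is re-raised as a timeout.
    retryer = Retrying(
        retry=retry_if_result(lambda ok: not ok),
        stop=stop_after_delay(timeout),
        wait=wait_fixed(step),
    )
    try:
        return retryer(condition)
    except RetryError as ex:
        raise WaitTimeout(f"condition not met within {timeout}s") from ex

# e.g. wait_for(lambda: task_status() in ('ERROR (4/4)', 'DONE'), timeout=54000, step=5)
# where task_status() is a hypothetical helper. The SM task itself is untouched by this
# timeout and keeps running, which matches what happened below.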

But the task itself kept running until it was paused during the next nemesis (disrupt_cloud_replace_non_responsive_node), right before the replace-dead operation was launched:

< t:2023-08-19 01:06:48,353 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager pause-all-jobs --cluster-id 11451"...

After the replace-dead succeeded, the backup task was resumed:

< t:2023-08-19 04:02:10,087 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager resume-all-jobs --cluster-id 11451"...
< t:2023-08-19 04:22:19,210 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > {"level":"debug","ts":1692418938.5278873,"caller":"manager/manager.go:100","msg":"Running Manager task is pending","clusterID":11451,"taskID":"36584ea2-dd92-44f1-8b78-6fe0e519161e","taskName":"","taskType":"backup","taskStatus":"RUNNING"}

Here is the screenshot from the manager dashboard representing the life cycle of this task [screenshot attached].

After the task was resumed, the node scylla-cloud-operations-24h-devel-db-node-9a05f3fd-1 (172.29.130.95) ran out of space:

2023-08-19 07:40:00.189 <2023-08-19 07:39:59.756>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=26dea975-1e63-4af5-9673-a28113c09d63: type=NO_SPACE_ERROR regex=No space left on device line_number=7703 node=scylla-cloud-operations-24h-devel-db-node-9a05f3fd-1
2023-08-19T07:39:59.756 ip-172-29-130-95 !ERROR | scylla[22633]:  [shard 0] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)

In these Grafana graphs, this node is represented by the purple line [screenshots attached].

Another node that ran out of space was scylla-cloud-operations-24h-devel-db-node-9a05f3fd-7 (172.29.129.76), the red line in the graphs above:

2023-08-19T06:43:03.406 ip-172-29-129-76 !ERROR | scylla[11079]:  [shard 1] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)

It was a new node that joined the cluster after the replace-dead operation.

Installation details

Kernel Version: 5.15.0-1040-aws
Scylla version (or git commit hash): 2022.2.12-20230727.f4448d5b0265 with build-id a87bfeb65d24abf65d074a3ba2e5b9664692d716

Cloud manager version: 3.1.0

Cluster size: 6 nodes (i4i.large)

Scylla Nodes used in this run:

  • scylla-cloud-operations-24h-devel-db-node-9a05f3fd-7 (null | 172.29.129.76) (shards: -1)
  • scylla-cloud-operations-24h-devel-db-node-9a05f3fd-6 (null | 172.29.128.19) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-9a05f3fd-5 (null | 172.29.129.41) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-9a05f3fd-4 (null | 172.29.130.128) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-9a05f3fd-3 (null | 172.29.129.157) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-9a05f3fd-2 (null | 172.29.128.49) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-9a05f3fd-1 (null | 172.29.130.95) (shards: 2)

OS / Image: `` (aws: undefined_region)

Test: scylla-cloud-longevity-terraform-operations-24h-aws
Test id: 9a05f3fd-2e20-4a28-afe7-7823846206c5
Test name: siren-tests/longevity-tests/staging/scylla-cloud-longevity-terraform-operations-24h-aws
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 9a05f3fd-2e20-4a28-afe7-7823846206c5
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 9a05f3fd-2e20-4a28-afe7-7823846206c5

Logs:

Jenkins job URL Argus

ilya-rarov avatar Aug 28 '23 19:08 ilya-rarov

Issue description

Another repro

The nemeses list:

  1. 2023-08-25 08:22:12 - 2023-08-25 08:41:44 - disrupt_mgmt_repair_cli
  2. 2023-08-25 09:18:53 - 2023-08-25 17:29:48 - disrupt_cloud_replace_responsive_node
  3. 2023-08-25 17:46:55 - 2023-08-25 19:36:20 - disrupt_grow_shrink_cluster
  4. 2023-08-25 19:49:55 - 2023-08-26 04:39:00 - disrupt_mgmt_repair_cli

The load settings were the same as for the first build.

When the disrupt_mgmt_repair_cli nemesis was triggered for the 2nd time, a new repair task was created:

< t:2023-08-25 19:49:55,962 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager sctool --cluster-id 11595 --  repair"...

While this task was running, the node scylla-cloud-operations-24h-devel-db-node-95b9d688-6 (172.27.152.183) ran out of space and stopped responding:

2023-08-26 04:00:39.960 <2023-08-26 04:00:39.781>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=8364e899-4d8f-4448-92fd-014346e3300b: type=NO_SPACE_ERROR regex=No space left on device line_number=49178 node=scylla-cloud-operations-24h-devel-db-node-95b9d688-6
2023-08-26T04:00:39.781 ip-172-27-152-183 !ERROR | scylla[21630]:  [shard 1] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)

In the screenshots from Grafana, the storage utilization for this node is represented by the blue line, along with some other graphs from the monitor [screenshots attached].

The repair task eventually failed

2023-08-26 04:39:00.425: (DisruptionEvent Severity.ERROR) period_type=end event_id=aa25f829-e86e-4518-8bac-25f94c2b9399 duration=8h49m5s: nemesis_name=MgmtRepairCli target_node=Node scylla-cloud-operations-24h-devel-db-node-95b9d688-1 [None | 172.27.154.211] (seed: True) errors=Task: repair/9f18425c-1874-448d-86c3-9553df8e5ed8 final status is: ERROR.
Task progress string: Run:		9395db84-4380-11ee-bf94-0a2270aa6e79
Status:		ERROR
Cause:		repair 122 token ranges out of 15360
Start time:	25 Aug 23 19:50:00 UTC
End time:	26 Aug 23 04:36:32 UTC
Duration:	8h46m32s
Progress:	99%/1%
Datacenters:
- AWS_EU_WEST_1
╭───────────────────────────────┬────────────────────────────────┬──────────┬──────────╮
│ Keyspace                      │                          Table │ Progress │ Duration │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ keyspace1                     │                      standard1 │ 92%/8%   │ 8h25m54s │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_auth                   │                role_attributes │ 100%     │ 2m48s    │
│ system_auth                   │                   role_members │ 100%     │ 2m36s    │
│ system_auth                   │               role_permissions │ 100%     │ 2m47s    │
│ system_auth                   │                          roles │ 100%     │ 2m23s    │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_distributed_everywhere │ cdc_generation_descriptions_v2 │ 100%     │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_distributed            │      cdc_generation_timestamps │ 100%     │ 2m21s    │
│ system_distributed            │    cdc_streams_descriptions_v2 │ 100%     │ 2m32s    │
│ system_distributed            │                 service_levels │ 100%     │ 2m16s    │
│ system_distributed            │              view_build_status │ 100%     │ 2m41s    │
╰───────────────────────────────┴────────────────────────────────┴──────────┴──────────╯
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4911, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 178, in wrapped
res = func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2807, in disrupt_mgmt_repair_cli
raise ScyllaManagerError(
sdcm.mgmt.common.ScyllaManagerError: Task: repair/9f18425c-1874-448d-86c3-9553df8e5ed8 final status is: ERROR.
Task progress string: Run:		9395db84-4380-11ee-bf94-0a2270aa6e79
Status:		ERROR
Cause:		repair 122 token ranges out of 15360
Start time:	25 Aug 23 19:50:00 UTC
End time:	26 Aug 23 04:36:32 UTC
Duration:	8h46m32s
Progress:	99%/1%
Datacenters:
- AWS_EU_WEST_1
╭───────────────────────────────┬────────────────────────────────┬──────────┬──────────╮
│ Keyspace                      │                          Table │ Progress │ Duration │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ keyspace1                     │                      standard1 │ 92%/8%   │ 8h25m54s │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_auth                   │                role_attributes │ 100%     │ 2m48s    │
│ system_auth                   │                   role_members │ 100%     │ 2m36s    │
│ system_auth                   │               role_permissions │ 100%     │ 2m47s    │
│ system_auth                   │                          roles │ 100%     │ 2m23s    │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_distributed_everywhere │ cdc_generation_descriptions_v2 │ 100%     │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_distributed            │      cdc_generation_timestamps │ 100%     │ 2m21s    │
│ system_distributed            │    cdc_streams_descriptions_v2 │ 100%     │ 2m32s    │
│ system_distributed            │                 service_levels │ 100%     │ 2m16s    │
│ system_distributed            │              view_build_status │ 100%     │ 2m41s    │
╰───────────────────────────────┴────────────────────────────────┴──────────┴──────────╯

Installation details

Kernel Version: 5.15.0-1040-aws
Scylla version (or git commit hash): 2022.2.12-20230727.f4448d5b0265 with build-id a87bfeb65d24abf65d074a3ba2e5b9664692d716

Cloud manager version: 3.2.0

Cluster size: 6 nodes (i4i.large)

Scylla Nodes used in this run:

  • scylla-cloud-operations-24h-devel-db-node-95b9d688-8 (null | 172.27.152.133) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-95b9d688-7 (null | 172.27.152.145) (shards: -1)
  • scylla-cloud-operations-24h-devel-db-node-95b9d688-6 (null | 172.27.152.183) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-95b9d688-5 (null | 172.27.153.34) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-95b9d688-4 (null | 172.27.152.86) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-95b9d688-3 (null | 172.27.154.160) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-95b9d688-2 (null | 172.27.153.73) (shards: 2)
  • scylla-cloud-operations-24h-devel-db-node-95b9d688-1 (null | 172.27.154.211) (shards: 2)

OS / Image: `` (aws: undefined_region)

Test: scylla-cloud-longevity-terraform-operations-24h-aws
Test id: 95b9d688-ce66-4f15-ae37-576501193062
Test name: siren-tests/longevity-tests/staging/scylla-cloud-longevity-terraform-operations-24h-aws
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 95b9d688-ce66-4f15-ae37-576501193062
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 95b9d688-ce66-4f15-ae37-576501193062

Logs:

Jenkins job URL Argus

ilya-rarov avatar Aug 28 '23 19:08 ilya-rarov

@roydahan - FYI

rayakurl avatar Aug 29 '23 08:08 rayakurl

I'm taking a look at that, but I would be very surprised if SM is the root cause of that, because I don't see how repair could cause a node to be OOM.

Michal-Leszczynski avatar Aug 29 '23 10:08 Michal-Leszczynski

I'm taking a look at that, but I would be very surprised if SM is the root cause of that, because I don't see how repair could cause a node to be OOM.

Thanks @Michal-Leszczynski, feel free to move the issue to siren-enterprise if SM is not the root cause.

rayakurl avatar Aug 29 '23 10:08 rayakurl

When the disrupt_mgmt_repair_cli nemesis was triggered for the 2nd time, a new repair task was created:

< t:2023-08-25 19:49:55,962 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager sctool --cluster-id 11595 --  repair"...

While this task was running, the node scylla-cloud-operations-24h-devel-db-node-95b9d688-6 (172.27.152.183) ran out of space and stopped responding

This is strange, as the logs suggest that node scylla-cloud-operations-24h-devel-db-node-95b9d688-6 hasn't been taking part in any repair (except for the first run of disrupt_mgmt_repair_cli), so repair shouldn't be the cause of that.

Unfortunately, I am not familiar with the test scenarios used, and the SM logs seem to be empty, so it would be good if I could get some information about that, especially the repair task configuration, what other SM tasks were running, etc.

Michal-Leszczynski avatar Aug 29 '23 12:08 Michal-Leszczynski

It's probably the same issue @ShlomiBalalis saw multiple times: repair has very, very elaborate logs, and those logs fill the disk. My question (as asked before in such bugs) is: what is consuming the whole storage?

mykaul avatar Aug 29 '23 13:08 mykaul

Especially about repair task configuration, what other SM tasks were running etc.

AFAICT, we did not apply any special configuration to the repair task. We just ran it using the following command:

/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager sctool --cluster-id 11595 --  repair

I guess the configuration was the default.

Regarding the other tasks: we regularly requested the status of the manager's tasks while the repair was running. Here is the first status (the repair had just started):

< t:2023-08-25 19:50:05,357 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager sctool --cluster-id 11595 --  tasks"...
╭─────────────────────────────────────────────┬─────────────┬────────┬──────────┬─────────┬───────┬────────────────────────┬────────────────────────┬─────────┬────────────────────────╮
│ Task                                        │ Schedule    │ Window │ Timezone │ Success │ Error │ Last Success           │ Last Error             │ Status  │ Next                   │
├─────────────────────────────────────────────┼─────────────┼────────┼──────────┼─────────┼───────┼────────────────────────┼────────────────────────┼─────────┼────────────────────────┤
│ backup/ScyllaCloud-#11595                   │ 1d          │        │          │ 0       │ 3     │                        │ 25 Aug 23 17:37:45 UTC │ ERROR   │ 25 Aug 23 20:08:50 UTC │
│ healthcheck/cql                             │ @every 15s  │        │ Etc/UTC  │ 2686    │ 0     │ 25 Aug 23 19:50:04 UTC │                        │ DONE    │ 25 Aug 23 19:50:19 UTC │
│ healthcheck/rest                            │ @every 1m0s │        │ Etc/UTC  │ 698     │ 0     │ 25 Aug 23 19:49:35 UTC │                        │ DONE    │ 25 Aug 23 19:50:35 UTC │
│ healthcheck/alternator                      │ @every 15s  │        │ Etc/UTC  │ 2753    │ 0     │ 25 Aug 23 19:50:07 UTC │                        │ DONE    │ 25 Aug 23 19:50:22 UTC │
│ repair/214e70f0-f8ff-45fb-b491-6e66db80f672 │ 7d          │        │          │ 0       │ 0     │                        │                        │ NEW     │ 26 Aug 23 07:39:41 UTC │
│ repair/4be13ea9-045f-418e-a7ba-67b991f31b95 │             │        │ Etc/UTC  │ 1       │ 0     │ 25 Aug 23 08:35:09 UTC │                        │ DONE    │                        │
│ repair/9f18425c-1874-448d-86c3-9553df8e5ed8 │             │        │ Etc/UTC  │ 0       │ 0     │                        │                        │ RUNNING │                        │
╰─────────────────────────────────────────────┴─────────────┴────────┴──────────┴─────────┴───────┴────────────────────────┴────────────────────────┴─────────┴────────────────────────╯

Here is one of the last statuses (the repair had just failed):

< t:2023-08-26 04:38:56,036 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager sctool --cluster-id 11595 --  tasks"...
╭─────────────────────────────────────────────┬─────────────┬────────┬──────────┬─────────┬───────┬────────────────────────┬────────────────────────┬─────────────┬────────────────────────╮
│ Task                                        │ Schedule    │ Window │ Timezone │ Success │ Error │ Last Success           │ Last Error             │ Status      │ Next                   │
├─────────────────────────────────────────────┼─────────────┼────────┼──────────┼─────────┼───────┼────────────────────────┼────────────────────────┼─────────────┼────────────────────────┤
│ backup/ScyllaCloud-#11595                   │ 1d          │        │          │ 0       │ 7     │                        │ 25 Aug 23 21:19:12 UTC │ ERROR       │ 26 Aug 23 20:08:50 UTC │
│ healthcheck/cql                             │ @every 15s  │        │ Etc/UTC  │ 4730    │ 0     │ 26 Aug 23 04:38:56 UTC │                        │ DONE        │ 26 Aug 23 04:39:11 UTC │
│ healthcheck/rest                            │ @every 1m0s │        │ Etc/UTC  │ 1227    │ 0     │ 26 Aug 23 04:38:36 UTC │                        │ DONE        │ 26 Aug 23 04:39:36 UTC │
│ healthcheck/alternator                      │ @every 15s  │        │ Etc/UTC  │ 4863    │ 0     │ 26 Aug 23 04:38:45 UTC │                        │ DONE        │ 26 Aug 23 04:39:00 UTC │
│ repair/214e70f0-f8ff-45fb-b491-6e66db80f672 │ 7d          │        │          │ 0       │ 0     │                        │                        │ NEW         │ 26 Aug 23 07:39:41 UTC │
│ repair/4be13ea9-045f-418e-a7ba-67b991f31b95 │             │        │ Etc/UTC  │ 1       │ 0     │ 25 Aug 23 08:35:09 UTC │                        │ DONE        │                        │
│ repair/9f18425c-1874-448d-86c3-9553df8e5ed8 │             │        │ Etc/UTC  │ 0       │ 1     │                        │ 26 Aug 23 04:36:32 UTC │ ERROR (0/3) │ 26 Aug 23 04:46:32 UTC │
╰─────────────────────────────────────────────┴─────────────┴────────┴──────────┴─────────┴───────┴────────────────────────┴────────────────────────┴─────────────┴────────────────────────╯

The list of tasks is the same; it looks like no other tasks were running during the repair.

ilya-rarov avatar Aug 29 '23 16:08 ilya-rarov

It can't be the logs, because it happens only on one node, and the pattern is that the used space goes up but also comes back down.

@ilya-rarov please also correlate the used-space graph with running compactions and possibly other advanced graphs that show differences between this node's behavior and the others' (e.g. a slower disk).

Lastly, how often does it reproduce?
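For reference, the correlation suggested above can also be pulled programmatically from the monitoring stack's Prometheus instead of eyeballing Grafana. A rough sketch only: the monitor address, the mountpoint, and the Scylla compaction gauge name are assumptions and may differ between monitoring versions.

import requests

PROM_URL = "http://<monitor-ip>:9090"   # hypothetical monitoring-stack Prometheus
NODE = "172.27.152.183"                 # the node that filled up in this repro
START, END = "2023-08-25T19:00:00Z", "2023-08-26T05:00:00Z"

def query_range(expr: str) -> list:
    # Standard Prometheus HTTP API range query.
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": expr, "start": START, "end": END, "step": "60s"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Fraction of the data disk used over time (node_exporter metrics; mountpoint assumed).
disk_used = query_range(
    f'1 - node_filesystem_avail_bytes{{mountpoint="/var/lib/scylla",instance=~"{NODE}.*"}}'
    f' / node_filesystem_size_bytes{{mountpoint="/var/lib/scylla",instance=~"{NODE}.*"}}'
)

# Number of running compactions on the same node (assumed Scylla metric name).
compactions = query_range(f'scylla_compaction_manager_compactions{{instance=~"{NODE}.*"}}')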

roydahan avatar Aug 29 '23 23:08 roydahan

@roydahan, I updated all the comments about the reproducers with some additional graphs. Please let me know if you think I need to add anything else. It looks like this problem reproduces in every build of this Cloud longevity test (starting from the 4th).

ilya-rarov avatar Sep 04 '23 18:09 ilya-rarov

@ilya-rarov it doesn't help to just keep adding more information from reproducers. We need to find an exact reproducer. It seems like the reproducer is the backup nemesis, not repair.

In summary, we need to understand:

  1. What is the raw dataset size (calculation from c-s prepare command).
  2. What is the size on disk after prepare. (metric from monitor)
  3. What is the size on disk before backup (metric from monitor)
  4. What is the size on disk after backup (metric from monitor).

Then we should make sure that: "data on disk before backup" + "backup temporary overhead" < "Size of disk of i4i.large"
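Spelled out as a quick check (a sketch only: the two inputs have to come from the monitor, and the ~468 GB default is the advertised i4i.large instance store per AWS specs, so treat it as an assumption):

def backup_fits(data_on_disk_before_backup_gb: float,
                backup_temp_overhead_gb: float,
                disk_size_gb: float = 468) -> bool:
    # True if the node should not run out of space during the backup.
    return data_on_disk_before_backup_gb + backup_temp_overhead_gb < disk_size_gb

# Example with placeholder numbers, to be replaced with the monitor's readings:
print(backup_fits(data_on_disk_before_backup_gb=250, backup_temp_overhead_gb=100))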

Another easy way to test it is to increase to i4i.xlarge or 2xlarge and see what size the "used disk" metric reaches.

roydahan avatar Sep 13 '23 19:09 roydahan

@ilya-rarov - As far as I remember, you ran this on a larger instance type, right?

@roydahan - even if we exceed the disk size, how come only one node is running out of space? All the nodes should be balanced if that is the case.

rayakurl avatar Sep 13 '23 19:09 rayakurl

I don't know. Maybe the manager does its operations sequentially on the nodes and not in parallel.

Another way to debug it is to keep the cluster alive and check what's taking the space on the disk.

Start by du -sh /var/lib/scylla/* and dig into it, while comparing to another node.
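A quick way to script that comparison across nodes (a sketch; assumes passwordless SSH as the test user and the default /var/lib/scylla data directory):

import subprocess

NODES = ["172.27.152.183", "172.27.153.34"]  # the problematic node vs. a healthy peer

for ip in NODES:
    # Run du on each node over SSH and print the per-directory breakdown side by side.
    result = subprocess.run(
        ["ssh", f"ubuntu@{ip}", "sudo du -sh /var/lib/scylla/*"],
        capture_output=True, text=True, check=False,
    )
    print(f"--- {ip} ---")
    print(result.stdout or result.stderr)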

roydahan avatar Sep 13 '23 20:09 roydahan

@ilya-rarov - As far as I remember, you ran this on a larger instance type, right?

That's right, I built a job using i3en.large, which has larger storage than i4i.large.

The storage utilization plots from Grafana [screenshots attached]:

This time none of the nodes got `No space left on device` (I guess thanks to the larger storage), but one of the nodes still stood out a bit, utilizing more space than the others. The spike happened during the MgmtBackup nemesis, which started at 2023-09-10 20:28:29 and failed at 2023-09-11 11:29:28 because it ran longer than the test expected.

2023-09-11 11:29:28.753: (DisruptionEvent Severity.ERROR) period_type=end event_id=4aae14e5-153a-4f2d-a3a1-8b41003c93f1 duration=15h0m59s: nemesis_name=MgmtBackup target_node=Node scylla-cloud-operations-24h-devel-db-node-b52dea9a-3 [None | 172.20.208.74] (seed: True) errors=Failed on waiting until task: backup/b4ca2993-e8d2-463e-81bb-98cb7cb3b7dc reaches status of ['ERROR (4/4)', 'DONE']: current task status RUNNING: Wait for: Waiting until task: backup/b4ca2993-e8d2-463e-81bb-98cb7cb3b7dc reaches status of: ['ERROR (4/4)', 'DONE']: timeout - 54000 seconds - expired
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 69, in wait_for
res = retry(func, **kwargs)
File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
do = self.iter(retry_state=retry_state)
File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 360, in iter
raise retry_exc.reraise()
File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 194, in reraise
raise self
tenacity.RetryError: RetryError[<Future at 0x7fe14bd16bc0 state=finished returned bool>]
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 399, in wait_for_status
return wait.wait_for(func=self.is_status_in_list, step=step, throw_exc=True,
File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 80, in wait_for
raise WaitForTimeoutError(err) from ex
sdcm.wait.WaitForTimeoutError: Wait for: Waiting until task: backup/b4ca2993-e8d2-463e-81bb-98cb7cb3b7dc reaches status of: ['ERROR (4/4)', 'DONE']: timeout - 54000 seconds - expired
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5011, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2724, in disrupt_mgmt_backup
self._mgmt_backup(backup_specific_tables=False)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2845, in _mgmt_backup
status = mgr_task.wait_and_get_final_status(timeout=54000, step=5, only_final=True)
File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 438, in wait_and_get_final_status
res = self.wait_for_status(list_status=list_final_status, timeout=timeout, step=step)
File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 403, in wait_for_status
raise WaitForTimeoutError(
sdcm.wait.WaitForTimeoutError: Failed on waiting until task: backup/b4ca2993-e8d2-463e-81bb-98cb7cb3b7dc reaches status of ['ERROR (4/4)', 'DONE']: current task status RUNNING: Wait for: Waiting until task: backup/b4ca2993-e8d2-463e-81bb-98cb7cb3b7dc reaches status of: ['ERROR (4/4)', 'DONE']: timeout - 54000 seconds - expired

According to Grafana, the task had reached only 43.6% at that point [screenshot attached].

Installation details

Scylla version (or git commit hash): 2022.2.13-0.20230824.575d727536c7 with build-id 712dd5b48dded7a24551b303738ab273fa7d82e2
Cluster size: 6 nodes (i3en.large)
Scylla running with shards number (live nodes):

  • scylla-cloud-operations-24h-devel-db-node-b52dea9a-1 (None | 172.20.210.253): 2 shards
  • scylla-cloud-operations-24h-devel-db-node-b52dea9a-2 (None | 172.20.209.144): 2 shards
  • scylla-cloud-operations-24h-devel-db-node-b52dea9a-3 (None | 172.20.208.74): 2 shards
  • scylla-cloud-operations-24h-devel-db-node-b52dea9a-4 (None | 172.20.208.244): 2 shards
  • scylla-cloud-operations-24h-devel-db-node-b52dea9a-5 (None | 172.20.210.16): 2 shards
  • scylla-cloud-operations-24h-devel-db-node-b52dea9a-6 (None | 172.20.209.218): 2 shards

  • Restore Monitor Stack command: $ hydra investigate show-monitor b52dea9a-13ac-4729-9c4e-ce78654e9ec6
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs b52dea9a-13ac-4729-9c4e-ce78654e9ec6

Test id: b52dea9a-13ac-4729-9c4e-ce78654e9ec6

Logs:

  • grafana - https://cloudius-jenkins-test.s3.amazonaws.com/b52dea9a-13ac-4729-9c4e-ce78654e9ec6/20230910_200731/grafana-screenshot-overview-20230910_200732-scylla-cloud-operations-24h-devel-monitor-node-b52dea9a-1.png
  • grafana - https://cloudius-jenkins-test.s3.amazonaws.com/b52dea9a-13ac-4729-9c4e-ce78654e9ec6/20230910_200731/grafana-screenshot-scylla-cloud-longevity-terraform-operations-24h-aws-ilya-scylla-per-server-metrics-nemesis-20230910_200851-scylla-cloud-operations-24h-devel-monitor-node-b52dea9a-1.png
  • db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/b52dea9a-13ac-4729-9c4e-ce78654e9ec6/20230911_161440/db-cluster-b52dea9a.tar.gz
  • loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/b52dea9a-13ac-4729-9c4e-ce78654e9ec6/20230911_161440/loader-set-b52dea9a.tar.gz
  • monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/b52dea9a-13ac-4729-9c4e-ce78654e9ec6/20230911_161440/monitor-set-b52dea9a.tar.gz
  • sct - https://cloudius-jenkins-test.s3.amazonaws.com/b52dea9a-13ac-4729-9c4e-ce78654e9ec6/20230911_161440/sct-b52dea9a.log.tar.gz
  • event - https://cloudius-jenkins-test.s3.amazonaws.com/b52dea9a-13ac-4729-9c4e-ce78654e9ec6/20230911_161440/sct-runner-events-b52dea9a.tar.gz
  • siren-manager-set - https://cloudius-jenkins-test.s3.amazonaws.com/b52dea9a-13ac-4729-9c4e-ce78654e9ec6/20230911_161440/siren-manager-set-b52dea9a.tar.gz

Jenkins job URL

ilya-rarov avatar Sep 17 '23 14:09 ilya-rarov

So, IIUC, we don't know what its final size is because the backup task failed in the middle. Can you please answer my questions in this comment?

Once you have the answers, please work with @ShlomiBalalis to understand whether it's different from the results he has for manager backup testing: what the expected overhead of a manager backup is, in percentage terms, and whether it makes sense for cloud clusters (also, why they don't hit the same issue).

roydahan avatar Sep 18 '23 12:09 roydahan