scylla-cluster-tests

`reach_enospc_on_node` was "paused" and continued running 4 hours later

Open juliayakovlev opened this issue 1 year ago • 7 comments

The Enospace nemesis ran on longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-3 at about 2024-04-02 13:51:

< t:2024-04-02 13:51:27,189 f:wait.py         l:52   c:sdcm.wait            p:DEBUG > wait_for: Retrying Wait for new ENOSPC error occurs in database: attempt 4 ended with: False

sdcm.utils.common.reach_enospc_on_node runs fallocate to occupy disk space and provoke an "ENOSPC" DB error. It is triggered up to 5 times, until ENOSPC is reached.

The 4th attempt ended with a False status, but right after that the "ENOSPC" DB error was received, so fallocate was expected to stop running. However, 4 hours later, during the HardRebootNode nemesis, fallocate was unexpectedly triggered again, which caused ENOSPC again when it should not have happened:

2024-04-02T19:10:33.117+00:00 longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-3   !NOTICE | sudo[25664]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/fallocate -l 3163262287K /var/lib/scylla/occupy_90percent.1712065869.4517982
2024-04-02T19:10:33.118+00:00 longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-3   !NOTICE | sudo[25669]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/fallocate -l 316208240K /var/lib/scylla/occupy_90percent.1712065875.7862625
2024-04-02T19:10:33.118+00:00 longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-3   !NOTICE | sudo[25675]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/fallocate -l 31589173K /var/lib/scylla/occupy_90percent.1712065881.459667
2024-04-02T19:10:33.119+00:00 longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-3   !NOTICE | sudo[25680]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/fallocate -l 2752405K /var/lib/scylla/occupy_90percent.1712065886.606151
2024-04-02T19:10:33.119+00:00 longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-3      !ERR | scylla[24106]:  [shard  2] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)

It is not clear why reach_enospc_on_node was paused.
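For context, a minimal sketch of the retry behaviour described above - not the actual SCT implementation. The ssh transport, the /var/log/messages path checked for the error, and the 90% sizing are assumptions for illustration; only the fallocate command and the occupy_90percent.<timestamp> file-name pattern come from the log excerpts.

```python
import subprocess
import time

def _ssh(node_ip: str, cmd: str) -> str:
    """Run a command on the node over ssh (illustrative transport only)."""
    return subprocess.run(["ssh", node_ip, cmd], capture_output=True,
                          text=True, check=True).stdout.strip()

def reach_enospc_sketch(node_ip: str, max_attempts: int = 5) -> bool:
    """Keep allocating ~90% of the remaining space until the node hits ENOSPC."""
    for _ in range(max_attempts):
        free_kb = int(_ssh(node_ip, "df --output=avail -k /var/lib/scylla | tail -1"))
        # A fresh epoch timestamp is embedded in every file name, matching the
        # occupy_90percent.<timestamp> files visible in the logs above.
        occupy_file = f"/var/lib/scylla/occupy_90percent.{time.time()}"
        _ssh(node_ip, f"sudo fallocate -l {int(free_kb * 0.9)}K {occupy_file}")
        # Log path is an assumption; the real check watches the DB log events.
        hits = _ssh(node_ip, "grep -c 'No space left on device' /var/log/messages || true")
        if int(hits or 0) > 0:
            return True  # ENOSPC reached - stop retrying
    return False
```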

Packages

Scylla version: 2023.1.7-20240402.1f84904a7100 with build-id d8d3f80469204be472563634963c19f86b295af0
Kernel Version: 5.15.0-1056-aws

Issue description

  • [ ] This issue is a regression.
  • [ ] It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-9 (54.172.241.192 | 10.12.9.213) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-8 (3.91.82.75 | 10.12.10.91) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-7 (34.227.58.82 | 10.12.11.22) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-6 (3.87.92.36 | 10.12.9.21) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-5 (3.82.46.122 | 10.12.8.217) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-4 (35.173.129.110 | 10.12.10.114) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-3 (18.215.157.218 | 10.12.9.10) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-2 (54.227.86.85 | 10.12.8.69) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-15 (54.227.56.75 | 10.12.10.93) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-14 (3.90.209.59 | 10.12.10.186) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-13 (44.223.31.62 | 10.12.9.249) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-12 (18.212.209.241 | 10.12.9.140) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-11 (34.228.15.225 | 10.12.10.197) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-10 (54.161.10.75 | 10.12.11.44) (shards: 14)
  • longevity-tls-50gb-3d-2023-1-db-node-e4715f1f-1 (54.226.232.95 | 10.12.10.142) (shards: 14)

OS / Image: ami-0bacdd33da744755f (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: e4715f1f-ca8b-4ae0-af80-fe31a242fbc0
Test name: enterprise-2023.1/longevity/longevity-50gb-3days-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor e4715f1f-ca8b-4ae0-af80-fe31a242fbc0
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs e4715f1f-ca8b-4ae0-af80-fe31a242fbc0

Logs:

Jenkins job URL Argus

juliayakovlev avatar Apr 03 '24 13:04 juliayakovlev

The timestamps in the file names in the fallocate commands match 13:51.

So these are the actual commands from the enospc nemesis, but nothing explains how they were run again...

So it's not that the same Python code was executed again, since a fresh run would have put a fresh timestamp in the file name.

So either we have some very weird retry that starts in a thread we aren't aware of,

or it's something on the sshd side - but I have never heard of such a "feature" in ssh.
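A quick way to verify the timestamp claim - a standard-library sketch, where the values are the epoch suffixes taken from the 19:10 log lines above:

```python
from datetime import datetime, timezone

# Epoch suffixes from the occupy_90percent file names in the 19:10 log lines
for ts in (1712065869.4517982, 1712065875.7862625, 1712065881.459667, 1712065886.606151):
    print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
# All four fall between 13:51:09 and 13:51:26 UTC on 2024-04-02, i.e. the
# original nemesis run, not 19:10 when the commands reappear in messages.log.
```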

fruch avatar Apr 03 '24 19:04 fruch

I think this is our old friend 'log duplication' - note that after these hours we see exactly the same Scylla PIDs; the logs just repeat.

soyacz avatar Apr 19 '24 15:04 soyacz

The same happened here - logs are duplicated in messages.log (while system.log looks fine). E.g. see the error node_ops - decommission[e0cb7dab-6768-4989-a5da-9d6ac1b60ee5]: Operation failed. So syslog-ng is giving us some surprises.
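A minimal way to spot such duplication - a sketch, assuming messages.log keeps the ISO-timestamp prefix shown in the excerpts above: strip the receive timestamp (which is generated on the sct-runner) and count how often the remaining host/process[pid]/message payload repeats.

```python
from collections import Counter
import re
import sys

# Receive-timestamp prefix format assumed from the log excerpts above.
ISO_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}T\S+\s+")

def duplicate_lines(path: str, min_count: int = 2) -> list[tuple[str, int]]:
    """Return payloads that appear at least `min_count` times, most frequent first."""
    with open(path, errors="replace") as log:
        counts = Counter(ISO_PREFIX.sub("", line.rstrip("\n")) for line in log)
    return [(line, n) for line, n in counts.most_common() if n >= min_count]

if __name__ == "__main__":
    for line, n in duplicate_lines(sys.argv[1])[:20]:
        print(f"{n:>4}x  {line}")
```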

Installation details

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-master-db-node-f460b9df-5 (3.255.97.78 | 10.4.8.202) (shards: 7)
  • longevity-twcs-48h-master-db-node-f460b9df-4 (34.241.78.154 | 10.4.8.78) (shards: 7)
  • longevity-twcs-48h-master-db-node-f460b9df-3 (3.255.170.137 | 10.4.10.177) (shards: 7)
  • longevity-twcs-48h-master-db-node-f460b9df-2 (34.242.193.87 | 10.4.10.110) (shards: 7)
  • longevity-twcs-48h-master-db-node-f460b9df-1 (54.75.185.228 | 10.4.8.69) (shards: 7)

OS / Image: ami-03a7123d7bfc23027 (aws: undefined_region)

Test: longevity-twcs-48h-test
Test id: f460b9df-f7ed-443c-aad0-9111d90a2f78
Test name: scylla-master/tier1/longevity-twcs-48h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor f460b9df-f7ed-443c-aad0-9111d90a2f78
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs f460b9df-f7ed-443c-aad0-9111d90a2f78

Logs:

Jenkins job URL Argus

soyacz avatar Apr 19 '24 15:04 soyacz

> I think this is our old friend 'log duplication' - note that after these hours we see exactly the same Scylla PIDs; the logs just repeat.

I remember we had such issues with operator runs; I don't recall ever seeing those with syslog-ng.

fruch avatar Apr 21 '24 05:04 fruch

> I think this is our old friend 'log duplication' - note that after these hours we see exactly the same Scylla PIDs; the logs just repeat.

> I remember we had such issues with operator runs; I don't recall ever seeing those with syslog-ng.

I think it was with rsyslog. Anyway, these are just duplicates and we need to figure out why syslog-ng sends them again.

soyacz avatar Apr 22 '24 08:04 soyacz

> I think this is our old friend 'log duplication' - note that after these hours we see exactly the same Scylla PIDs; the logs just repeat.

> I remember we had such issues with operator runs; I don't recall ever seeing those with syslog-ng.

> I think it was with rsyslog. Anyway, these are just duplicates and we need to figure out why syslog-ng sends them again.

Regarding the original report here, it wasn't log duplication: the lines weren't identical, since the file names being created carried different timestamps.

fruch avatar Apr 25 '24 12:04 fruch

I also verified the original report and found that the command and error patterns are identical - even the Scylla PID is the same - which normally does not happen. Only the timestamps differ, but I believe those are generated on the sct-runner (for messages.log).
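One way to reproduce that verification - a sketch with an assumed messages.log path and the two time windows from this report; only the fallocate pattern and the timestamp prefix come from the excerpts above. It strips the sct-runner receive timestamps and checks that every 19:10 payload already appeared verbatim at 13:51.

```python
import re
import subprocess

ISO_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}T\S+\s+")

def payloads(path: str, window: str) -> list[str]:
    """Lines whose sct-runner receive time starts with `window`, timestamps stripped."""
    out = subprocess.run(["grep", window, path], capture_output=True, text=True)
    return [ISO_PREFIX.sub("", line) for line in out.stdout.splitlines()]

original = set(payloads("messages.log", "2024-04-02T13:51"))
replayed = payloads("messages.log", "2024-04-02T19:10")
# If 19:10 is a syslog-ng replay, every payload - PIDs included - should
# already exist verbatim in the 13:51 window.
print(all(line in original for line in replayed if "fallocate" in line))
```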

soyacz avatar Apr 26 '24 04:04 soyacz

@soyacz

yes, now I see it's the exact same timestamp on the command itself, and it was caused by the hard reboot: after a hard reboot we configured the logging system from scratch, and now we shouldn't be doing that anymore

because of your recent change.

fruch avatar May 06 '24 06:05 fruch