scylla-cluster-tests
[GCE] Quota 'SSD_TOTAL_GB' exceeded when running multiple "big" jobs
Packages
Issue description
- [ ] This issue is a regression.
- [x] It is unknown if this issue is a regression.
Running multiple customer-case setups in GCE got us to this limit:
2024-11-11 16:43:19.538: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=b2a085a7-43b8-415a-ba9c-16a44d15e530, source=LongevityTest.SetUp()
exception=403 FORBIDDEN QUOTA_EXCEEDED: Quota 'SSD_TOTAL_GB' exceeded. Limit: 81920.0 in region us-east1.
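For reference, a minimal sketch (not part of SCT itself) of checking the remaining SSD_TOTAL_GB headroom in the region with the google-cloud-compute Python client; the project ID below is a placeholder:
```python
# Minimal sketch: report SSD_TOTAL_GB usage vs. limit for a GCE region.
# Assumption: PROJECT is a placeholder for the real GCP project ID.
from google.cloud import compute_v1

PROJECT = "my-gcp-project"  # placeholder
REGION = "us-east1"

region_info = compute_v1.RegionsClient().get(project=PROJECT, region=REGION)
for quota in region_info.quotas:
    if quota.metric == "SSD_TOTAL_GB":
        print(f"SSD_TOTAL_GB in {REGION}: {quota.usage:.0f} / {quota.limit:.0f} GB")
```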
Impact
We are limited in the number of similar jobs we can run at a time.
How frequently does it reproduce?
Happened once so far.
Installation details
Cluster size: 10 nodes (n2-highmem-32)
Scylla Nodes used in this run:
- long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-5 (35.196.196.56 | 10.142.0.114) (shards: -1)
- long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-4 (35.227.4.245 | 10.142.0.111) (shards: -1)
- long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-3 (35.231.110.211 | 10.142.0.101) (shards: -1)
- long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-2 (34.73.74.3 | 10.142.0.94) (shards: -1)
- long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-1 (34.73.225.113 | 10.142.0.65) (shards: -1)
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/scylla-enterprise-2024-2-0-rc4-x86-64-2024-11-11t08-52-30 (gce: undefined_region)
Test: longevity-gce-custom-d1-worklod2-hybrid-raid-test
Test id: f867e96c-63e4-46be-addb-3a8a963f1dbe
Test name: enterprise-2024.2/longevity/longevity-gce-custom-d1-worklod2-hybrid-raid-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor f867e96c-63e4-46be-addb-3a8a963f1dbe - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs f867e96c-63e4-46be-addb-3a8a963f1dbe
Logs:
- db-cluster-f867e96c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f867e96c-63e4-46be-addb-3a8a963f1dbe/20241111_164354/db-cluster-f867e96c.tar.gz
- sct-runner-events-f867e96c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f867e96c-63e4-46be-addb-3a8a963f1dbe/20241111_164354/sct-runner-events-f867e96c.tar.gz
- sct-f867e96c.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f867e96c-63e4-46be-addb-3a8a963f1dbe/20241111_164354/sct-f867e96c.log.tar.gz
@roydahan what do you say, should we just ask for a bigger limit on our project?
How many jobs did we have in parallel? This specific job uses 15 TB, but if it's multi-DC, only half of it should be in us-east1, where we exceeded the quota. So it may be considered a "big job", but it consumes only ~9% of the quota.
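Rough arithmetic behind that ~9% figure, under the multi-DC assumption above (half of the 15 TB landing in us-east1, with the 81920 GB limit from the error):
```python
# Back-of-the-envelope check of the ~9% estimate (assumes an even 50/50 split
# across DCs and 1 TB = 1024 GB; limit taken from the quota error above).
in_region_gb = (15 / 2) * 1024   # ~7680 GB in us-east1
limit_gb = 81920.0
print(f"{in_region_gb / limit_gb:.1%}")  # ~9.4%
```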
This is the GCE backend; we use only a single region there, us-east1.
In the current case it was a simulated multi-DC setup that uses a single real region.
@roydahan
This one repeats when we have a release, or multiple releases, running at the same time. Keep in mind we have some customer cases, which are quite heavy, that we didn't have before, and this limit is blocking us from running them during release cycles.
I'm going to ask for ~30% more quota.
I claim that it happens when people aren't cleaning up resources. What we currently have is more than enough, but you can ask for an increase.
It was a bunch of stopped machines that no one noticed, holding 40K GB for more than 3 months.
- I've cleared those machines
- I'll look into why those weren't listed in the daily reports (see the sketch after this list for one way to spot them)
- I've asked for 30% more, so it won't hold us back during releases (it seems we have peaks multiple times a week)
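For the daily-report follow-up, a minimal sketch (not part of SCT) of one way to spot stopped instances that are still holding persistent disks; the project ID is a placeholder:
```python
# Minimal sketch: list TERMINATED (stopped) instances and the disk space they hold.
# Assumption: PROJECT is a placeholder for the real GCP project ID.
from google.cloud import compute_v1

PROJECT = "my-gcp-project"  # placeholder

client = compute_v1.InstancesClient()
for zone, scoped in client.aggregated_list(project=PROJECT):
    for inst in scoped.instances:
        if inst.status != "TERMINATED":
            continue
        held_gb = sum(disk.disk_size_gb for disk in inst.disks)
        print(f"{zone} {inst.name}: stopped, still holding {held_gb} GB")
```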
Closing for now, until we hit this kind of thing again.