scylla-cluster-tests
[GCE] Quota 'SSD_TOTAL_GB' exceeded when running multiple "big" jobs
Packages
Issue description
- [ ] This issue is a regression.
- [x] It is unknown if this issue is a regression.
Running multiple customer-case setups in GCE got us to this limit:
2024-11-11 16:43:19.538: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=b2a085a7-43b8-415a-ba9c-16a44d15e530, source=LongevityTest.SetUp()
exception=403 FORBIDDEN QUOTA_EXCEEDED: Quota 'SSD_TOTAL_GB' exceeded. Limit: 81920.0 in region us-east1.
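For reference, a minimal sketch (not part of SCT itself) of checking the remaining SSD_TOTAL_GB headroom in the region with the google-cloud-compute Python client; the project ID below is a placeholder:
```python
# Minimal sketch: report SSD_TOTAL_GB usage vs. limit for a GCE region.
# Assumption: PROJECT is a placeholder for the real GCP project ID.
from google.cloud import compute_v1

PROJECT = "my-gcp-project"  # placeholder
REGION = "us-east1"

region_info = compute_v1.RegionsClient().get(project=PROJECT, region=REGION)
for quota in region_info.quotas:
    if quota.metric == "SSD_TOTAL_GB":
        print(f"SSD_TOTAL_GB in {REGION}: {quota.usage:.0f} / {quota.limit:.0f} GB")
```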
Impact
We are limited in the number of similar jobs we can run at a time.
How frequently does it reproduce?
Happened once so far.
Installation details
Cluster size: 10 nodes (n2-highmem-32)
Scylla Nodes used in this run:
- long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-5 (35.196.196.56 | 10.142.0.114) (shards: -1)
- long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-4 (35.227.4.245 | 10.142.0.111) (shards: -1)
- long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-3 (35.231.110.211 | 10.142.0.101) (shards: -1)
- long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-2 (34.73.74.3 | 10.142.0.94) (shards: -1)
- long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-1 (34.73.225.113 | 10.142.0.65) (shards: -1)
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/scylla-enterprise-2024-2-0-rc4-x86-64-2024-11-11t08-52-30 (gce: undefined_region)
Test: longevity-gce-custom-d1-worklod2-hybrid-raid-test
Test id: f867e96c-63e4-46be-addb-3a8a963f1dbe
Test name: enterprise-2024.2/longevity/longevity-gce-custom-d1-worklod2-hybrid-raid-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor f867e96c-63e4-46be-addb-3a8a963f1dbe - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs f867e96c-63e4-46be-addb-3a8a963f1dbe
Logs:
- db-cluster-f867e96c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f867e96c-63e4-46be-addb-3a8a963f1dbe/20241111_164354/db-cluster-f867e96c.tar.gz
- sct-runner-events-f867e96c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f867e96c-63e4-46be-addb-3a8a963f1dbe/20241111_164354/sct-runner-events-f867e96c.tar.gz
- sct-f867e96c.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f867e96c-63e4-46be-addb-3a8a963f1dbe/20241111_164354/sct-f867e96c.log.tar.gz
@roydahan what do you say, should we just ask for a bigger limit on our project?
How many jobs did we have in parallel? This specific job uses 15 TB, but if it's multi-DC, only half of it should be in us-east1, where we exceeded the quota. So it may be considered a "big job", but it consumes only ~9% of the quota.
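Rough arithmetic behind that ~9% figure, under the multi-DC assumption above (half of the 15 TB landing in us-east1, with the 81920 GB limit from the error):
```python
# Back-of-the-envelope check of the ~9% estimate (assumes an even 50/50 split
# across DCs and 1 TB = 1024 GB; limit taken from the quota error above).
in_region_gb = (15 / 2) * 1024   # ~7680 GB in us-east1
limit_gb = 81920.0
print(f"{in_region_gb / limit_gb:.1%}")  # ~9.4%
```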
This is the GCE backend; we use only a single region there, us-east1.
In the current case it was a simulated multi-DC setup that uses a single real region.
@roydahan
This one repeats when we have a release, or multiple releases, running at the same time. Keep in mind we have some customer cases, which are quite heavy, that we didn't have before, and this limit is blocking us from running them during release cycles.
I'm going to ask for ~30% more quota.
I claim that it happens when people aren't cleaning up resources. What we currently have is more than enough, but you can ask for an increase.
It was a bunch of stopped machines that no one noticed, holding 40K GB for more than 3 months.
- I've cleared those machines
- I'll look into why those weren't listed in the daily reports (see the sketch after this list for one way to spot them)
- I've asked for 30% more, so it won't hold us back during releases (it seems we have peaks multiple times a week)
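For the daily-report follow-up, a minimal sketch (not part of SCT) of one way to spot stopped instances that are still holding persistent disks; the project ID is a placeholder:
```python
# Minimal sketch: list TERMINATED (stopped) instances and the disk space they hold.
# Assumption: PROJECT is a placeholder for the real GCP project ID.
from google.cloud import compute_v1

PROJECT = "my-gcp-project"  # placeholder

client = compute_v1.InstancesClient()
for zone, scoped in client.aggregated_list(project=PROJECT):
    for inst in scoped.instances:
        if inst.status != "TERMINATED":
            continue
        held_gb = sum(disk.disk_size_gb for disk in inst.disks)
        print(f"{zone} {inst.name}: stopped, still holding {held_gb} GB")
```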
Closing for now, until we hit this kind of thing again.