
[GCE] Quota 'SSD_TOTAL_GB' exceeded when running multiple "big" jobs

Open • fruch opened this issue 1 year ago • 6 comments


Issue description

  • [ ] This issue is a regression.
  • [x] It is unknown if this issue is a regression.

Running multiple customer-case setups in GCE pushed us over this quota limit:

2024-11-11 16:43:19.538: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=b2a085a7-43b8-415a-ba9c-16a44d15e530, source=LongevityTest.SetUp()
exception=403 FORBIDDEN QUOTA_EXCEEDED: Quota 'SSD_TOTAL_GB' exceeded.  Limit: 81920.0 in region us-east1.
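
For triage, a quota check like the one below can tell how close a region is to the limit before launching a job. This is a minimal sketch using the google-cloud-compute client; the project id is a placeholder, and the field names assume the current compute_v1 API:

```python
from google.cloud import compute_v1

def ssd_quota_headroom(project: str, region: str) -> float:
    """Return the remaining SSD_TOTAL_GB quota (in GB) for a region."""
    region_info = compute_v1.RegionsClient().get(project=project, region=region)
    for quota in region_info.quotas:
        if quota.metric == "SSD_TOTAL_GB":
            return quota.limit - quota.usage
    raise LookupError(f"SSD_TOTAL_GB quota not reported in {region}")

# 'my-sct-project' is a placeholder project id.
print(ssd_quota_headroom("my-sct-project", "us-east1"))
```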

Impact

We are limited in the number of similar jobs we can run at the same time.

How frequently does it reproduce?

Happened once so far.

Installation details

Cluster size: 10 nodes (n2-highmem-32)

Scylla Nodes used in this run:

  • long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-5 (35.196.196.56 | 10.142.0.114) (shards: -1)
  • long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-4 (35.227.4.245 | 10.142.0.111) (shards: -1)
  • long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-3 (35.231.110.211 | 10.142.0.101) (shards: -1)
  • long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-2 (34.73.74.3 | 10.142.0.94) (shards: -1)
  • long-custom-d1-wrkld2-2024-2-db-node-f867e96c-0-1 (34.73.225.113 | 10.142.0.65) (shards: -1)

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/scylla-enterprise-2024-2-0-rc4-x86-64-2024-11-11t08-52-30 (gce: undefined_region)

  • Test: longevity-gce-custom-d1-worklod2-hybrid-raid-test
  • Test id: f867e96c-63e4-46be-addb-3a8a963f1dbe
  • Test name: enterprise-2024.2/longevity/longevity-gce-custom-d1-worklod2-hybrid-raid-test
  • Test method: longevity_test.LongevityTest.test_custom_time
  • Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor f867e96c-63e4-46be-addb-3a8a963f1dbe
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs f867e96c-63e4-46be-addb-3a8a963f1dbe

Logs:

  • Jenkins job URL
  • Argus

fruch • Nov 13 '24 07:11

@roydahan what do you say, should we just ask for a bigger limit in our project?

fruch • Nov 13 '24 14:11

How many jobs did we have in parallel? This specific job uses 15 TB, but if it's multi-DC, only half of it should be in us-east1, where we exceeded the quota. So it may be considered a "big job", but it consumes only ~9% of the quota.

roydahan • Nov 14 '24 23:11

> How many jobs did we have in parallel? This specific job uses 15 TB, but if it's multi-DC, only half of it should be in us-east1, where we exceeded the quota. So it may be considered a "big job", but it consumes only ~9% of the quota.

This is the GCE backend. We use only a single region there, us-east1. In the current case it was a simulated multi-DC setup that uses a single real region.

vponomaryov • Nov 15 '24 10:11
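
For reference, here is the arithmetic behind the two comments above, using only numbers from this thread (the 81920 GB limit from the error and the ~15 TB job size quoted by @roydahan):

```python
QUOTA_GB = 81920.0   # SSD_TOTAL_GB limit in us-east1 (from the error above)
JOB_GB = 15_000      # this job's total SSD footprint, ~15 TB

# Real multi-DC: half the nodes would land in another region.
print(JOB_GB / 2 / QUOTA_GB)    # ~0.09 -> the "~9%" figure

# Simulated multi-DC on the GCE backend: everything stays in us-east1.
print(JOB_GB / QUOTA_GB)        # ~0.18 -> the job actually takes ~18%

# Upper bound on concurrent jobs of this size in one region:
print(int(QUOTA_GB // JOB_GB))  # 5
```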

@roydahan

this one repeats when we have a release, or multiple releases, running at the same time. Keep in mind we have some customer cases, which are quite heavy, that we didn't have before, and this limit is blocking us from running them during release cycles.

I'm gonna ask for ~30% more capacity.

fruch • Mar 13 '25 06:03

I claim that it happens when people aren't cleaning up resources. What we currently have is more than enough, but you can ask for an increase.

roydahan • Mar 19 '25 14:03

> I claim that it happens when people aren't cleaning up resources. What we currently have is more than enough, but you can ask for an increase.

It was a bunch of stopped machines that no one noticed, holding 40K GB for more than 3 months.

  1. I've cleared those machines.
  2. I'll look into why those weren't listed in the daily reports (see the sketch below for one way to find them).
  3. I've asked for 30% more, so it won't block us during releases (it seems we have peaks multiple times a week).
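
One way the leak could have been spotted: scan for stopped instances whose persistent disks still count against SSD quota (persistent disks consume quota whether the instance is running or stopped). A rough sketch with the google-cloud-compute client; the project id is a placeholder, and a real report would also fetch each Disk resource to filter for pd-ssd specifically:

```python
from google.cloud import compute_v1

PROJECT = "my-sct-project"  # placeholder project id

client = compute_v1.InstancesClient()
held_gb = 0
# aggregated_list yields (zone, InstancesScopedList) pairs across all zones.
for zone, scoped in client.aggregated_list(project=PROJECT):
    for inst in scoped.instances or []:
        if inst.status != "TERMINATED":  # TERMINATED == stopped in GCE
            continue
        gb = sum(d.disk_size_gb for d in inst.disks)
        held_gb += gb
        print(f"{inst.name} ({zone}): stopped, {gb} GB of disks attached")

print(f"Total disk GB held by stopped instances: {held_gb}")
# For scale: the 40K GB found here is ~49% of the 81920 GB quota,
# and the requested ~30% bump would raise the limit to ~106,496 GB.
```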

fruch • Mar 20 '25 07:03

Closing for now, until we hit this kind of thing again.

fruch • Apr 03 '25 16:04