
From time to time monitor creation fails despite the instance being created

Open · juliayakovlev opened this issue 1 year ago · 1 comment

From time to time, monitor creation fails even though the instance was created. Error:

< t:2024-03-06 20:54:36,659 f:gce_utils.py    l:504  c:sdcm.utils.gce_utils p:DEBUG > Creating the rolling-upgrade--centos-8-monitor-node-aff52e4d-0-1 instance in us-east1-c...
< t:2024-03-06 20:54:38,001 f:retry.py        l:351  c:urllib3.util.retry   p:DEBUG > Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
< t:2024-03-06 20:54:46,895 f:gce_utils.py    l:314  c:sdcm.utils.gce_utils p:DEBUG > Warnings during instance creation:
< t:2024-03-06 20:54:46,895 f:gce_utils.py    l:316  c:sdcm.utils.gce_utils p:DEBUG >  - DISK_SIZE_LARGER_THAN_IMAGE_SIZE: Disk size: '50 GB' is larger than image size: '20 GB'. You might need to resize the root repartition manually if the operating system does not support automatic resizing. See https://cloud.google.com/compute/docs/disks/add-persistent-disk#resize_pd for details.
< t:2024-03-06 20:54:46,895 f:gce_utils.py    l:510  c:sdcm.utils.gce_utils p:DEBUG > Instance rolling-upgrade--centos-8-monitor-node-aff52e4d-0-1 created.
< t:2024-03-06 20:54:46,927 f:decorators.py   l:72   c:sdcm.utils.decorators p:DEBUG > '_create_node_with_retries': failed with 'Forbidden("GET https://compute.googleapis.com/compute/v1/projects/sct-project-1/zones/us-east1-c/instances/rolling-upgrade--centos-8-monitor-node-aff52e4d-0-1: Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute per region' of service 'compute.googleapis.com' for consumer 'project_number:1070746702980'.")', retrying [#0]

After that we retry the creation, and it fails again because the instance already exists:

< t:2024-03-06 21:24:48,468 f:gce_utils.py    l:504  c:sdcm.utils.gce_utils p:DEBUG > Creating the rolling-upgrade--centos-8-monitor-node-aff52e4d-0-1 instance in us-east1-c...
< t:2024-03-06 21:24:49,559 f:decorators.py   l:72   c:sdcm.utils.decorators p:DEBUG > '_create_node_with_retries': failed with 'Conflict("POST https://compute.googleapis.com/compute/v1/projects/sct-project-1/zones/us-east1-c/instances: The resource 'projects/sct-project-1/zones/us-east1-c/instances/rolling-upgrade--centos-8-monitor-node-aff52e4d-0-1' already exists")', retrying [#2]
< t:2024-03-06 21:39:49,587 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > exception=409 POST https://compute.googleapis.com/compute/v1/projects/sct-project-1/zones/us-east1-c/instances: The resource 'projects/sct-project-1/zones/us-east1-c/instances/rolling-upgrade--centos-8-monitor-node-aff52e4d-0-1' already exists
< t:2024-03-06 21:39:49,580 f:tester.py       l:178  c:sdcm.tester          p:ERROR > google.api_core.exceptions.Conflict: 409 POST https://compute.googleapis.com/compute/v1/projects/sct-project-1/zones/us-east1-c/instances: The resource 'projects/sct-project-1/zones/us-east1-c/instances/rolling-upgrade--centos-8-monitor-node-aff52e4d-0-1' already exists
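
To make the sequence concrete, here is a minimal, hypothetical sketch of the pattern the logs suggest. The decorator and function bodies below are assumptions for illustration only, not the actual `sdcm` code: the create POST and the follow-up read sit inside the same retried function, so a 403 quota error on the read replays the POST, which then fails with 409 Conflict.

```python
# Hypothetical reconstruction of the pattern seen in the logs above;
# names and structure are assumptions, not the actual SCT implementation.
from google.cloud import compute_v1


def retrying(retries=3):
    """Simplified stand-in for a generic retry decorator."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(retries):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:  # broad catch: sketch only
                    last_exc = exc
            raise last_exc
        return wrapper
    return decorator


@retrying(retries=3)
def create_monitor_node(project: str, zone: str, instance: compute_v1.Instance):
    client = compute_v1.InstancesClient()
    # POST: this can succeed and actually create the instance...
    client.insert(project=project, zone=zone, instance_resource=instance).result()
    # ...but if this follow-up GET is throttled (403 "Read requests per minute
    # per region"), the whole function is retried, and the next POST hits
    # 409 Conflict because the instance already exists.
    return client.get(project=project, zone=zone, instance=instance.name)
```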

Issue description

  • [ ] This issue is a regression.
  • [ ] It is unknown if this issue is a regression.

Describe your issue in detail and the steps taken to reproduce it.

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with which this issue can be reproduced.

Installation details

Cluster size: 4 nodes (n2-highmem-16)

Scylla Nodes used in this run:

  • rolling-upgrade--centos-8-db-node-aff52e4d-0-4 (34.148.92.165 | 10.142.15.230) (shards: -1)
  • rolling-upgrade--centos-8-db-node-aff52e4d-0-3 (35.237.201.62 | 10.142.15.229) (shards: -1)
  • rolling-upgrade--centos-8-db-node-aff52e4d-0-2 (35.231.17.226 | 10.142.15.225) (shards: -1)
  • rolling-upgrade--centos-8-db-node-aff52e4d-0-1 (34.23.89.16 | 10.142.15.220) (shards: -1)

OS / Image: https://www.googleapis.com/compute/v1/projects/centos-cloud/global/images/family/centos-stream-8 (gce: undefined_region)

Test: rolling-upgrade-centos8-test
Test id: aff52e4d-4769-4151-a172-014fad77b5ce
Test name: enterprise-2023.1/rolling-upgrade/rolling-upgrade-centos8-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor aff52e4d-4769-4151-a172-014fad77b5ce
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs aff52e4d-4769-4151-a172-014fad77b5ce

Logs:

Jenkins job URL Argus

juliayakovlev · Mar 10 '24 17:03

First, we probably need to ask for more quota.

Second, we could try ignoring the conflict and continuing, but it might backfire: we have had cases where such duplication caused bugs in pipelines that used the same test-id.

Let's do a lookup (in the mails) of how often this issue is happening, to assess whether we should address it or not.

fruch · Mar 10 '24 17:03
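
For the second option above, a minimal sketch of what conflict-tolerant creation could look like, assuming the google-cloud-compute client (the helper name is made up and this is not the SCT implementation): on a 409 Conflict, fetch the existing instance and reuse it instead of failing the run.

```python
# Illustrative sketch only; helper name and structure are hypothetical.
from google.api_core import exceptions as gexc
from google.cloud import compute_v1


def create_or_reuse_instance(project: str, zone: str,
                             instance: compute_v1.Instance) -> compute_v1.Instance:
    client = compute_v1.InstancesClient()
    try:
        client.insert(project=project, zone=zone, instance_resource=instance).result()
    except gexc.Conflict:
        # A previous attempt already created the instance (e.g. the follow-up
        # read was throttled); fall through and reuse the existing one.
        pass
    return client.get(project=project, zone=zone, instance=instance.name)
```

As noted in the comment above, this would need a safeguard (for example, checking that the existing instance really belongs to the current test-id) before ignoring the conflict blindly.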