scylla-operator
scylla-operator copied to clipboard
[WIP] Fix manager task reconciliation and status synchronisation
Description of your changes: As described in https://github.com/scylladb/scylla-operator/issues/1752, there's currently an issue with manager tasks being deleted and recreated when the manager controller reconciles an older version of the object. The root cause of the issue is that the controller depends on the object's status to save and retrieve the task's ID, which it uses for identifying the task. If an older generation of the object is reconciled, and it's missing the task's ID from the status, it's going to delete the task and recreate it, even if the task is present in the manager's state. The main change in this PR is to make the controller use tasks' names for identity, instead of their ID, which can't be retrieved if it's missing from the status.
To make it possible, the PR modifies the validation logic to allow for non-unique names across tasks of different types. This is aligned with Scylla Manager, which doesn't impose such restrictions, i.e. it is enough that the name is unique across tasks of equal type. This allows us to use tasks' name for identity, and in turn, reconciling the tasks without retrieving their IDs.
Futhermore, currently the tasks' statuses inline tasks' specs as defined in ScyllaCluster spec. Due to the spec having some default values, this currently causes the statuses to not be aligned with manager's state, and even displaying default values on task scheduling error. This PR fixes it by:
- splitting tasks' statuses from specs in API definition to avoid the default values being displayed and making the fields optional
- propagating tasks' statuses as they're pertained in manager state, with an exception for client side errors - in this case, only the task name and error are saved in the object's status, instead of copying over the definition from spec
On top of that, the existing e2e covering repair task and an invalid backup task integration is split into two separate tests: one testing the repair task integration, and the other testing error propagation for both repair and backup tasks.
As agreed internally, this PR tries to achieve while minimising the amount of refactoring of the existing codebase. Therefore the general flow of the manager controller is maintained as is.
Prerequisite for https://github.com/scylladb/scylla-operator/issues/1671.
Which issue is resolved by this Pull Request: Resolves #1752
/kind bug /priority critical-urgent /cc
@rzetelskik: GitHub didn't allow me to request PR reviews from the following users: rzetelskik.
Note that only scylladb members and repo collaborators can review this PR, and authors cannot review their own PRs.
In response to this:
Description of your changes:
Which issue is resolved by this Pull Request: Resolves #1752
/kind bug /priority critical-urgent /cc
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/hold for https://github.com/scylladb/scylla-operator/pull/1851
@rzetelskik: The /test
command needs one or more targets.
The following commands are available to trigger required jobs:
-
/test build
-
/test docs
-
/test e2e-gke-parallel
-
/test e2e-gke-parallel-clusterip
-
/test e2e-gke-release-script-latest
-
/test e2e-gke-serial
-
/test helm-build
-
/test images
-
/test unit
-
/test verify
-
/test verify-deps
-
/test verify-vendorability
The following commands are available to trigger optional jobs:
-
/test docs-multiversion
Use /test all
to run the following jobs that were automatically triggered:
-
pull-scylla-operator-master-build
-
pull-scylla-operator-master-docs
-
pull-scylla-operator-master-docs-multiversion
-
pull-scylla-operator-master-e2e-gke-parallel
-
pull-scylla-operator-master-e2e-gke-parallel-clusterip
-
pull-scylla-operator-master-e2e-gke-serial
-
pull-scylla-operator-master-helm-build
-
pull-scylla-operator-master-images
-
pull-scylla-operator-master-unit
-
pull-scylla-operator-master-verify
-
pull-scylla-operator-master-verify-deps
-
pull-scylla-operator-master-verify-vendorability
In response to this:
/test
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test all
/hold cancel
/cc zimnx tnozicka
@rzetelskik: The following tests failed, say
/retest
to rerun all failed tests or/retest-required
to rerun all mandatory failed tests:Test name Commit Details Required Rerun command ci/prow/e2e-gke-release-script-latest 0df1c7c link true
/test e2e-gke-release-script-latest
ci/prow/e2e-gke-parallel-clusterip 0c00701 link true/test e2e-gke-parallel-clusterip
Full PR test history. Your PR dashboard.
https://github.com/scylladb/scylla-operator/issues/1028#issuecomment-2072828303 shard awareness flake, unrelated to this PR so I'm not investigating /retest
/test images e2e-gke-parallel e2e-gke-parallel-clusterip
/test images e2e-gke-parallel e2e-gke-parallel-clusterip
e2e namespace wasn't collected
/test images e2e-gke-parallel e2e-gke-parallel-clusterip
@rzetelskik: The following tests failed, say
/retest
to rerun all failed tests or/retest-required
to rerun all mandatory failed tests:Test name Commit Details Required Rerun command ci/prow/e2e-gke-release-script-latest 0df1c7c link true
/test e2e-gke-release-script-latest
ci/prow/e2e-gke-serial 3cf516c link true/test e2e-gke-serial
ci/prow/e2e-gke-parallel 3cf516c link true/test e2e-gke-parallel
ci/prow/e2e-gke-parallel-clusterip 3cf516c link true/test e2e-gke-parallel-clusterip
Full PR test history. Your PR dashboard.
Cluster provisioning failed, I looked into CI but didn't see anything off. Spinning it again to investigate. /test e2e-gke-parallel e2e-gke-parallel-clusterip
@zimnx thanks for the review, I believe I addressed all of your comments now
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: rzetelskik, zimnx
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [zimnx]
Approvers can indicate their approval by writing /approve
in a comment
Approvers can cancel approval by writing /approve cancel
in a comment
@rzetelskik: The following test failed, say /retest
to rerun all failed tests or /retest-required
to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
ci/prow/e2e-gke-release-script-latest | 0df1c7c093ded877572bf5f67f8807bb8d417164 | link | true | /test e2e-gke-release-script-latest |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@rzetelskik: The following tests failed, say
/retest
to rerun all failed tests or/retest-required
to rerun all mandatory failed tests:Test name Commit Details Required Rerun command ci/prow/e2e-gke-release-script-latest 0df1c7c link true
/test e2e-gke-release-script-latest
ci/prow/e2e-gke-parallel 0e85808 link true/test e2e-gke-parallel
Full PR test history. Your PR dashboard.
• [FAILED] [842.163 seconds]
ScyllaCluster [It] nodes are cleaned up after horizontal scaling
github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_cleanup.go:29
...
[FAILED] in [It] - github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_cleanup.go:150 @ 04/29/24 11:33:46.082
unrelated, not investigating https://github.com/scylladb/scylla-operator/issues/1525#issuecomment-2083204706
/retest