scylla-operator icon indicating copy to clipboard operation
scylla-operator copied to clipboard

[WIP] Fix manager task reconciliation and status synchronisation

Open rzetelskik opened this issue 11 months ago • 4 comments

Description of your changes: As described in https://github.com/scylladb/scylla-operator/issues/1752, there's currently an issue with manager tasks being deleted and recreated when the manager controller reconciles an older version of the object. The root cause of the issue is that the controller depends on the object's status to save and retrieve the task's ID, which it uses for identifying the task. If an older generation of the object is reconciled, and it's missing the task's ID from the status, it's going to delete the task and recreate it, even if the task is present in the manager's state. The main change in this PR is to make the controller use tasks' names for identity, instead of their ID, which can't be retrieved if it's missing from the status.

To make it possible, the PR modifies the validation logic to allow for non-unique names across tasks of different types. This is aligned with Scylla Manager, which doesn't impose such restrictions, i.e. it is enough that the name is unique across tasks of equal type. This allows us to use tasks' name for identity, and in turn, reconciling the tasks without retrieving their IDs.

Futhermore, currently the tasks' statuses inline tasks' specs as defined in ScyllaCluster spec. Due to the spec having some default values, this currently causes the statuses to not be aligned with manager's state, and even displaying default values on task scheduling error. This PR fixes it by:

  1. splitting tasks' statuses from specs in API definition to avoid the default values being displayed and making the fields optional
  2. propagating tasks' statuses as they're pertained in manager state, with an exception for client side errors - in this case, only the task name and error are saved in the object's status, instead of copying over the definition from spec

On top of that, the existing e2e covering repair task and an invalid backup task integration is split into two separate tests: one testing the repair task integration, and the other testing error propagation for both repair and backup tasks.

As agreed internally, this PR tries to achieve while minimising the amount of refactoring of the existing codebase. Therefore the general flow of the manager controller is maintained as is.

Prerequisite for https://github.com/scylladb/scylla-operator/issues/1671.

Which issue is resolved by this Pull Request: Resolves #1752

/kind bug /priority critical-urgent /cc

rzetelskik avatar Mar 19 '24 20:03 rzetelskik

@rzetelskik: GitHub didn't allow me to request PR reviews from the following users: rzetelskik.

Note that only scylladb members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Description of your changes:

Which issue is resolved by this Pull Request: Resolves #1752

/kind bug /priority critical-urgent /cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/hold for https://github.com/scylladb/scylla-operator/pull/1851

rzetelskik avatar Mar 19 '24 23:03 rzetelskik

@rzetelskik: The /test command needs one or more targets. The following commands are available to trigger required jobs:

  • /test build
  • /test docs
  • /test e2e-gke-parallel
  • /test e2e-gke-parallel-clusterip
  • /test e2e-gke-release-script-latest
  • /test e2e-gke-serial
  • /test helm-build
  • /test images
  • /test unit
  • /test verify
  • /test verify-deps
  • /test verify-vendorability

The following commands are available to trigger optional jobs:

  • /test docs-multiversion

Use /test all to run the following jobs that were automatically triggered:

  • pull-scylla-operator-master-build
  • pull-scylla-operator-master-docs
  • pull-scylla-operator-master-docs-multiversion
  • pull-scylla-operator-master-e2e-gke-parallel
  • pull-scylla-operator-master-e2e-gke-parallel-clusterip
  • pull-scylla-operator-master-e2e-gke-serial
  • pull-scylla-operator-master-helm-build
  • pull-scylla-operator-master-images
  • pull-scylla-operator-master-unit
  • pull-scylla-operator-master-verify
  • pull-scylla-operator-master-verify-deps
  • pull-scylla-operator-master-verify-vendorability

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/test all

rzetelskik avatar Apr 23 '24 11:04 rzetelskik

/hold cancel

rzetelskik avatar Apr 23 '24 12:04 rzetelskik

/cc zimnx tnozicka

rzetelskik avatar Apr 23 '24 13:04 rzetelskik

@rzetelskik: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command ci/prow/e2e-gke-release-script-latest 0df1c7c link true /test e2e-gke-release-script-latest ci/prow/e2e-gke-parallel-clusterip 0c00701 link true /test e2e-gke-parallel-clusterip Full PR test history. Your PR dashboard.

https://github.com/scylladb/scylla-operator/issues/1028#issuecomment-2072828303 shard awareness flake, unrelated to this PR so I'm not investigating /retest

rzetelskik avatar Apr 23 '24 16:04 rzetelskik

/test images e2e-gke-parallel e2e-gke-parallel-clusterip

rzetelskik avatar Apr 25 '24 12:04 rzetelskik

/test images e2e-gke-parallel e2e-gke-parallel-clusterip

e2e namespace wasn't collected

/test images e2e-gke-parallel e2e-gke-parallel-clusterip

rzetelskik avatar Apr 25 '24 13:04 rzetelskik

@rzetelskik: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command ci/prow/e2e-gke-release-script-latest 0df1c7c link true /test e2e-gke-release-script-latest ci/prow/e2e-gke-serial 3cf516c link true /test e2e-gke-serial ci/prow/e2e-gke-parallel 3cf516c link true /test e2e-gke-parallel ci/prow/e2e-gke-parallel-clusterip 3cf516c link true /test e2e-gke-parallel-clusterip Full PR test history. Your PR dashboard.

Cluster provisioning failed, I looked into CI but didn't see anything off. Spinning it again to investigate. /test e2e-gke-parallel e2e-gke-parallel-clusterip

rzetelskik avatar Apr 25 '24 13:04 rzetelskik

@zimnx thanks for the review, I believe I addressed all of your comments now

rzetelskik avatar Apr 29 '24 11:04 rzetelskik

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rzetelskik, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gke-release-script-latest 0df1c7c093ded877572bf5f67f8807bb8d417164 link true /test e2e-gke-release-script-latest

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@rzetelskik: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command ci/prow/e2e-gke-release-script-latest 0df1c7c link true /test e2e-gke-release-script-latest ci/prow/e2e-gke-parallel 0e85808 link true /test e2e-gke-parallel Full PR test history. Your PR dashboard.

• [FAILED] [842.163 seconds]
ScyllaCluster [It] nodes are cleaned up after horizontal scaling
github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_cleanup.go:29
...
  [FAILED] in [It] - github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_cleanup.go:150 @ 04/29/24 11:33:46.082

unrelated, not investigating https://github.com/scylladb/scylla-operator/issues/1525#issuecomment-2083204706

/retest

rzetelskik avatar Apr 29 '24 16:04 rzetelskik