
[bug][v1.23]: cluster_ca_cert and cluster_ca_key always trigger cluster updater

Open · ddelange opened this issue 2 years ago • 31 comments

Hi again!

(screenshot: terraform plan output showing diffs on cluster_ca_cert and cluster_ca_key)

I just tried out v1.23, and spotted this one triggering the cluster updater. I haven't provided the secrets block in my cluster config.

ddelange avatar Mar 31 '22 11:03 ddelange

Thanks for reporting. Does it happen every time, or only when you upgrade from a previous version of the provider?

eddycharly avatar Mar 31 '22 21:03 eddycharly

I would say every time. After a couple of applies like that, I checked again this morning and hit the same scenario on a terraform apply: revision = 13 -> 14 on kops_cluster.cluster

Maybe useful info: I completely destroyed the 1.22 cluster and started the 1.23 one in a different AZ as part of the upgrade, so the chances of 1.22 remnants are small (although I didn't explicitly check whether the tfstate was empty/deleted before spinning up 1.23)

ddelange avatar Apr 01 '22 06:04 ddelange

Just to help narrow down what might be the problem: I don't see this on my own prod cluster, where I have a docker config defined, nor on a fresh test cluster where I have no docker config.

argoyle avatar Apr 01 '22 06:04 argoyle

@ddelange did you see the same problem with 1.22, or only with 1.23?

eddycharly avatar Apr 01 '22 06:04 eddycharly

I spent some time trying to reproduce the issue, but I didn't succeed. Can you share your tf config?

eddycharly avatar Apr 01 '22 07:04 eddycharly

Hmm, interesting! Thanks for checking, guys. This was not the case before upgrading to 1.23. The only diff to cluster.tf, apart from the version bump, was adding the containerd.config_override block. Adding a (basic auth) private docker registry to the cluster was the whole reason for upgrading to 1.23: since 1.22 caps containerd at version 1.4, I needed 1.5+ so that we can make use of the containerd registry mirror auth functionality.
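
For reference, a minimal sketch of that block (registry host and credentials are placeholders, not our real values) would look roughly like:

  containerd {
    # containerd >= 1.5: basic-auth credentials for a private registry
    config_override = <<EOF
version = 2
[plugins."io.containerd.grpc.v1.cri".registry.configs."registry.example.com".auth]
  username = "pull-user"
  password = "REDACTED"
EOF
  }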

Here's an excerpt of our cluster.tf

ddelange avatar Apr 01 '22 07:04 ddelange

This looks similar to what I tested. You used the provider v1.23.0-alpha.1 with k8s 1.23, right?

eddycharly avatar Apr 01 '22 08:04 eddycharly

Correct, 1.23.0-alpha.1 and

variable "kubernetes_version" {
  type        = string
  description = "Kubernetes version to use for the cluster. MAJOR.MINOR here should not be newer than the kops provider version in versions.tf ref https://kops.sigs.k8s.io/welcome/releases/"
  default     = "v1.23.5"
}
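
which feeds into the cluster resource roughly like this (a sketch, assuming the usual wiring):

resource "kops_cluster" "cluster" {
  // ...
  kubernetes_version = var.kubernetes_version
  // ...
}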

I'm now trying to isolate the bug to the config_override block (applying without it now, and seeing whether a subsequent apply still triggers the updater)

ddelange avatar Apr 01 '22 08:04 ddelange

I don't think your issue is related to the config override block, but thanks for trying it out.

eddycharly avatar Apr 01 '22 08:04 eddycharly

Looks like it's unrelated indeed. Behaviour stays the same 🤔

ddelange avatar Apr 01 '22 08:04 ddelange

It could be related to permissions on the S3 bucket. Can you check that the files are correctly stored in the bucket (pki/private/kubernetes-ca/keyset.yaml)?

eddycharly avatar Apr 01 '22 08:04 eddycharly

The file exists and looks like a valid manifest. It was created (with r/w perms) by my AWS account yesterday when I created the 1.23 cluster, and it has the same permissions as the rest of the files, e.g. under /instancegroups.

ddelange avatar Apr 01 '22 08:04 ddelange

Hmmm, I'm dry on ideas :-( There was an invalid reference to the k/k package in v1.23.0-alpha.1 but I don't think it would cause such an issue. I can try to cut v1.23.0-alpha.2 though.

eddycharly avatar Apr 01 '22 08:04 eddycharly

Many thanks for taking the time!

I'll just close it with that for now, and if it disappears over time (or I miraculously find a fix) I'll report back here.

Have a nice weekend 💥

ddelange avatar Apr 01 '22 08:04 ddelange

hotfix seems to be holding steady :)

  # stopgap inside resource "kops_cluster" "cluster": ignore drift on the generated secrets
  lifecycle {
    ignore_changes = [
      secrets,
    ]
  }

ddelange avatar Apr 02 '22 16:04 ddelange

It’s a hack, you shouldn’t need this. If possible I would empty the bucket and try from scratch completely.

eddycharly avatar Apr 02 '22 19:04 eddycharly

I just released v1.23.0-alpha.2 but I doubt it will fix your issue.

eddycharly avatar Apr 02 '22 20:04 eddycharly

Thanks for the ping!

ddelange avatar Apr 02 '22 20:04 ddelange

> If possible I would empty the bucket and try from scratch completely.

Seems like it did not solve the issue 🤔

ddelange avatar Apr 03 '22 20:04 ddelange

@argoyle and I had this issue as well on some clusters. It seems to work if we always define secrets in the config, like:

  secrets {
    docker_config   = "{}"
  }

peter-svensson avatar Jun 02 '22 10:06 peter-svensson

Interesting, I am going to reopen and investigate the issue again.

eddycharly avatar Jun 02 '22 10:06 eddycharly

I was able to somewhat reproduce the issue:

  • create a cluster with ca cert/private key (see the sketch after this list)
  • update the cluster, removing ca cert/private key
  • next applies always trigger an update
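
Step 1's secrets block would look roughly like this sketch (the file paths are placeholders):

  secrets {
    # hypothetical locations of the CA material
    cluster_ca_cert = file("${path.module}/ca.crt")
    cluster_ca_key  = file("${path.module}/ca.key")
  }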

Does that look like the scenario you are hitting?

eddycharly avatar Jun 02 '22 13:06 eddycharly

fwiw, I never used the secrets block

ddelange avatar Jun 02 '22 14:06 ddelange

I have a good suspect in mind but no access to an AWS account, making it difficult to track it down.

CAs are stored in pki/private/kubernetes-ca/keyset.yaml (maybe in pki/private/ca/keyset.yaml with older kOps versions).

When one doesn't provide a CA cert/key, kOps will create one and I'm not sure if it is stored in the same place (I suppose it is but can't confirm).

From what I understand, it's not possible to remove a CA that is being used so it is probably related.

What could happen:

  • You create a cluster without specifying the CA
  • kOps generates a CA automatically and stores it
  • Finally the CA ends up in the state
  • When you run a plan, if you don't specify the secrets block, terraform tries to delete the CA
  • Deleting a CA is not possible (we can eventually rotate it but not delete a CA being used)
  • The CA will constantly be added to the state and there will be a permanent diff

Now, if you specify the secrets block (you can leave it empty) it looks like terraform is not complaining anymore (because CA cert/key are marked as computed).

resource "kops_cluster" "cluster" {
  // ...
  secrets {}
  // ...
}

It would be nice if someone could confirm this.

eddycharly avatar Jun 02 '22 15:06 eddycharly

@peter-svensson @argoyle I suspect you can use an empty secrets block, no need to use a dummy docker_config as mentioned in my comment above.

eddycharly avatar Jun 02 '22 15:06 eddycharly

--- a/k8s/kops/cluster.tf
+++ b/k8s/kops/cluster.tf
@@ -217,12 +217,7 @@ EOF
     }
   }

-  lifecycle {
-    ignore_changes = [
-      secrets,
-    ]
-  }
+  secrets {}
 }

did indeed not trigger the updater!

ddelange avatar Jun 02 '22 17:06 ddelange

Seems to do the trick with our setup as well 🎉

argoyle avatar Jun 02 '22 17:06 argoyle

we also have a

  authorization {
    always_allow {}
  }

block for the same reason btw :)

ddelange avatar Jun 02 '22 17:06 ddelange

Great, thanks for testing it! I will consider making secrets required and documenting that it can be left empty.

eddycharly avatar Jun 02 '22 17:06 eddycharly

@ddelange why not RBAC?

  authorization {
    rbac {}
  }

eddycharly avatar Jun 13 '22 20:06 eddycharly