
`webhook-cert-patch` fails with `connect: connection refused` - `error getting secret`

Open spkane opened this issue 1 year ago • 9 comments

Describe the bug

When installing Testkube Helm chart version 1.16.17 in a cluster that previously had the chart v1.14.0 installed, we see an error from the webhook-cert-patch job.

W1206 19:53:22.658205       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"err":"Get \"https://192.168.248.1:443/api/v1/namespaces/testkube-system/secrets/webhook-server-cert\": dial tcp 192.168.248.1:443: connect: connection refused","level":"fatal","msg":"error getting secret","source":"k8s/k8s.go:351","time":"2023-12-06T19:53:22Z"}
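For anyone triaging a similar failure, here is a minimal Python sketch that parses the JSON log line above and labels the error. The `classify` helper and its category names are hypothetical, not part of Testkube or kube-webhook-certgen. It relies on one general observation: `connection refused` is reported when the TCP connection to the API server address fails, whereas an RBAC denial normally comes back from the API server as a `forbidden` error. That narrows the search but does not by itself identify the root cause.

```python
import json
import re

# Log line copied verbatim from the report above.
LOG_LINE = r'''{"err":"Get \"https://192.168.248.1:443/api/v1/namespaces/testkube-system/secrets/webhook-server-cert\": dial tcp 192.168.248.1:443: connect: connection refused","level":"fatal","msg":"error getting secret","source":"k8s/k8s.go:351","time":"2023-12-06T19:53:22Z"}'''


def classify(line: str) -> dict:
    """Parse one JSON log line and label the failure (hypothetical helper)."""
    entry = json.loads(line)
    err = entry.get("err", "")
    # Pull the dialed address out of the Go network error text, if present.
    target = re.search(r"dial tcp ([\d.]+):(\d+)", err)
    if "connection refused" in err:
        kind = "network"  # TCP connect failed; no HTTP response was received
    elif "forbidden" in err.lower():
        kind = "rbac"  # the API server answered and denied the request
    else:
        kind = "unknown"
    return {
        "kind": kind,
        "msg": entry.get("msg"),
        "host": target.group(1) if target else None,
        "port": int(target.group(2)) if target else None,
    }


if __name__ == "__main__":
    print(classify(LOG_LINE))
```

For the line in this report, the helper labels the failure as `network` against `192.168.248.1:443`, so API server reachability from the job's pod may be worth checking alongside the service account permissions.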

To Reproduce

Steps to reproduce the behavior:

  1. Use kustomize build --enable-helm to template out the testkube v1.16.17 helm chart
  2. Then apply the results to the cluster that already has manifests applied from the v1.14.0 helm chart (also templated via kustomize)
  3. Wait for the webhook-cert-patch pod to be created and then error.

Expected behavior

The webhook-cert-patch should complete successfully.

Version / Cluster

  • Which testkube version? 1.16
  • What Kubernetes cluster? AKS
  • What Kubernetes version? 1.26

Additional context

We are setting the jobServiceAccountName helm value, but I do not believe this should impact this non-test-related job.

I did not see this issue when applying the manifests to a brand-new cluster (on my local system), but I'm assuming that this is because the webhook does not need to be patched in this use case.

I am guessing that this issue is because the service account for that job does not have all the rights it needs, but I am not 100% sure.

spkane avatar Dec 06 '23 20:12 spkane

cc/ @manidharanupoju24

spkane avatar Dec 06 '23 20:12 spkane

To add to the above context: the failed webhook-cert-patch job is causing the testtriggers to fail.

manidharanupoju24 avatar Dec 06 '23 21:12 manidharanupoju24

Hey @spkane, v1.1.* is a really old version; we might even have been using cert-manager at that time. It sounds like something is missing, such as RBAC permissions, because the job fails with `error getting secret`.

vsukhin avatar Dec 07 '23 09:12 vsukhin

@vsukhin Sorry, that was a typo (corrected). The initial version is 1.14.0.

spkane avatar Dec 07 '23 15:12 spkane

From what I can tell, this only happens during an upgrade and not a clean install.

spkane avatar Dec 07 '23 15:12 spkane

@ypoplavs @dejanzele any ideas?

vsukhin avatar Dec 07 '23 18:12 vsukhin

Hello @spkane @manidharanupoju24,

We use kube-webhook-certgen for generating a self-signed certificate and patching the CRDs and WebhookConfiguration objects.

It has two steps: generate & patch.

For the patch step, it would require a service account which has the following RBAC - https://github.com/kubeshop/helm-charts/blob/develop/charts/testkube-operator/templates/role.yaml#L404-L447

Can you check whether your service account has all of these permissions?
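One way to do that check offline is to compare the `rules` of the Role bound to the service account against the permissions the patch step needs. The sketch below is a rough illustration only: the `REQUIRED` set is an assumption made up for the example (the authoritative list is the role.yaml linked above), the helper names are hypothetical, and the comparison ignores `apiGroups` and `resourceNames`.

```python
from typing import Dict, List, Set, Tuple

# ASSUMPTION: illustrative permissions only. The authoritative list is the
# role.yaml linked above, which is not reproduced here.
REQUIRED: Set[Tuple[str, str]] = {
    ("secrets", "get"),
    ("secrets", "create"),
    ("validatingwebhookconfigurations", "get"),
    ("validatingwebhookconfigurations", "update"),
}


def granted(rules: List[Dict]) -> Set[Tuple[str, str]]:
    """Expand Role/ClusterRole rules into (resource, verb) pairs."""
    return {
        (resource, verb)
        for rule in rules
        for resource in rule.get("resources", [])
        for verb in rule.get("verbs", [])
    }


def missing(rules: List[Dict], required: Set[Tuple[str, str]] = REQUIRED):
    """Return the required pairs not covered by the rules ('*' is a wildcard).

    Simplification: apiGroups and resourceNames are not taken into account.
    """
    have = granted(rules)

    def allowed(resource: str, verb: str) -> bool:
        return any((r, v) in have for r in (resource, "*") for v in (verb, "*"))

    return sorted(pair for pair in required if not allowed(*pair))


if __name__ == "__main__":
    # Hypothetical Role that only allows reading secrets.
    rules = [{"apiGroups": [""], "resources": ["secrets"], "verbs": ["get"]}]
    for resource, verb in missing(rules):
        print(f"missing: {verb} {resource}")
```

The `rules` list has the same shape as the `rules` field of a Role exported as JSON, so an exported Role can be loaded with `json.load` and passed in directly.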

Kind regards

dejanzele avatar Dec 07 '23 18:12 dejanzele

@dejanzele Yes. I'll try to check the permissions in the next day or so.

Does this job use the jobServiceAccountName Service account that can be set via helm (which we are using for our test jobs)? When I looked at the webhook-cert-patch job last week, I thought that it used another service account that is defined inside the Helm chart, and should therefore have all the permissions that it required.
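For reference, the question of which service account a Job runs under can be answered from the rendered manifest itself. A minimal sketch, assuming the Job has been exported as JSON and loaded into a dict; the function name and the `certgen-sa` account name are hypothetical:

```python
def job_service_account(job: dict) -> str:
    """Return the service account a Job's pods run as.

    Pods that do not set serviceAccountName run as the namespace's
    "default" service account, so that is the fallback here.
    """
    pod_spec = job.get("spec", {}).get("template", {}).get("spec", {})
    return pod_spec.get("serviceAccountName") or "default"


if __name__ == "__main__":
    # Hypothetical, heavily trimmed Job manifest for illustration.
    job = {
        "kind": "Job",
        "spec": {"template": {"spec": {"serviceAccountName": "certgen-sa"}}},
    }
    print(job_service_account(job))
```

Comparing the result with the value of jobServiceAccountName shows whether the Helm setting affects this job at all.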

spkane avatar Dec 11 '23 16:12 spkane

@spkane can you please paste the logs from the webhook jobs (both create & patch jobs)?

dejanzele avatar Dec 12 '23 08:12 dejanzele