kubeflow-manifests

cert-manager webhook fails with reason `FailedDiscoveryCheck`

Open pthalasta opened this issue 2 years ago • 9 comments

cert-manager, installed with Kubeflow, fails with the following error:

status:
  conditions:
  - lastTransitionTime: "2022-06-30T19:18:55Z"
    message: 'failing or missing response from https://<ip>:10251/apis/webhook.cert-manager.io/v1beta1:
      bad status from https://<ip>:10251/apis/webhook.cert-manager.io/v1beta1:
      404'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available
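
For readers triaging the same condition, one detail of the message is worth separating out: a `404` is an HTTP response, so something answered at that address and port, which is a different symptom from a timeout or a refused connection. The following Python sketch is not part of the original report. It assumes the message has the `bad status from <url>: <code>` shape shown above; the function name and the sample IP address are hypothetical.

```python
import re

# Matches the tail of the message shown above: "bad status from <url>: <code>"
BAD_STATUS = re.compile(r"bad status from (?P<url>\S+): (?P<code>\d{3})\s*$")


def parse_bad_status(message):
    """Return (url, http_status) if the message reports an HTTP status, else None."""
    match = BAD_STATUS.search(message)
    if match is None:
        return None
    return match.group("url"), int(match.group("code"))


# 10.0.0.12 is a placeholder for the <ip> redacted in the report.
example = (
    "failing or missing response from "
    "https://10.0.0.12:10251/apis/webhook.cert-manager.io/v1beta1: "
    "bad status from "
    "https://10.0.0.12:10251/apis/webhook.cert-manager.io/v1beta1: 404"
)
print(parse_bad_status(example))
```

A `None` result only means the message did not end in an HTTP status; it does not by itself identify the cause.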

Pods in the cert-manager namespace:

kc get pods -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-66b646d76-8bz6r               1/1     Running   0          99d
cert-manager-cainjector-59dc9659c7-7r66d   1/1     Running   0          99d
cert-manager-webhook-7fbcc4bfcb-6kgm6      1/1     Running   0          99d

Webhook Deployment YAML from the Kubeflow manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cert-manager-webhook
  namespace: "cert-manager"
  labels:
    app: webhook
    app.kubernetes.io/name: webhook
    app.kubernetes.io/instance: cert-manager
    app.kubernetes.io/component: "webhook"
    app.kubernetes.io/version: "v1.5.0"
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: webhook
      app.kubernetes.io/instance: cert-manager
      app.kubernetes.io/component: "webhook"
  template:
    metadata:
      labels:
        app: webhook
        app.kubernetes.io/name: webhook
        app.kubernetes.io/instance: cert-manager
        app.kubernetes.io/component: "webhook"
        app.kubernetes.io/version: "v1.5.0"
    spec:
      serviceAccountName: cert-manager-webhook
      securityContext:
        runAsNonRoot: true
      hostNetwork: true
      containers:
        - name: cert-manager
          image: "quay.io/jetstack/cert-manager-webhook:v1.5.0"
          imagePullPolicy: IfNotPresent
          args:
          - --v=2
          - --secure-port=10251
          - --dynamic-serving-ca-secret-namespace=$(POD_NAMESPACE)
          - --dynamic-serving-ca-secret-name=cert-manager-webhook-ca
          - --dynamic-serving-dns-names=cert-manager-webhook,cert-manager-webhook.cert-manager,cert-manager-webhook.cert-manager.svc
          ports:
          - name: https
            protocol: TCP
            containerPort: 10251
          livenessProbe:
            httpGet:
              path: /livez
              port: 6080
              scheme: HTTP
            initialDelaySeconds: 60
            periodSeconds: 10
            timeoutSeconds: 1
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /healthz
              port: 6080
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 1
            successThreshold: 1
            failureThreshold: 3
          env:
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          resources:
            {}

Expected behavior: the webhook should pass the discovery check.

Environment

  • Kubernetes version: 1.20
  • Using EKS (yes/no), if so version? yes, 1.20
  • Kubeflow version: 1.5
  • AWS build number: installed through the Kubeflow manifests from the kubeflow repo with kustomize
  • AWS service targeted (S3, RDS, etc.)

While looking for a solution online, I found others with a similar issue on GKE who resolved it with firewall changes: https://github.com/cert-manager/cert-manager/issues/2109#issuecomment-535901422. I'm not sure whether this is required on EKS, since I followed the installation instructions from Kubeflow rather than using awslabs's Kubeflow manifests. Any help in resolving this would be appreciated!

pthalasta avatar Oct 07 '22 19:10 pthalasta

Did you try installing recently? I see that the pods are 99 days old. Are you trying to update from a previous version? I suggest you try https://github.com/awslabs/kubeflow-manifests/releases/tag/v1.5.1-aws-b1.0.2, as it contains some bug fixes. If you are updating, I would delete the manifests and then re-apply them.

ryansteakley avatar Oct 07 '22 20:10 ryansteakley

@ryansteakley No, I'm not trying to update it, and yes, it was installed 99 days ago. I didn't pay much attention to the status of the webhook back then, but I assume it has been in this state ever since. Any idea whether this could be related to the EKS firewall, as it was on GKE?

pthalasta avatar Oct 07 '22 20:10 pthalasta

Would you consider reinstalling with the latest release tag of v1.5.1? Many improvements have been made since v1.5.0. I haven't personally encountered this issue before, so I'm not sure whether it is related to the EKS firewall. @rrrkharse or @surajkota, have you run into this before?

ryansteakley avatar Oct 07 '22 21:10 ryansteakley

Yes, I'll reinstall it next week, but I was wondering whether there is a fix that would let me resolve this without performing a full reinstall.

pthalasta avatar Oct 07 '22 21:10 pthalasta

Do you have any steps I can take to reproduce this error? How did you originally install the manifests? Are you using a private VPC?

ryansteakley avatar Oct 07 '22 21:10 ryansteakley

It was based on the instructions provided here: https://github.com/kubeflow/manifests/tree/v1.5-branch#install-individual-components

The EKS nodes are in a private VPC.

pthalasta avatar Oct 07 '22 21:10 pthalasta

@pthalasta Can you also paste the complete spec, status, and logs of the webhook pods? In which pod do you see the error you pasted above?

Since you have a private VPC, I suspect this is an issue with the security group settings in your cluster. Check which ports are allowed from the cluster security group to the node group security group.
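
The check suggested here can be sketched in Python. This is an illustration only, not a command from the thread: it assumes the node group security group's ingress rules have been exported with `aws ec2 describe-security-groups` (the `IpPermissions` list), and it only looks at rules whose source is another security group, ignoring CIDR-based rules. The function name and the security group IDs are hypothetical; port 10251 is the `--secure-port` from the webhook Deployment above, which is the port exposed on the node because the pod uses `hostNetwork: true`.

```python
def allows_tcp_port(ip_permissions, source_group_id, port):
    """Return True if an ingress rule admits TCP `port` from `source_group_id`.

    `ip_permissions` is the IpPermissions list of one security group.
    Only security-group sources are considered; CIDR ranges are ignored.
    """
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            port_ok = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            port_ok = rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)
        else:
            continue  # udp, icmp, etc. cannot admit a TCP connection
        if not port_ok:
            continue
        sources = {pair.get("GroupId") for pair in rule.get("UserIdGroupPairs", [])}
        if source_group_id in sources:
            return True
    return False


# Hypothetical ingress rules for a node group security group.
NODE_SG_RULES = [
    {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
     "UserIdGroupPairs": [{"GroupId": "sg-cluster-example"}]},
    {"IpProtocol": "tcp", "FromPort": 10250, "ToPort": 10250,
     "UserIdGroupPairs": [{"GroupId": "sg-cluster-example"}]},
    {"IpProtocol": "-1",
     "UserIdGroupPairs": [{"GroupId": "sg-other-example"}]},
]

for checked_port in (443, 10250, 10251):
    allowed = allows_tcp_port(NODE_SG_RULES, "sg-cluster-example", checked_port)
    print(checked_port, "allowed" if allowed else "blocked")
```

With these made-up rules, the webhook port is not reachable from the cluster security group even though the kubelet port next to it is, which is the kind of gap this check is meant to surface.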

surajkota avatar Oct 07 '22 21:10 surajkota

@surajkota The error I see is in the Kubernetes API server's APIService status for cert-manager's webhook:

$ kc get apiservice v1beta1.webhook.cert-manager.io -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    cert-manager.io/inject-ca-from-secret: cert-manager/cert-manager-webhook-ca
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apiregistration.k8s.io/v1beta1","kind":"APIService","metadata":{"annotations":{"cert-manager.io/inject-ca-from-secret":"cert-manager/cert-manager-webhook-ca"},"labels":{"app":"webhook"},"name":"v1beta1.webhook.cert-manager.io"},"spec":{"group":"webhook.cert-manager.io","groupPriorityMinimum":1000,"service":{"name":"cert-manager-webhook","namespace":"cert-manager"},"version":"v1beta1","versionPriority":15}}
  creationTimestamp: "2022-06-30T19:18:55Z"
  labels:
    app: webhook
  name: v1beta1.webhook.cert-manager.io
  resourceVersion: "323409080"
  uid: <uuid>
spec:
  caBundle: <cert>
  group: webhook.cert-manager.io
  groupPriorityMinimum: 1000
  service:
    name: cert-manager-webhook
    namespace: cert-manager
    port: 443
  version: v1beta1
  versionPriority: 15
status:
  conditions:
  - lastTransitionTime: "2022-06-30T19:18:55Z"
    message: 'failing or missing response from https://<ip>:10251/apis/webhook.cert-manager.io/v1beta1:
      bad status from https://<ip>:10251/apis/webhook.cert-manager.io/v1beta1:
      404'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available
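
To see whether other APIServices share this condition, the same status fields can be scanned across the whole cluster. The Python sketch below is an editorial illustration, not something run in this thread. It assumes input shaped like the output of `kubectl get apiservice -o json` (a list object with an `items` array); the function name and the sample data are hypothetical, with the failing entry modeled on the status above.

```python
def unavailable_apiservices(api_service_list):
    """Return (name, reason, message) for each APIService that is not Available."""
    problems = []
    for item in api_service_list.get("items", []):
        name = item["metadata"]["name"]
        conditions = item.get("status", {}).get("conditions", [])
        available = next(
            (c for c in conditions if c.get("type") == "Available"), None
        )
        if available is None:
            # No Available condition reported at all; flag it for inspection.
            problems.append((name, "NoAvailableCondition", ""))
        elif available.get("status") != "True":
            problems.append(
                (name, available.get("reason", ""), available.get("message", ""))
            )
    return problems


# Illustrative sample: one healthy entry and one modeled on this thread.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "v1.apps"},
            "status": {"conditions": [
                {"type": "Available", "status": "True", "reason": "Local"},
            ]},
        },
        {
            "metadata": {"name": "v1beta1.webhook.cert-manager.io"},
            "status": {"conditions": [
                {"type": "Available", "status": "False",
                 "reason": "FailedDiscoveryCheck",
                 "message": "bad status from https://<ip>:10251/"
                            "apis/webhook.cert-manager.io/v1beta1: 404"},
            ]},
        },
    ]
}

for name, reason, message in unavailable_apiservices(SAMPLE):
    print(f"{name}: {reason}: {message}")
```

Against a real cluster, the parsed JSON from `kubectl get apiservice -o json` would be passed in place of `SAMPLE`, for example via `json.load(sys.stdin)`.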

pthalasta avatar Oct 07 '22 22:10 pthalasta

@surajkota We have all ports and protocols allowed from the EKS cluster SG to the instance SG. I'm not sure whether there are any other checks that would allow us to debug this further.

pthalasta avatar Oct 11 '22 17:10 pthalasta

@pthalasta any update from your side on this? Were you able to deploy 1.6.1 successfully?

surajkota avatar Nov 02 '22 22:11 surajkota

@surajkota We are working on integrating the Terraform scripts with our infrastructure scripts. We should have more updates by the end of next week.

pthalasta avatar Nov 02 '22 22:11 pthalasta

@surajkota To confirm, do the Terraform scripts provided within the repo deploy a new EKS cluster even if we already have a cluster? Can the deployment of EKS and other resources, such as the VPC, be made optional by setting flags within the Terraform scripts?

pthalasta avatar Nov 02 '22 22:11 pthalasta

@surajkota Closing the issue, as this has been resolved with the AWS-based manifests.

pthalasta avatar Dec 16 '22 01:12 pthalasta