kubeflow-manifests
cert-manager webhook fails with reason `FailedDiscoveryCheck`
cert-manager installed with Kubeflow fails with the following error:
```yaml
status:
  conditions:
  - lastTransitionTime: "2022-06-30T19:18:55Z"
    message: 'failing or missing response from https://<ip>:10251/apis/webhook.cert-manager.io/v1beta1:
      bad status from https://<ip>:10251/apis/webhook.cert-manager.io/v1beta1:
      404'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available
```
Pods in the `cert-manager` namespace:
```
$ kc get pods -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-66b646d76-8bz6r               1/1     Running   0          99d
cert-manager-cainjector-59dc9659c7-7r66d   1/1     Running   0          99d
cert-manager-webhook-7fbcc4bfcb-6kgm6      1/1     Running   0          99d
```
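To gather more context before digging into the manifests, the webhook pod, its logs, and its Service endpoints can be inspected. A sketch, assuming `kubectl` (aliased above as `kc`) and the label selectors from the deployment below:

```shell
# Inspect the webhook pod for restarts, probe failures, and recent events
kubectl describe pod -n cert-manager -l app.kubernetes.io/name=webhook

# Tail the webhook's own logs for TLS or serving errors
kubectl logs -n cert-manager deploy/cert-manager-webhook --tail=100

# Confirm the cert-manager-webhook Service actually has endpoints behind it
kubectl get endpoints cert-manager-webhook -n cert-manager -o wide
```

These commands require a live cluster, so outputs will vary; an empty endpoints list or repeated probe failures in the events would narrow the problem down considerably.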
Webhook deployment YAML from the Kubeflow manifest:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cert-manager-webhook
  namespace: "cert-manager"
  labels:
    app: webhook
    app.kubernetes.io/name: webhook
    app.kubernetes.io/instance: cert-manager
    app.kubernetes.io/component: "webhook"
    app.kubernetes.io/version: "v1.5.0"
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: webhook
      app.kubernetes.io/instance: cert-manager
      app.kubernetes.io/component: "webhook"
  template:
    metadata:
      labels:
        app: webhook
        app.kubernetes.io/name: webhook
        app.kubernetes.io/instance: cert-manager
        app.kubernetes.io/component: "webhook"
        app.kubernetes.io/version: "v1.5.0"
    spec:
      serviceAccountName: cert-manager-webhook
      securityContext:
        runAsNonRoot: true
      hostNetwork: true
      containers:
      - name: cert-manager
        image: "quay.io/jetstack/cert-manager-webhook:v1.5.0"
        imagePullPolicy: IfNotPresent
        args:
        - --v=2
        - --secure-port=10251
        - --dynamic-serving-ca-secret-namespace=$(POD_NAMESPACE)
        - --dynamic-serving-ca-secret-name=cert-manager-webhook-ca
        - --dynamic-serving-dns-names=cert-manager-webhook,cert-manager-webhook.cert-manager,cert-manager-webhook.cert-manager.svc
        ports:
        - name: https
          protocol: TCP
          containerPort: 10251
        livenessProbe:
          httpGet:
            path: /livez
            port: 6080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 1
          successThreshold: 1
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /healthz
            port: 6080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 1
          successThreshold: 1
          failureThreshold: 3
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        resources:
          {}
```
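Note that this deployment sets `hostNetwork: true` with `--secure-port=10251`, so the webhook binds port 10251 on the node itself rather than only inside a pod network namespace. A 404 (rather than a timeout) from that port could mean the API server reached the node but something other than the expected webhook answered. A hedged sketch to check from an EKS node (run on the node, e.g. via SSM; this is a diagnostic assumption, not a fix from the thread):

```shell
# See which process owns port 10251 on the node; with hostNetwork: true
# it should be the cert-manager webhook binary.
sudo ss -ltnp '( sport = :10251 )'

# Probe the same discovery path the API server uses. The webhook serves a
# self-signed/dynamic cert, hence -k to skip verification.
curl -ks https://localhost:10251/apis/webhook.cert-manager.io/v1beta1 | head
```

If another process answers on 10251, or the path returns 404 locally too, the problem is on the node rather than in the network path from the control plane.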
Expected behavior: the webhook should pass the discovery check.
Environment
- Kubernetes version: 1.20
- Using EKS (yes/no), if so version? Yes, 1.20
- Kubeflow version: 1.5
- AWS build number: installed through the Kubeflow manifests from the kubeflow repo with kustomize
- AWS service targeted (S3, RDS, etc.)
While looking for a solution online, I found others hitting a similar issue on GKE and resolving it with firewall changes (https://github.com/cert-manager/cert-manager/issues/2109#issuecomment-535901422). I'm not sure whether this is required on EKS, since the installation instructions I followed were from the kubeflow repo rather than awslabs's kubeflow-manifests. Any help in resolving this would be really appreciated!
Did you try installing recently? I see that the pods are 99 days old. Are you trying to update from a previous version? I suggest you try https://github.com/awslabs/kubeflow-manifests/releases/tag/v1.5.1-aws-b1.0.2, as it contains some bug fixes. If that is the case, I would delete the manifests and then re-apply them.
@ryansteakley no, I'm not trying to update it, and yes, it was installed 99 days ago. I didn't pay much attention to the status of the webhook back then, but I assume it has been in this state since then. Any idea whether this could be related to an EKS firewall issue like the GKE one?
Would you consider reinstalling with the latest release tag of v1.5.1? Many improvements have been made since v1.5.0. I haven't personally encountered this issue before, so I'm not sure whether it is related to the EKS firewall. @rrrkharse or @surajkota, have you run into this before?
Yes, I'll reinstall it next week, but I was wondering whether there is a fix that would avoid performing a full reinstall.
Do you have any steps I can take to reproduce this error? How did you originally install the manifests, and are you using a private VPC?
It was based on the instructions provided here: https://github.com/kubeflow/manifests/tree/v1.5-branch#install-individual-components. The EKS nodes are in a private VPC.
@pthalasta can you also paste the complete spec, status, and logs of the webhook pods? In which pod do you see the error you pasted above?
Since you have a private VPC, I suspect this is an issue with the security group settings in your cluster. Check which ports are allowed from the cluster security group to the node group security group.
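A quick way to audit this from the AWS CLI (a sketch; the cluster name `my-cluster` and the security group ID are placeholders you would look up for your own cluster):

```shell
# Find the cluster security group attached to the EKS control plane ENIs
aws eks describe-cluster --name my-cluster \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text

# List ingress rules on the node group security group and verify the
# webhook's port (10251 in this deployment) is allowed from the cluster SG
aws ec2 describe-security-group-rules \
  --filters Name=group-id,Values=sg-0123456789abcdef0 \
  --query 'SecurityGroupRules[?IsEgress==`false`]'
```

Because the webhook uses `hostNetwork`, the API server connects to the node IP on port 10251 directly, so that port specifically must be open from the control plane to the nodes.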
@surajkota the error I see is from the output of the Kubernetes API server for cert-manager's APIService:
```
$ kc get apiservice v1beta1.webhook.cert-manager.io -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    cert-manager.io/inject-ca-from-secret: cert-manager/cert-manager-webhook-ca
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apiregistration.k8s.io/v1beta1","kind":"APIService","metadata":{"annotations":{"cert-manager.io/inject-ca-from-secret":"cert-manager/cert-manager-webhook-ca"},"labels":{"app":"webhook"},"name":"v1beta1.webhook.cert-manager.io"},"spec":{"group":"webhook.cert-manager.io","groupPriorityMinimum":1000,"service":{"name":"cert-manager-webhook","namespace":"cert-manager"},"version":"v1beta1","versionPriority":15}}
  creationTimestamp: "2022-06-30T19:18:55Z"
  labels:
    app: webhook
  name: v1beta1.webhook.cert-manager.io
  resourceVersion: "323409080"
  uid: <uuid>
spec:
  caBundle: <cert>
  group: webhook.cert-manager.io
  groupPriorityMinimum: 1000
  service:
    name: cert-manager-webhook
    namespace: cert-manager
    port: 443
  version: v1beta1
  versionPriority: 15
status:
  conditions:
  - lastTransitionTime: "2022-06-30T19:18:55Z"
    message: 'failing or missing response from https://<ip>:10251/apis/webhook.cert-manager.io/v1beta1:
      bad status from https://<ip>:10251/apis/webhook.cert-manager.io/v1beta1:
      404'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available
```
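Since the discovery error is a 404 rather than a timeout, it may help to distinguish a blocked network path from a wrong backend. One hedged way to check from inside the cluster (the throwaway pod name and the `curlimages/curl` image are assumptions, not from this thread):

```shell
# Hit the same discovery path through the Service the APIService points at.
# A connection timeout suggests a firewall/security-group problem; a 404
# suggests the request reaches something other than (or an incompatible
# version of) the cert-manager webhook. -k skips cert verification.
kubectl run tmp-curl --rm -it --restart=Never --image=curlimages/curl -- \
  curl -ksv https://cert-manager-webhook.cert-manager.svc:443/apis/webhook.cert-manager.io/v1beta1
```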
@surajkota we have all ports and protocols allowed from the EKS cluster SG to the instance SG. I'm not sure whether there are any other checks that would help us debug further.
@pthalasta any update from your side on this? Were you able to deploy 1.6.1 successfully?
@surajkota we are working on integrating the Terraform scripts with our infrastructure scripts. I should have more updates by the end of next week.
@surajkota to confirm, do the Terraform scripts provided in the repo deploy a new EKS cluster even if we already have one? Can the deployment of EKS and other resources like the VPC be made optional by setting flags in the Terraform scripts?
@surajkota closing the issue, as this has been resolved with the AWS-based manifests.