cert-provisioner-job failed due to missing pl-cloud-config ConfigMap
Describe the bug
Running px deploy failed with the error FATA[0368] Timed out waiting for cluster ID assignment.
Inspecting the pods showed that the cert-provisioner-job pod was stuck in the CreateContainerConfigError state.
Output from kubectl describe pod cert-provisioner-job-lwm57 -n pl:
Name:           cert-provisioner-job-lwm57
Namespace:      pl
Priority:       0
Node:           XXXX
Start Time:     Tue, 07 Sep 2021 11:38:36 +0100
Labels:         app=pl-monitoring
                component=vizier
                controller-uid=e0f5dc27-97b1-42e1-a036-b67ca85cd623
                job-name=cert-provisioner-job
                vizier-bootstrap=true
                vizier-name=pixie
Annotations:    vizier-name: pixie
Status:         Pending
IP:             10.28.0.133
IPs:
  IP:           10.28.0.133
Controlled By:  Job/cert-provisioner-job
Containers:
  provisioner:
    Container ID:
    Image:          gcr.io/pixie-oss/pixie-prod/vizier/cert_provisioner_image:0.9.1
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Environment Variables from:
      pl-cloud-config    ConfigMap  Optional: false
      pl-cluster-config  ConfigMap  Optional: true
    Environment:
      PL_NAMESPACE:  pl (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from updater-service-account-token-9gfgt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  updater-service-account-token-9gfgt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  updater-service-account-token-9gfgt
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  4m3s                  default-scheduler  Successfully assigned pl/cert-provisioner-job-lwm57 to XXXX
  Normal   Pulled     106s (x12 over 4m1s)  kubelet            Container image "gcr.io/pixie-oss/pixie-prod/vizier/cert_provisioner_image:0.9.1" already present on machine
  Warning  Failed     94s (x13 over 4m1s)   kubelet            Error: configmap "pl-cloud-config" not found
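For reference, the ConfigMaps the deploy actually created in the pl namespace can be listed with plain kubectl (the pl-cloud-config name is taken from the Warning event above):

kubectl get configmaps -n pl
kubectl get configmap pl-cloud-config -n pl -o yaml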
To Reproduce
It may, or may not, be relevant that my first run of px deploy failed as follows:
❯ px deploy
Pixie CLI
Running Cluster Checks:
✔ Kernel version > 4.14.0
✔ Cluster type is supported
✔ K8s version > 1.12.0
✔ Kubectl > 1.10.0 is present
✔ User can create namespace
✔ Cluster type is in list of known supported types
Installing Vizier version: 0.9.1
Generating YAMLs for Pixie
Deploying Pixie to the following cluster: XXXXX
Is the cluster correct? (y/n) [y] :
Found 3 nodes
✔ Installing OLM CRDs
✔ Deploying OLM
✔ Deploying Pixie OLM Namespace
✔ Installing Vizier CRD
✔ Deploying OLM Catalog
✔ Deploying OLM Subscription
✔ Creating namespace
✔ Deploying Vizier
✔ Waiting for Cloud Connector to come online
Waiting for Pixie to pass healthcheck
✔ Wait for PEMs/Kelvin
⠙ Wait for healthcheck
Failed to get auth credentials: open /Users/liz/.pixie/auth.json: too many open files
Trying a second time (in case fewer files were open):
px deploy
Pixie CLI
Running Cluster Checks:
✔ Kernel version > 4.14.0
✔ Cluster type is supported
✔ K8s version > 1.12.0
✔ Kubectl > 1.10.0 is present
✔ User can create namespace
✔ Cluster type is in list of known supported types
Installing Vizier version: 0.9.1
Generating YAMLs for Pixie
Deploying Pixie to the following cluster: XXXXX
Is the cluster correct? (y/n) [y] : y
Found 3 nodes
✔ Installing OLM CRDs
✔ Deploying OLM
✔ Deploying Pixie OLM Namespace
✔ Installing Vizier CRD
✔ Deploying OLM Catalog
✔ Deploying OLM Subscription
✔ Creating namespace
✔ Deploying Vizier
⠦ Waiting for Cloud Connector to come online
FATA[0368] Timed out waiting for cluster ID assignment
❯ px version
Pixie CLI
0.6.6+Distribution.de1f118.20210904012616.1
❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:40:09Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.8-gke.2100", GitCommit:"4cd085fda961821985d176d25b67445c1efb6ba1", GitTreeState:"clean", BuildDate:"2021-07-16T09:22:57Z", GoVersion:"go1.15.13b5", Compiler:"gc", Platform:"linux/amd64"}
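As an aside, the "too many open files" failure in the first run suggests the px CLI hit the shell's open-file-descriptor limit on the machine running the deploy. A possible (unverified) workaround is to raise the limit for the current shell before retrying; the 4096 value below is just an illustrative choice:

ulimit -n          # show the current soft limit
ulimit -n 4096     # raise it for this shell (only works up to the hard limit)
px deploy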
Thanks @lizrice! We recently switched to an operator-based deploy of Pixie, and we're still working to make the operator more robust after a failed installation.
Would it be possible to provide logs from the vizier-operator-* pod in the px-operator namespace?
Could you try manually deleting the operator's namespace (px-operator) and the Pixie pl namespace to clean up any leftover state, and then try a redeploy?
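A minimal sketch of that cleanup, assuming nothing else in the cluster uses these namespaces (both names come from the comment above):

kubectl delete namespace pl
kubectl delete namespace px-operator
# wait for both namespaces to finish terminating, then
px deploy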
I'm getting the same "too many open files" error in my failing deployment:
me@manager:~$ kubectl get all -n px-operator
NAME READY STATUS RESTARTS AGE
pod/177f7c084ce4fa776a013801422825de9e422efd983b350028645f0b79wr6wz 0/1 Completed 0 10h
pod/pixie-operator-index-48z5g 1/1 Running 1 (9h ago) 10h
pod/vizier-operator-68b568599-cfvmr 1/1 Running 1 (9h ago) 10h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/pixie-operator-index ClusterIP 10.152.183.79 <none> 50051/TCP 10h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/vizier-operator 1/1 1 1 10h
NAME DESIRED CURRENT READY AGE
replicaset.apps/vizier-operator-68b568599 1 1 1 10h
NAME COMPLETIONS DURATION AGE
job.batch/177f7c084ce4fa776a013801422825de9e422efd983b350028645f0b798c1ed 1/1 9s 10h
me@manager:~$ kubectl logs -n px-operator pod/vizier-operator-68b568599-cfvmr
time="2022-02-12T08:04:06Z" level=info msg="Starting manager"
time="2022-02-12T08:04:06Z" level=info msg=Reconciling... req=pl/pixie
time="2022-02-12T08:04:06Z" level=info msg="Versions matched, nothing to do"
time="2022-02-12T08:04:26Z" level=info msg=Reconciling... req=pl/pixie
time="2022-02-12T08:04:26Z" level=info msg="Deleting Vizier..." req=pl/pixie
time="2022-02-12T08:04:26Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:04:46Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:05:06Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:05:26Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:05:46Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:02Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:06Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:08Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:08Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:26Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:41Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:45Z" level=info msg="Received cancel, stopping status reconciler"
time="2022-02-12T08:13:44Z" level=info msg=Reconciling... req=pl/pixie
time="2022-02-12T08:13:44Z" level=info msg="Creating a new vizier instance"
time="2022-02-12T08:13:44Z" level=info msg="Starting a vizier deploy"
time="2022-02-12T08:13:45Z" level=info msg="Deploying Vizier configs and secrets"
time="2022-02-12T08:13:45Z" level=info msg="Generating certs"
time="2022-02-12T08:13:50Z" level=info msg="Deploying NATS"
time="2022-02-12T08:13:51Z" level=info msg="Deploying Vizier"
time="2022-02-12T08:14:49Z" level=info msg=Reconciling... req=pl/pixie
time="2022-02-12T08:14:49Z" level=info msg="Already in the process of updating, nothing to do"
time="2022-02-12T08:14:49Z" level=info msg=Reconciling... req=pl/pixie
time="2022-02-12T08:14:49Z" level=info msg="Versions matched, nothing to do"
time="2022-02-12T08:15:09Z" level=info msg=Reconciling... req=pl/pixie
time="2022-02-12T08:15:09Z" level=info msg="Versions matched, nothing to do"
me@manager:~$ date
Sat Feb 12 12:50:41 EST 2022
me@manager:~$
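For what it's worth, the "Failed to get vizier" errors above refer to the Vizier custom resource (Vizier.px.dev) named pixie in the pl namespace. Assuming the CRD's plural name is viziers, it can be inspected directly with:

kubectl get viziers.px.dev -n pl
kubectl describe viziers.px.dev pixie -n pl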
Just looking at the first post for any more details you might want before I blow all this away ...
me@manager:~$ px deploy
Pixie CLI
Running Cluster Checks:
✔ Kernel version > 4.14.0
✔ Cluster type is supported
✔ K8s version > 1.16.0
✔ Kubectl > 1.10.0 is present
✔ User can create namespace
✕ Cluster type is in list of known supported types ERR: Cluster type is not in list of known supported cluster types. Please see: https://docs.px.dev/installing-pixie/requirements/
Some cluster checks failed. Pixie may not work properly on your cluster. Continue with deploy? (y/n) [y] :
Installing Vizier version: 0.10.10
Generating YAMLs for Pixie
Deploying Pixie to the following cluster: argo
Is the cluster correct? (y/n) [y] :
Found 5 nodes
✔ Installing OLM CRDs
✔ Deploying OLM
✔ Deploying Pixie OLM Namespace
✔ Installing Vizier CRD
✔ Deploying OLM Catalog
✔ Deploying OLM Subscription
✔ Creating namespace
✔ Deploying Vizier
✔ Waiting for Cloud Connector to come online
Waiting for Pixie to pass healthcheck
✔ Wait for PEMs/Kelvin
⠼ Wait for healthcheck
Failed to get auth credentials: open /home/rob/.pixie/auth.json: too many open files
me@manager:~$
me@manager:~$ px version
Pixie CLI
0.7.2+Distribution.91e72cd.20211217141238.1
me@manager:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-26T02:20:15Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.3-2+d441060727c463", GitCommit:"d441060727c4632b67d09c9118a36a8590308676", GitTreeState:"clean", BuildDate:"2022-01-26T21:57:05Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
me@manager:~$
I seem to be able to reproduce this at will on one of my clusters, so if there is any more useful information I can provide, please let me know. I'll be completely offline for a couple weeks starting in a few days though.
Edit: Alright, I tracked my issue down thanks to a Stack Exchange (or similar) comment that mentioned needing persistent storage. I did have it, and it was the same StorageClass that was working perfectly fine for my other cluster, but for whatever reason it wasn't working for this cluster (even though PVs were being created). When I ditched that CSI driver, installed a new one, and made a new default StorageClass on top of it, Pixie installed without a hitch, quickly, on the first try.
Hopefully this tidbit of info helps someone in the future!
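For anyone hitting the same storage symptom, a quick sanity check is to confirm the cluster has a default StorageClass and that the created volumes actually reach the Bound state (standard kubectl; nothing Pixie-specific):

kubectl get storageclass   # look for one marked (default)
kubectl get pv             # check that created volumes show STATUS Bound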
For anyone bumping into this: for me it was enough to delete the related PVC and redeploy.
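A rough sketch of that approach; the exact claim name will vary, so list the PVCs first rather than deleting blindly:

kubectl get pvc -n pl                    # find the PVC(s) Pixie created
kubectl delete pvc <pvc-name> -n pl      # delete the stuck claim
px deploy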