cert-provisioner-job failed due to missing pl-cloud-config ConfigMap
Describe the bug
Running px deploy failed with the error FATA[0368] Timed out waiting for cluster ID assignment.
Inspecting the pods showed that the cert-provisioner-job pod was stuck in the CreateContainerConfigError state.
Output from kubectl describe pod cert-provisioner-job-lwm57 -n pl:
Name:           cert-provisioner-job-lwm57
Namespace:      pl
Priority:       0
Node:           XXXX
Start Time:     Tue, 07 Sep 2021 11:38:36 +0100
Labels:         app=pl-monitoring
                component=vizier
                controller-uid=e0f5dc27-97b1-42e1-a036-b67ca85cd623
                job-name=cert-provisioner-job
                vizier-bootstrap=true
                vizier-name=pixie
Annotations:    vizier-name: pixie
Status:         Pending
IP:             10.28.0.133
IPs:
  IP:           10.28.0.133
Controlled By:  Job/cert-provisioner-job
Containers:
  provisioner:
    Container ID:
    Image:          gcr.io/pixie-oss/pixie-prod/vizier/cert_provisioner_image:0.9.1
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Environment Variables from:
      pl-cloud-config    ConfigMap  Optional: false
      pl-cluster-config  ConfigMap  Optional: true
    Environment:
      PL_NAMESPACE:  pl (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from updater-service-account-token-9gfgt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  updater-service-account-token-9gfgt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  updater-service-account-token-9gfgt
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  4m3s                  default-scheduler  Successfully assigned pl/cert-provisioner-job-lwm57 to XXXX
  Normal   Pulled     106s (x12 over 4m1s)  kubelet            Container image "gcr.io/pixie-oss/pixie-prod/vizier/cert_provisioner_image:0.9.1" already present on machine
  Warning  Failed     94s (x13 over 4m1s)   kubelet            Error: configmap "pl-cloud-config" not found
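For reference, the ConfigMaps the deploy actually created in the pl namespace can be listed with plain kubectl (the pl-cloud-config name is taken from the Warning event above):

kubectl get configmaps -n pl
kubectl get configmap pl-cloud-config -n pl -o yaml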
To Reproduce
It may, or may not, be relevant that my first run of px deploy failed as follows:
❯ px deploy
Pixie CLI
Running Cluster Checks:
✔ Kernel version > 4.14.0
✔ Cluster type is supported
✔ K8s version > 1.12.0
✔ Kubectl > 1.10.0 is present
✔ User can create namespace
✔ Cluster type is in list of known supported types
Installing Vizier version: 0.9.1
Generating YAMLs for Pixie
Deploying Pixie to the following cluster: XXXXX
Is the cluster correct? (y/n) [y] :
Found 3 nodes
✔ Installing OLM CRDs
✔ Deploying OLM
✔ Deploying Pixie OLM Namespace
✔ Installing Vizier CRD
✔ Deploying OLM Catalog
✔ Deploying OLM Subscription
✔ Creating namespace
✔ Deploying Vizier
✔ Waiting for Cloud Connector to come online
Waiting for Pixie to pass healthcheck
✔ Wait for PEMs/Kelvin
⠙ Wait for healthcheck
Failed to get auth credentials: open /Users/liz/.pixie/auth.json: too many open files
Trying a second time (in case fewer files were open):
px deploy
Pixie CLI
Running Cluster Checks:
✔ Kernel version > 4.14.0
✔ Cluster type is supported
✔ K8s version > 1.12.0
✔ Kubectl > 1.10.0 is present
✔ User can create namespace
✔ Cluster type is in list of known supported types
Installing Vizier version: 0.9.1
Generating YAMLs for Pixie
Deploying Pixie to the following cluster: XXXXX
Is the cluster correct? (y/n) [y] : y
Found 3 nodes
✔ Installing OLM CRDs
✔ Deploying OLM
✔ Deploying Pixie OLM Namespace
✔ Installing Vizier CRD
✔ Deploying OLM Catalog
✔ Deploying OLM Subscription
✔ Creating namespace
✔ Deploying Vizier
⠦ Waiting for Cloud Connector to come online
FATA[0368] Timed out waiting for cluster ID assignment
❯ px version
Pixie CLI
0.6.6+Distribution.de1f118.20210904012616.1
❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:40:09Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.8-gke.2100", GitCommit:"4cd085fda961821985d176d25b67445c1efb6ba1", GitTreeState:"clean", BuildDate:"2021-07-16T09:22:57Z", GoVersion:"go1.15.13b5", Compiler:"gc", Platform:"linux/amd64"}
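As an aside, the "too many open files" failure in the first run suggests the px CLI hit the shell's open-file-descriptor limit on the machine running the deploy. A possible (unverified) workaround is to raise the limit for the current shell before retrying; the 4096 value below is just an illustrative choice:

ulimit -n          # show the current soft limit
ulimit -n 4096     # raise it for this shell (only works up to the hard limit)
px deploy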
Thanks @lizrice! We recently switched to an operator-based deploy of Pixie, and we're still working to make the operator more robust after a failed installation.
Would it be possible to provide logs from the vizier-operator-* pod in the px-operator namespace?
Could you try manually deleting the operator's namespace (px-operator) and the Pixie pl namespace to clean up any leftover state, and then try a redeploy?
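A minimal sketch of that cleanup, assuming nothing else in the cluster uses these namespaces (both names come from the comment above):

kubectl delete namespace pl
kubectl delete namespace px-operator
# wait for both namespaces to finish terminating, then
px deploy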
I'm getting the same "too many open files" error in my failing deployment:
me@manager:~$ kubectl get all -n px-operator
NAME READY STATUS RESTARTS AGE
pod/177f7c084ce4fa776a013801422825de9e422efd983b350028645f0b79wr6wz 0/1 Completed 0 10h
pod/pixie-operator-index-48z5g 1/1 Running 1 (9h ago) 10h
pod/vizier-operator-68b568599-cfvmr 1/1 Running 1 (9h ago) 10h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/pixie-operator-index ClusterIP 10.152.183.79 <none> 50051/TCP 10h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/vizier-operator 1/1 1 1 10h
NAME DESIRED CURRENT READY AGE
replicaset.apps/vizier-operator-68b568599 1 1 1 10h
NAME COMPLETIONS DURATION AGE
job.batch/177f7c084ce4fa776a013801422825de9e422efd983b350028645f0b798c1ed 1/1 9s 10h
me@manager:~$ kubectl logs -n px-operator pod/vizier-operator-68b568599-cfvmr
time="2022-02-12T08:04:06Z" level=info msg="Starting manager"
time="2022-02-12T08:04:06Z" level=info msg=Reconciling... req=pl/pixie
time="2022-02-12T08:04:06Z" level=info msg="Versions matched, nothing to do"
time="2022-02-12T08:04:26Z" level=info msg=Reconciling... req=pl/pixie
time="2022-02-12T08:04:26Z" level=info msg="Deleting Vizier..." req=pl/pixie
time="2022-02-12T08:04:26Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:04:46Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:05:06Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:05:26Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:05:46Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:02Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:06Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:08Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:08Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:26Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:41Z" level=error msg="Failed to get vizier" error="Vizier.px.dev \"pixie\" not found"
time="2022-02-12T08:06:45Z" level=info msg="Received cancel, stopping status reconciler"
time="2022-02-12T08:13:44Z" level=info msg=Reconciling... req=pl/pixie
time="2022-02-12T08:13:44Z" level=info msg="Creating a new vizier instance"
time="2022-02-12T08:13:44Z" level=info msg="Starting a vizier deploy"
time="2022-02-12T08:13:45Z" level=info msg="Deploying Vizier configs and secrets"
time="2022-02-12T08:13:45Z" level=info msg="Generating certs"
time="2022-02-12T08:13:50Z" level=info msg="Deploying NATS"
time="2022-02-12T08:13:51Z" level=info msg="Deploying Vizier"
time="2022-02-12T08:14:49Z" level=info msg=Reconciling... req=pl/pixie
time="2022-02-12T08:14:49Z" level=info msg="Already in the process of updating, nothing to do"
time="2022-02-12T08:14:49Z" level=info msg=Reconciling... req=pl/pixie
time="2022-02-12T08:14:49Z" level=info msg="Versions matched, nothing to do"
time="2022-02-12T08:15:09Z" level=info msg=Reconciling... req=pl/pixie
time="2022-02-12T08:15:09Z" level=info msg="Versions matched, nothing to do"
me@manager:~$ date
Sat Feb 12 12:50:41 EST 2022
me@manager:~$
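For what it's worth, the "Failed to get vizier" errors above refer to the Vizier custom resource (Vizier.px.dev) named pixie in the pl namespace. Assuming the CRD's plural name is viziers, it can be inspected directly with:

kubectl get viziers.px.dev -n pl
kubectl describe viziers.px.dev pixie -n pl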
Just looking at the first post for any more details you might want before I blow all this away ...
me@manager:~$ px deploy
Pixie CLI
Running Cluster Checks:
✔ Kernel version > 4.14.0
✔ Cluster type is supported
✔ K8s version > 1.16.0
✔ Kubectl > 1.10.0 is present
✔ User can create namespace
✕ Cluster type is in list of known supported types ERR: Cluster type is not in list of known supported cluster types. Please see: https://docs.px.dev/installing-pixie/requirements/
Some cluster checks failed. Pixie may not work properly on your cluster. Continue with deploy? (y/n) [y] :
Installing Vizier version: 0.10.10
Generating YAMLs for Pixie
Deploying Pixie to the following cluster: argo
Is the cluster correct? (y/n) [y] :
Found 5 nodes
✔ Installing OLM CRDs
✔ Deploying OLM
✔ Deploying Pixie OLM Namespace
✔ Installing Vizier CRD
✔ Deploying OLM Catalog
✔ Deploying OLM Subscription
✔ Creating namespace
✔ Deploying Vizier
✔ Waiting for Cloud Connector to come online
Waiting for Pixie to pass healthcheck
✔ Wait for PEMs/Kelvin
⠼ Wait for healthcheck
Failed to get auth credentials: open /home/rob/.pixie/auth.json: too many open files
me@manager:~$
me@manager:~$ px version
Pixie CLI
0.7.2+Distribution.91e72cd.20211217141238.1
me@manager:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-26T02:20:15Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.3-2+d441060727c463", GitCommit:"d441060727c4632b67d09c9118a36a8590308676", GitTreeState:"clean", BuildDate:"2022-01-26T21:57:05Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
me@manager:~$
I seem to be able to reproduce this at will on one of my clusters, so if there is any more useful information I can provide, please let me know. I'll be completely offline for a couple weeks starting in a few days though.
Edit: Alright, I tracked my issue down thanks to a Stack Exchange (or similar) comment that mentioned needing persistent storage. I did have it, and it was the same StorageClass that was working perfectly fine for my other cluster, but for whatever reason it wasn't working for this cluster (even though PVs were being created). When I ditched that CSI driver, installed a new one, and made a new default StorageClass on top of it, Pixie installed without a hitch, quickly, on the first try.
Hopefully this tidbit of info helps someone in the future!
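For anyone hitting the same storage symptom, a quick sanity check is to confirm the cluster has a default StorageClass and that the created volumes actually reach the Bound state (standard kubectl; nothing Pixie-specific):

kubectl get storageclass   # look for one marked (default)
kubectl get pv             # check that created volumes show STATUS Bound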
For anyone bumping into this: for me it was enough to delete the related PVC and redeploy.
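A rough sketch of that approach; the exact claim name will vary, so list the PVCs first rather than deleting blindly:

kubectl get pvc -n pl                    # find the PVC(s) Pixie created
kubectl delete pvc <pvc-name> -n pl      # delete the stuck claim
px deploy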