scriptmgr-server failing readiness and liveness probes

Open hammadahmed1985 opened this issue 1 year ago • 1 comments

Describe the bug: I am trying to run self-hosted Pixie in a 3-node cluster. Here is what my environment looks like:
Kubernetes Version: v1.28.2
OS-Image: Rocky Linux 8.9 (Green Obsidian)
Kernel Version: 5.4.266-1.el8.elrepo.x86_64
Container Runtime: containerd://1.6.26
Pixie Cloud Version: 0.1.7

$ kubectl get nodes -o wide
NAME                     STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                           KERNEL-VERSION                CONTAINER-RUNTIME
rdev5-rocky8-control-1   Ready    control-plane   43d   v1.28.2   <none>        Rocky Linux 8.9 (Green Obsidian)   5.4.266-1.el8.elrepo.x86_64   containerd://1.6.26
rdev5-rocky8-worker-1    Ready    <none>          43d   v1.28.2   <none>        Rocky Linux 8.9 (Green Obsidian)   5.4.266-1.el8.elrepo.x86_64   containerd://1.6.26
rdev5-rocky8-worker-2    Ready    <none>          43d   v1.28.2   <none>        Rocky Linux 8.9 (Green Obsidian)   5.4.266-1.el8.elrepo.x86_64   containerd://1.6.26

To Reproduce: Follow the steps in https://docs.px.dev/installing-pixie/install-guides/self-hosted-pixie/#1.-deploy-pixie-cloud

Expected behavior: Pixie self-hosted cloud gets deployed.

Logs: Events from describing the failing scriptmgr-server pod:

$ kubectl -n plc describe pod/scriptmgr-server-56d97c78c7-q6s4m
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       18m                 default-scheduler  Successfully assigned plc/scriptmgr-server-56d97c78c7-q6s4m to rdev5-rocky8-worker-2
  Normal   AddedInterface  16m                 multus             Add eth0 [] from k8s-pod-network
  Normal   Created         15m (x2 over 16m)   kubelet            Created container scriptmgr-server
  Normal   Started         15m (x2 over 16m)   kubelet            Started container scriptmgr-server
  Normal   Killing         15m                 kubelet            Container scriptmgr-server failed liveness probe, will be restarted
  Warning  Unhealthy       15m (x12 over 16m)  kubelet            Readiness probe failed: Get "": dial tcp connect: connection refused
  Warning  Unhealthy       15m (x6 over 16m)   kubelet            Liveness probe failed: Get "": dial tcp connect: connection refused
  Normal   Pulled          11m (x7 over 16m)   kubelet            Container image "gcr.io/pixie-oss/pixie-prod/cloud/scriptmgr_server_image:0.1.7" already present on machine

hammadahmed1985, Feb 22 '24 20:02

I ran into the same issue, described in #1838.

I got these errors in the scriptmgr-server logs:

time="2024-02-14T00:29:10Z" level=error msg="Failed to update store using bundle.json from gcs." bucket=pixie-prod-artifacts error="rpc error: code = Internal desc = failed to download bundle.json" path=script-bundles/bundle-oss.json
time="2024-02-14T00:30:40Z" level=error msg="Failed to get attrs of bundle.json" bucket=pixie-prod-artifacts error="Get \"https://storage.googleapis.com/storage/v1/b/pixie-prod-artifacts/o/script-bundles%2Fbundle-oss.json?alt=json&prettyPrint=false&projection=full\": dial tcp i/o timeout" path=script-bundles/bundle-oss.json

The workaround for me was to set failureThreshold to 5 on both the livenessProbe and the readinessProbe:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- yamls/cloud_deps_elastic_operator.yaml
- yamls/cloud_deps.yaml
- yamls/cloud.yaml
- yamls/cloud_ingress_grpcs.yaml
- yamls/cloud_ingress_https.yaml

patches:
- target:
    group: apps
    version: v1
    kind: Deployment
    name: scriptmgr-server
  patch: |-
    - op: add
      path: /spec/template/spec/containers/0/livenessProbe/failureThreshold
      value: 5
    - op: add
      path: /spec/template/spec/containers/0/readinessProbe/failureThreshold
      value: 5
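
To check that the higher thresholds actually landed on the Deployment, something like the following may help. It is only a sketch: it assumes Pixie Cloud runs in the plc namespace (as in the describe output above) and that scriptmgr-server is the first container in the pod spec (as the patch paths imply). The helper name probe_threshold is made up for illustration.

```shell
# Hypothetical helper: print the failureThreshold of one probe on the
# scriptmgr-server Deployment. Pass livenessProbe or readinessProbe.
# Assumes the plc namespace and that scriptmgr-server is container 0.
probe_threshold() {
  kubectl -n plc get deployment scriptmgr-server \
    -o jsonpath="{.spec.template.spec.containers[0].$1.failureThreshold}"
}

# Example calls (left as comments so that sourcing this file has no side effects):
#   probe_threshold livenessProbe
#   probe_threshold readinessProbe
```

If the kustomization was applied (for example with kubectl apply -k . from the directory holding it, which is an assumption about the layout), both calls should report 5. Running kubectl -n plc rollout status deployment/scriptmgr-server afterwards shows whether the pod now stays up long enough to become ready.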

gofrolist, Feb 23 '24 01:02