scriptmgr-server failing readiness and liveness probes
Describe the bug
I am trying to run self-hosted Pixie in a 3-node cluster. Here is what my environment looks like:
Kubernetes Version: v1.28.2
OS Image: Rocky Linux 8.9 (Green Obsidian)
Kernel Version: 5.4.266-1.el8.elrepo.x86_64
Container Runtime: containerd://1.6.26
Pixie Cloud Version: 0.1.7
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
rdev5-rocky8-control-1 Ready control-plane 43d v1.28.2 10.76.110.148 <none> Rocky Linux 8.9 (Green Obsidian) 5.4.266-1.el8.elrepo.x86_64 containerd://1.6.26
rdev5-rocky8-worker-1 Ready <none> 43d v1.28.2 10.76.110.140 <none> Rocky Linux 8.9 (Green Obsidian) 5.4.266-1.el8.elrepo.x86_64 containerd://1.6.26
rdev5-rocky8-worker-2 Ready <none> 43d v1.28.2 10.76.110.136 <none> Rocky Linux 8.9 (Green Obsidian) 5.4.266-1.el8.elrepo.x86_64 containerd://1.6.26
To Reproduce
Steps to reproduce the behavior: follow the self-hosted install guide at https://docs.px.dev/installing-pixie/install-guides/self-hosted-pixie/#1.-deploy-pixie-cloud
Expected behavior
The self-hosted Pixie Cloud deploys successfully.
Logs
Output of the following command:
$ kubectl -n plc describe pod/scriptmgr-server-56d97c78c7-q6s4m
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 18m default-scheduler Successfully assigned plc/scriptmgr-server-56d97c78c7-q6s4m to rdev5-rocky8-worker-2
Normal AddedInterface 16m multus Add eth0 [192.168.84.196/32] from k8s-pod-network
Normal Created 15m (x2 over 16m) kubelet Created container scriptmgr-server
Normal Started 15m (x2 over 16m) kubelet Started container scriptmgr-server
Normal Killing 15m kubelet Container scriptmgr-server failed liveness probe, will be restarted
Warning Unhealthy 15m (x12 over 16m) kubelet Readiness probe failed: Get "https://192.168.84.196:52000/healthz": dial tcp 192.168.84.196:52000: connect: connection refused
Warning Unhealthy 15m (x6 over 16m) kubelet Liveness probe failed: Get "https://192.168.84.196:52000/healthz": dial tcp 192.168.84.196:52000: connect: connection refused
Normal Pulled 11m (x7 over 16m) kubelet Container image "gcr.io/pixie-oss/pixie-prod/cloud/scriptmgr_server_image:0.1.7" already present on machine
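For reference, the container logs below were collected with commands along these lines (the pod name is specific to my cluster):

```shell
# Logs from the currently running scriptmgr-server container.
kubectl -n plc logs scriptmgr-server-56d97c78c7-q6s4m

# Logs from the previous (restarted) container instance, which often
# holds the error that triggered the liveness-probe kill.
kubectl -n plc logs scriptmgr-server-56d97c78c7-q6s4m --previous
```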
I've faced the same issue as the one described in #1838.
I got these errors in the scriptmgr-server logs:
time="2024-02-14T00:29:10Z" level=error msg="Failed to update store using bundle.json from gcs." bucket=pixie-prod-artifacts error="rpc error: code = Internal desc = failed to download bundle.json" path=script-bundles/bundle-oss.json
time="2024-02-14T00:30:40Z" level=error msg="Failed to get attrs of bundle.json" bucket=pixie-prod-artifacts error="Get \"https://storage.googleapis.com/storage/v1/b/pixie-prod-artifacts/o/script-bundles%2Fbundle-oss.json?alt=json&prettyPrint=false&projection=full\": dial tcp 142.250.176.27:443: i/o timeout" path=script-bundles/bundle-oss.json
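These i/o timeouts suggest the pod cannot reach storage.googleapis.com before the probes start failing. As a quick sanity check (my own addition, not part of the install guide), something like the following can test egress from the same namespace; the pod name and image are illustrative:

```shell
# Spin up a throwaway pod and try to fetch the bundle metadata that
# scriptmgr-server requests on startup. A 200 means egress to GCS works;
# a timeout points at network policy, proxy, or DNS issues in the cluster.
kubectl -n plc run gcs-egress-test --rm -i --restart=Never \
  --image=curlimages/curl -- \
  curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 \
  'https://storage.googleapis.com/storage/v1/b/pixie-prod-artifacts/o/script-bundles%2Fbundle-oss.json'
```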
The workaround for me was to set failureThreshold to 5 on both the livenessProbe and the readinessProbe, which gives the container enough time to finish (or time out on) the slow GCS fetch before the kubelet restarts it:
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - yamls/cloud_deps_elastic_operator.yaml
  - yamls/cloud_deps.yaml
  - yamls/cloud.yaml
  - yamls/cloud_ingress_grpcs.yaml
  - yamls/cloud_ingress_https.yaml
patches:
  - target:
      group: apps
      version: v1
      kind: Deployment
      name: scriptmgr-server
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/livenessProbe/failureThreshold
        value: 5
      - op: add
        path: /spec/template/spec/containers/0/readinessProbe/failureThreshold
        value: 5
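With this kustomization.yaml in place, the patched manifests can be applied and verified roughly like this (run from the directory containing the file):

```shell
# Build and apply the kustomization; kubectl 1.14+ has kustomize built in.
kubectl apply -k .

# Confirm the patched threshold landed on the deployment (should print 5).
kubectl -n plc get deployment scriptmgr-server \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe.failureThreshold}'
```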