
Pixie UI's cluster details page fails to show unhealthy data plane pods

ddelnano opened this issue on Mar 19, 2025

Describe the bug

The Pixie UI has a way to surface unhealthy pods. This is visible by clicking the Clusters icon in the sidebar, selecting a cluster, and opening the "Pixie pods" tab.

(Screenshot: the "Pixie pods" tab on the cluster details page)

This tab is extremely helpful for debugging issues that cause a Vizier to be unhealthy. For example, if the query broker or metadata pods are failing, pxl scripts cannot execute, and that loss of visibility is hardest to work around in environments where there isn't direct access to the k8s cluster the Vizier is running on.

This is where the "Pixie pods" tab comes in: it shows each pod's status as seen by the K8s API server, along with the events related to those pods. As long as the Vizier cloud connector service is running, the rest of the Vizier's pods and events can be inspected without direct k8s cluster access.
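
For context, the cloud connector collects this data from the K8s API server. Below is a minimal client-go sketch of that pattern; the in-cluster config wiring, the "pl" namespace, and the error handling are my assumptions for illustration, while the "name=kelvin" label selector mirrors the vzinfo.go code shown later in this issue.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster, as the cloud connector pod does.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ns := "pl" // assumed Vizier namespace

	// List the kelvin pods using the same label selector as vzinfo.go.
	pods, err := clientset.CoreV1().Pods(ns).List(context.Background(), metav1.ListOptions{
		LabelSelector: "name=kelvin",
	})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		fmt.Printf("%s phase=%s\n", pod.Name, pod.Status.Phase)

		// Pull the events associated with this pod.
		events, err := clientset.CoreV1().Events(ns).List(context.Background(), metav1.ListOptions{
			FieldSelector: "involvedObject.name=" + pod.Name,
		})
		if err != nil {
			continue
		}
		for _, ev := range events.Items {
			fmt.Printf("  event: reason=%s message=%s\n", ev.Reason, ev.Message)
		}
	}
}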

I recently noticed that if a kelvin pod is in a CrashLoopBackOff, it never shows up in the unhealthy data plane pod list. This inhibits debugging kelvin issues, and the same problem appears to apply to the PEMs.

To Reproduce

Steps to reproduce the behavior:

  1. Skaffold a vizier with the following change:
diff --git a/src/vizier/services/agent/kelvin/kelvin_main.cc b/src/vizier/services/agent/kelvin/kelvin_main.cc
index 20dee4787..19bd2b170 100644
--- a/src/vizier/services/agent/kelvin/kelvin_main.cc
+++ b/src/vizier/services/agent/kelvin/kelvin_main.cc
@@ -99,6 +99,8 @@ int main(int argc, char** argv) {
                                        FLAGS_rpc_port, FLAGS_nats_url, mds_addr, kernel_info)
                      .ConsumeValueOrDie();

+  sleep(20);
+  LOG(FATAL) << "Simulating continuous kelvin crash";
   TerminationHandler::set_manager(manager.get());

   PX_CHECK_OK(manager->Run());

  2. Navigate to the "Pixie pods" tab for this cluster
  3. See that kelvin is missing from the "Sample of Unhealthy Data Plane Pods" section

Expected behavior

Kelvin should show up in this section when it's in a CrashLoopBackOff or has restarted recently.

Logs

pl-vizier-cloud-connector-57bb59cffd-lpw75-1742391400868703897.log

App information (please complete the following information):

  • Pixie version: 0.14.15
  • K8s cluster version: N/A
  • Node Kernel version: N/A
  • Browser version: N/A

Additional context

This issue occurs because this check (kelvinPod.Status.Phase != corev1.PodRunning in vzinfo.go) never evaluates to true. I skaffolded a change to log the kelvinPod.Status field and saw that there are other indicators that kelvin is unhealthy, but the k8s API still reports the pod's phase as "Running"

Formatted kelvin pod status:
Phase: Running
Conditions:
  - Type: PodReadyToStartContainers
    Status: True
    LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
    LastTransitionTime: 2025-03-19 13:44:40 +0000 UTC
    Reason: ""
    Message: ""
  - Type: Initialized
    Status: True
    LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
    LastTransitionTime: 2025-03-19 13:44:41 +0000 UTC
    Reason: ""
    Message: ""
  - Type: Ready
    Status: False
    LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
    LastTransitionTime: 2025-03-19 13:47:45 +0000 UTC
    Reason: ContainersNotReady
    Message: "containers with unready status: [app]"
  - Type: ContainersReady
    Status: False
    LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
    LastTransitionTime: 2025-03-19 13:47:45 +0000 UTC
    Reason: ContainersNotReady
    Message: "containers with unready status: [app]"
  - Type: PodScheduled
    Status: True
    LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
    LastTransitionTime: 2025-03-19 13:44:38 +0000 UTC
    Reason: ""
    Message: ""

Message: ""
Reason: ""
NominatedNodeName: ""
HostIP: 10.130.0.70
PodIP: 10.240.192.63
PodIPs:
  - IP: 10.240.192.63
StartTime: 2025-03-19 13:44:38 +0000 UTC

InitContainerStatuses:
  - Name: cc-wait
    State:
      Terminated:
        ExitCode: 0
        Signal: 0
        Reason: Completed
        Message: ""
        StartedAt: 2025-03-19 13:44:39 +0000 UTC
        FinishedAt: 2025-03-19 13:44:39 +0000 UTC
        ContainerID: containerd://0dfbd2a45c242eab714fcadc16bfe475e38085900c73ec1f875ef7ef326f89b2
    LastTerminationState: {}
    Ready: true
    RestartCount: 0
    Image: sha256:3148ec916ea71d90f1beae623b3c5eb4a2db5a585db3178d9619bc2feb8f5f49
    ImageID: ghcr.io/pixie-io/pixie-oss-pixie-dev-public-curl@sha256:f7f265d5c64eb4463a43a99b6bf773f9e61a50aaa7cefaf564f43e42549a01dd
    ContainerID: containerd://0dfbd2a45c242eab714fcadc16bfe475e38085900c73ec1f875ef7ef326f89b2
    Started: true

  - Name: qb-wait
    State:
      Terminated:
        ExitCode: 0
        Signal: 0
        Reason: Completed
        Message: ""
        StartedAt: 2025-03-19 13:44:40 +0000 UTC
        FinishedAt: 2025-03-19 13:44:40 +0000 UTC
        ContainerID: containerd://2766733c167a22366aa8c5d71d36d6bcf08068830383b1ba6b77338cd4fe08eb
    LastTerminationState: {}
    Ready: true
    RestartCount: 0
    Image: sha256:3148ec916ea71d90f1beae623b3c5eb4a2db5a585db3178d9619bc2feb8f5f49
    ImageID: ghcr.io/pixie-io/pixie-oss-pixie-dev-public-curl@sha256:f7f265d5c64eb4463a43a99b6bf773f9e61a50aaa7cefaf564f43e42549a01dd
    ContainerID: containerd://2766733c167a22366aa8c5d71d36d6bcf08068830383b1ba6b77338cd4fe08eb
    Started: true

ContainerStatuses:
  - Name: app
    State:
      Waiting:
        Reason: CrashLoopBackOff
        Message: "back-off 1m20s restarting failed container=app pod=kelvin-648fbffc65-q5rqj_pl(5807c046-07b2-4e23-9064-20a61bcd25e3)"
    LastTerminationState:
      Terminated:
        ExitCode: 139
        Signal: 0
        Reason: Error
        Message: ""
        StartedAt: 2025-03-19 13:47:23 +0000 UTC
        FinishedAt: 2025-03-19 13:47:44 +0000 UTC
        ContainerID: containerd://d375a7a35c0a601674f91789d86d70e18a6745a5da4d32a40540e2f64d509dc7
    Ready: false
    RestartCount: 4
    Image: sha256:3d4e10294dca88e36ddc32608ff4052d4933f64366884455a09f1ea76cc71247
    ImageID: us-west1-docker.pkg.dev/csmc-dev/csmc-releases/vizier-kelvin_image@sha256:64775780e811445e7cffade2d02ba1fa3dcc2295326a5b704aad556409f61680
    ContainerID: containerd://d375a7a35c0a601674f91789d86d70e18a6745a5da4d32a40540e2f64d509dc7
    Started: false

QOSClass: BestEffort
EphemeralContainerStatuses: []
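
In the status above, Phase is still "Running" even though the app container is crash looping; only the Ready/ContainersReady conditions and the container's Waiting state with reason CrashLoopBackOff reveal the problem. As a sketch of the kind of broader check that would catch this (illustrative only, not the actual fix; the package name and the isPodUnhealthy helper are hypothetical):

package podhealth

import (
	corev1 "k8s.io/api/core/v1"
)

// isPodUnhealthy is a hypothetical helper showing checks that go beyond the
// pod phase, which stays "Running" during a crash loop.
func isPodUnhealthy(pod corev1.Pod) bool {
	// Pending, Failed, and Unknown phases are still treated as unhealthy.
	if pod.Status.Phase != corev1.PodRunning {
		return true
	}
	// A "Running" pod can still have containers that are not ready or are
	// stuck in a Waiting state such as CrashLoopBackOff.
	for _, cs := range pod.Status.ContainerStatuses {
		if !cs.Ready {
			return true
		}
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	// Equivalently, the pod-level Ready condition is False in the status above.
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionFalse {
			return true
		}
	}
	return false
}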

The full log has been uploaded under the Logs header above. I used the following diff to collect those logs and to confirm that, with the phase check bypassed, it surfaced the kelvin pod status and events I was expecting:

diff --git a/src/vizier/services/cloud_connector/bridge/vzinfo.go b/src/vizier/services/cloud_connector/bridge/vzinfo.go
index 4e967d152..9d4b76734 100644
--- a/src/vizier/services/cloud_connector/bridge/vzinfo.go
+++ b/src/vizier/services/cloud_connector/bridge/vzinfo.go
@@ -384,14 +384,17 @@ func (v *K8sVizierInfo) getDataPlaneState() (int32, int32, map[string]*cvmsgspb.
        kelvinPodsList, err := v.clientset.CoreV1().Pods(v.ns).List(context.Background(), metav1.ListOptions{
                LabelSelector: "name=kelvin",
        })
+       fmt.Printf("Kelvin pods: %+v\n", kelvinPodsList)
        if err != nil {
                log.WithError(err).Error("Error fetching Kelvin pods")
                return 0, 0, nil, err
        }
        for _, kelvinPod := range kelvinPodsList.Items {
-               if kelvinPod.Status.Phase != corev1.PodRunning {
-                       unhealthyDataPlanePods = append(unhealthyDataPlanePods, kelvinPod)
-               }
+               fmt.Printf("Kelvin pod loop: %+v\n", kelvinPod.Status)
+               // if kelvinPod.Status.Phase != corev1.PodRunning {
+               fmt.Println("Appending kelvin pod to unhealthy data plane pods")
+               unhealthyDataPlanePods = append(unhealthyDataPlanePods, kelvinPod)
+               // }
        }

        var unhealthyPEMPods []corev1.Pod
@@ -435,12 +438,14 @@ func (v *K8sVizierInfo) getDataPlaneState() (int32, int32, map[string]*cvmsgspb.
 // UpdateK8sState gets the relevant state of the cluster, such as pod statuses, at the current moment in time.
 func (v *K8sVizierInfo) UpdateK8sState() {
        controlPlanePods, err := v.getControlPlanePodStatuses()
+       fmt.Printf("Control plane pods: %+v\n", controlPlanePods)
        if err != nil {
                log.WithError(err).Error("Error fetching control plane pod statuses")
                return
        }

        numNodes, numInstrumentedNodes, unhealthyDataPlanePods, err := v.getDataPlaneState()
+       fmt.Printf("Num nodes: %d, num instrumented nodes: %d unhealthy data plane pods: %+v\n", numNodes, numInstrumentedNodes, unhealthyDataPlanePods)
        if err != nil {
                log.WithError(err).Error("Error fetching data plane pod information")
                return
@@ -482,6 +487,8 @@ func (v *K8sVizierInfo) GetK8sState() *K8sState {
        v.mu.Lock()
        defer v.mu.Unlock()

+       fmt.Printf("controlPlanePodStatuses: %+v\n", v.controlPlanePodStatuses)
+       fmt.Printf("unhealthyDataPlanePodStatuses: %+v\n", v.unhealthyDataPlanePodStatuses)
        return &K8sState{
                ControlPlanePodStatuses:       copyPodStatus(v.controlPlanePodStatuses),
                UnhealthyDataPlanePodStatuses: copyPodStatus(v.unhealthyDataPlanePodStatuses),

(Screenshot: the UI surfacing the kelvin pod status and events after applying the diff above)
