
Pixie is missing data about many pods and services in the cluster

Open nilushancosta opened this issue 2 years ago • 6 comments

Describe the bug

I encountered an issue in a self-hosted installation where Pixie is missing information about the cluster.

E.g., when I checked the pods in a namespace using the px/namespace script from the UI and the CLI, only 8 pods were shown, but when I checked with kubectl, I saw 90+ pods. Similarly, Pixie showed 6 services whereas kubectl showed 40+ services.

Also, at times, when I try to view details of a pod in the Pixie UI, there is no data for it. E.g., I selected a running pod from the cluster and entered its name in the px/pod script in the UI, but nothing was shown for it. I could only see a "No data available for inbound_requests table" message. (All the widgets in px/pod had the same "no data available" error message.) The start time I set in the Pixie UI was less than the pod's uptime as well.

I also noticed that autocomplete in the Pixie UI doesn't always show the correct resources. E.g., in px/pod, a pod suggested by autocomplete does not exist in the cluster (it was probably replaced by a new pod).

I also noticed that in the deployment, the vizier-metadata and vizier-cloud-connector pods had many restarts. When I checked the pods, the state change reason for the container was shown as Error.

At times, newly created pods appear in Pixie. So this doesn't seem to be a case where Pixie is unable to get any information at all about new pods.

To Reproduce

Not sure how to reproduce this.

Expected behavior

Expected to see all pods and services of the cluster in Pixie.

Logs

Log lines containing "error" in vizier-metadata. I am including all the repeated log lines because I want to show that they were printed within a short interval (multiple lines within a single second at times).

kubectl logs -f vizier-metadata-0 -n pl | grep -i Error
time="2022-07-13T17:34:32Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:32Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:35Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:35Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:35Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:35Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:36Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:36Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:39Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:56:40Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:56:42Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T18:04:05Z" level=info msg="Failed to get update version for topic" error="<nil>"
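The burst pattern can also be summarized by tallying the timestamps per second instead of reading every line. A minimal sketch (run here against three sample lines standing in for the live `kubectl logs vizier-metadata-0 -n pl | grep -i error` output):

```python
import re
from collections import Counter

# Sample lines standing in for the live vizier-metadata log output.
lines = [
    'time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"',
    'time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"',
    'time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"',
]

# Pull out the time="..." field of each line and count repetitions per second.
counts = Counter(
    m.group(1) for line in lines if (m := re.search(r'time="([^"]+)"', line))
)
for ts, n in sorted(counts.items()):
    print(ts, n)
# → 2022-07-13T17:34:33Z 2
# → 2022-07-13T17:34:34Z 1
```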

vizier-cloud-connector had the following errors repeated multiple times:

time="2022-07-13T18:34:46Z" level=info msg="failed vizier health check, will restart healthcheck" error="rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: INTERNAL_ERROR"
time="2022-07-13T18:34:46Z" level=info msg="failed vizier health check, will restart healthcheck" error="context canceled"

App information (please complete the following information):

  • Pixie version: 0.7.14
  • K8s cluster version: 1.21.9
  • Node Kernel version:
  • Browser version: Chrome Version 103.0.5060.114 (Official Build) (x86_64)

Additional context

nilushancosta avatar Jul 14 '22 09:07 nilushancosta

As per the discussion in the Pixie Slack channel, I deleted the existing vizier-metadata pod and let the newly created one sync with the K8s API. However, pods are still missing in Pixie.

Even in the new pod, the Failed to get update version for topic" error="<nil>" log line is still there. But there have been no restarts even after 130m.

nilushancosta avatar Jul 14 '22 12:07 nilushancosta

Hi @nilushancosta ! Let's gather more information to help us debug this issue. The logs you included (Failed to get update version for topic) are expected when the vizier-metadata pod starts up. Do you continue to see them even after the pod has been running for a while? It could be helpful to send over all of your logs, if you are comfortable with that. You can collect all Pixie logs by running px collect-logs. I'm happy to receive them over Slack or email.

Another thing that may be useful is if you go to the scratchpad script in the Pixie UI and run this script:

import px

px.display(px._DebugMDState())

It will give you metadata debug information. Looking at the row for the kelvin pod should show all of the metadata resources it has received from the metadata service. Are your resources also missing in that debug information?

aimichelle avatar Jul 19 '22 19:07 aimichelle

@aimichelle thanks for looking into this. The "Failed to get update version for topic" message has been printed from time to time even after the pod started.

Also, when I run the above script from the scratchpad, I keep getting a GRPCSinkNode 14 encountered error sending stream error (screenshot attached).

Let me collect the logs and send them to you directly.

Also, when this issue first occurred, I deleted the vizier-metadata pod to see if it would recover from this problem. I noticed that this pod stores data on a Persistent Volume, which I did not delete at the time. Do you think deleting the PV would clear any old data and recover from this issue?

nilushancosta avatar Jul 20 '22 14:07 nilushancosta

I deleted the PV along with the vizier-metadata pod and created a new PV. However, this did not resolve the issue.

nilushancosta avatar Aug 03 '22 14:08 nilushancosta

I upgraded the Pixie deployment to the latest version. However, the issue still exists.

Here are some more findings about the nature of this problem.

  1. When pods were missing in Pixie, I checked the nodes where those pods were scheduled. I noticed that not all pods on a given node were missing: some pods on a node were shown in Pixie while others weren't.
  2. When I restarted the PEM pod on that particular node (where a missing pod was scheduled), the missing pods started showing up in Pixie. However, there was no green tick next to those pods in the list (I was using px/namespace) and the create time was always shown as "01/01/1970, 05:30:00" for them. The JSON details of one such pod are as follows:
{
  pod: NAMESPACE_REDACTED/POD_NAME_REDACTED,
  rss: 36120055.43812709,
  vsize: 428474368,
  create_time: 0,
  status: {
    phase: Unknown,
    reason: (empty string), 
    message: (empty string),
    ready: false
  }
}
  3. When I checked details about such pods (those that were initially missing and then started showing up after step 2 above) from the px/pod script, the information was present.
  4. Missing services did not appear even after restarting the PEM pods.
  5. For such missing services, even when I visited the px/service script and entered the name, no information was shown.
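The "01/01/1970, 05:30:00" create time in point 2 is consistent with the create_time: 0 value in the JSON above being rendered as the Unix epoch in a UTC+05:30 timezone (the offset is an assumption based on the displayed time):

```python
from datetime import datetime, timezone, timedelta

# create_time: 0 is the Unix epoch; in a UTC+05:30 timezone (assumed from
# the "05:30:00" shown in the UI) it renders exactly as the timestamp seen
# in the px/namespace pod list.
offset = timezone(timedelta(hours=5, minutes=30))
rendered = datetime.fromtimestamp(0, tz=offset).strftime("%d/%m/%Y, %H:%M:%S")
print(rendered)  # → 01/01/1970, 05:30:00
```

In other words, the pods restored by a PEM restart come back with an unset creation timestamp rather than their real one.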

nilushancosta avatar Aug 05 '22 13:08 nilushancosta

I am also facing the same issue. Could it be due to the memory limiter? I deployed Pixie onto a GKE cluster by running px deploy --pem_memory_limit=1Gi, and it is connected to Pixie's community cloud.

When I ran kubectl get pods -A | grep Running | wc -l, it said I have 68 running pods, but Pixie only shows 65. Adding to that, my default namespace has about 6 services including the Kubernetes ClusterIP, but only one gets printed when I run the px/namespace script.

isala404 avatar Aug 05 '22 20:08 isala404

I was debugging one of the issues (the "No data available" error explained in the issue description) and noticed the following.

  1. I selected a "Running" pod from the cluster for which Pixie was showing data in the px/pod script. I also monitored the logs of the vizier-pem pod on the node where the above pod was scheduled. When I clicked the Run button and executed the Pixie script from the UI, vizier-pem logged the following, confirming that it received a request.
I20220810 10:15:51.068285 11087 exec.cc:91] Queries in flight: 0
I20220810 10:15:51.068343 24556 exec.cc:51] Executing query: id=7bafe638-5b71-4447-8c05-6c249c4cbcf3
I20220810 10:15:51.097622 24556 exec.cc:63] Completed query: id=7bafe638-5b71-4447-8c05-6c249c4cbcf3
  2. Next I found a pod that was in a "Running" state in the cluster but for which the Pixie UI showed the "No data available" message. When I repeated the same steps for this pod, those log lines were not printed.

  3. Finally I selected a different pod on the same node (as in step 2 above) and repeated the same steps. This time, the Pixie UI displayed the data and the vizier-pem pod printed the 3 log lines about the request.

Based on 2 and 3, it seems that the same vizier-pem pod responds differently for different pods in the cluster.

When checking vizier-query-broker, I noticed logs getting printed both for pods for which data is displayed in Pixie and for pods giving the "No data available" error. The logged lines were as follows:

time="2022-08-10T10:58:44Z" level=info msg="Got passthrough request" type="*cvmsgspb.C2VAPIStreamRequest_ExecReq"
time="2022-08-10T10:58:44Z" level=info msg="Launched query: ddaf6be6-8494-4841-8194-d602745154ae"
time="2022-08-10T10:58:44Z" level=info msg="Running script" query_id=ddaf6be6-8494-4841-8194-d602745154ae

@aimichelle, when I tried the px.display(px._DebugMDState()) script today, it worked, and I checked the data in the kelvin row as you asked. The LastUpdateTS field was not old either; it corresponded to when I ran the debug script. I compared the resource UIDs of the resources currently in the cluster with the UIDs listed in the Kelvin debug data.

  1. Pods listed in the Pixie UI with Phase: Running and ready: true (the ones with the green tick) were found in the Kelvin debug data.
  2. Pods listed in the Pixie UI with Phase: Unknown and ready: false (the ones with the grey circle), as well as pods that were running in the cluster but not listed in the Pixie UI, could not be found in the Kelvin debug data.
  3. There were pods listed in the Kelvin debug data which were not in a running state in the cluster. (These are probably old pods that got replaced over time.)
  4. Services shown in the Pixie UI were found in the Kelvin debug data. At the same time, there were services that were not listed in Pixie but were found in the Kelvin debug data.
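The UID comparison described above amounts to a set difference. A minimal sketch with hypothetical placeholder UIDs (the real sets would come from kubectl and from the px._DebugMDState() output for the kelvin row):

```python
# Hypothetical placeholder UIDs; the real values would come from
# `kubectl get pods -A -o jsonpath='{.items[*].metadata.uid}'` on one side
# and the resource UIDs in the px._DebugMDState() kelvin row on the other.
cluster_uids = {"uid-a", "uid-b", "uid-c"}
kelvin_uids = {"uid-a", "uid-d"}

missing_from_kelvin = cluster_uids - kelvin_uids  # running pods absent from the debug data
stale_in_kelvin = kelvin_uids - cluster_uids      # old pods still present in the debug data
print(sorted(missing_from_kelvin))  # → ['uid-b', 'uid-c']
print(sorted(stale_in_kelvin))      # → ['uid-d']
```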

nilushancosta avatar Aug 10 '22 15:08 nilushancosta

Hi @nilushancosta ! Thanks for sending that debug information over. To confirm: for the pods in a "Running" state in your cluster but for which the Pixie UI had "No data available", did they show up in the Kelvin debug data? If so, do they also show up in their corresponding PEM debug data?

I'll reach out directly to schedule a Zoom call so we can debug some more.

aimichelle avatar Aug 16 '22 18:08 aimichelle

Hi @aimichelle ! I restarted the vizier-pem pods today, and because of that, all the pods in the cluster are now shown in Pixie (but with an Unknown phase). So I don't see the "No data available" error at the moment. I will monitor it for some time and get back to you.

nilushancosta avatar Aug 17 '22 13:08 nilushancosta