pixie
Pixie is missing data about many pods and services in the cluster
Describe the bug
I encountered an issue in a self-hosted installation where Pixie is missing information about the cluster.
E.g. when I checked the pods in a namespace using the px/namespace script from the UI and the CLI, only 8 pods were shown, but when I checked with kubectl, I saw 90+ pods. Similarly, Pixie showed 6 services whereas kubectl showed 40+.
Also, at times, when I try to view the details of a pod in the Pixie UI, there is no data for it.
E.g. I selected a running pod from the cluster and entered its name in the px/pod script in the UI, but nothing was shown for it. I could only see a "No data available for inbound_requests table" message (all the widgets in px/pod showed the same "no data available" error). The start time I set in the Pixie UI was less than the pod's uptime as well.
I also noticed that autocomplete in the Pixie UI doesn't show the correct resources at times.
E.g. in px/pod, a pod suggested by autocomplete does not exist in the cluster (probably replaced by a new pod).
I noticed the following in the deployment:
- The vizier-metadata and vizier-cloud-connector pods had many restarts. When I checked the pods, the state change reason for the containers was shown as Error.
- At times, newly created pods do appear in Pixie. So this doesn't seem to be a case where Pixie is unable to get any information at all about new pods.
To Reproduce
Not sure how to reproduce this.
Expected behavior
Expected to see all pods and services of the cluster in Pixie.
Logs
Logs containing "error" in vizier-metadata. I am including all the repeated log lines because I want to show that they were printed within a short interval (multiple lines within the same second in some cases):
kubectl logs -f vizier-metadata-0 -n pl | grep -i Error
time="2022-07-13T17:34:32Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:32Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:33Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:34Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:35Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:35Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:35Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:35Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:36Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:36Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:34:39Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:56:40Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T17:56:42Z" level=info msg="Failed to get update version for topic" error="<nil>"
time="2022-07-13T18:04:05Z" level=info msg="Failed to get update version for topic" error="<nil>"
vizier-cloud-connector had the following error repeated multiple times:
time="2022-07-13T18:34:46Z" level=info msg="failed vizier health check, will restart healthcheck" error="rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: INTERNAL_ERROR"
time="2022-07-13T18:34:46Z" level=info msg="failed vizier health check, will restart healthcheck" error="context canceled"
App information (please complete the following information):
- Pixie version: 0.7.14
- K8s cluster version: 1.21.9
- Node Kernel version:
- Browser version: Chrome Version 103.0.5060.114 (Official Build) (x86_64)
Additional context
As per the discussion in the Pixie Slack channel, I deleted the existing vizier-metadata pod and let the newly created one sync with the K8s API. However, pods are still missing in Pixie.
Even in the new pod, the Failed to get update version for topic" error="<nil>" log line is still present, but there have been no restarts even after 130 minutes.
Hi @nilushancosta ! Let's gather more information to help us debug this issue.
The logs you included (Failed to get update version for topic) are expected when the vizier-metadata pod starts up. Do you continue seeing them even after the pod has been running for a while? It could be helpful to send over all of your logs, if you are comfortable with it. You can easily collect all Pixie logs by running px collect-logs. I'm happy to receive them over Slack or email.
Another thing that may be useful is if you go to the scratchpad script in the Pixie UI and run this script:
import px
px.display(px._DebugMDState())
It will basically give you metadata debug information. Looking at the row for the kelvin pod should show all of the metadata resources it has received from the metadata service. Are your resources also missing in that debug information?
@aimichelle thanks for looking into this.
The Failed to get update version for topic error has been printed from time to time even after the pod started.
Also, when I run the above script from the scratchpad, I keep getting a GRPCSinkNode 14 encountered error sending stream error.
Let me collect the logs and send them to you directly.
Also, when this issue first occurred, I deleted the vizier-metadata pod to see if it would recover from this problem. I noticed that this pod stores data on a Persistent Volume, which I did not delete at the time. Do you think deleting the PV would clear any old data and resolve this issue?
I deleted the PV along with the vizier-metadata pod and created a new PV. However, this did not resolve the issue.
I upgraded the Pixie deployment to the latest version. However, the issue still exists.
Here are some more findings about the nature of this problem.
- When pods were missing in Pixie, I checked the nodes where those pods were scheduled. It's not that all pods on a given node were missing: some pods of a node were shown in Pixie while others weren't.
- When I restarted the PEM pod on that particular node (where a missing pod was scheduled), the missing pods started showing up in Pixie. However, there was no green tick next to those pods in the list (I was using px/namespace) and their create time was always shown as "01/01/1970, 05:30:00". The JSON details of one such pod are as follows:
{
  "pod": "NAMESPACE_REDACTED/POD_NAME_REDACTED",
  "rss": 36120055.43812709,
  "vsize": 428474368,
  "create_time": 0,
  "status": {
    "phase": "Unknown",
    "reason": "",
    "message": "",
    "ready": false
  }
}
- When I checked details about such pods (those that were initially missing and then started showing up after step 2 above) with the px/pod script, the information was present.
- Missing services did not appear even after restarting the PEM pods.
- For such missing services, even when I visited the px/service script and entered the name, no information was shown.
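As a side note, the "01/01/1970, 05:30:00" create time reported above is consistent with the create_time: 0 field in the JSON: epoch 0 rendered in a UTC+05:30 local timezone (e.g. IST) displays exactly that way. A minimal check:

```python
# create_time of 0 is the Unix epoch; rendered in a UTC+05:30 zone
# it displays as 01/01/1970, 05:30:00, matching what the UI shows.
from datetime import datetime, timezone, timedelta

ist = timezone(timedelta(hours=5, minutes=30))
shown = datetime.fromtimestamp(0, ist).strftime("%d/%m/%Y, %H:%M:%S")
print(shown)  # → 01/01/1970, 05:30:00
```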
I am also facing the same issue. Could it be due to the memory limiter? I deployed Pixie onto a GKE cluster by running px deploy --pem_memory_limit=1Gi, and it is connected to Pixie's Community Cloud.
When I run kubectl get pods -A | grep Running | wc -l, it says I have 68 running pods, but Pixie only shows 65 pods. In addition, my default namespace has about 6 services including the Kubernetes ClusterIP service, but only one gets printed when I run the px/namespace script.
I was debugging one of the issues (the "No data available" error explained in the issue description) and noticed the following.
- I selected a "Running" pod from the cluster for which Pixie was showing data in the px/pod script. I also monitored the logs of the vizier-pem pod on the node where that pod was scheduled. When I clicked the Run button and executed the Pixie script from the UI, vizier-pem logged the following, confirming that it received a request:
I20220810 10:15:51.068285 11087 exec.cc:91] Queries in flight: 0
I20220810 10:15:51.068343 24556 exec.cc:51] Executing query: id=7bafe638-5b71-4447-8c05-6c249c4cbcf3
I20220810 10:15:51.097622 24556 exec.cc:63] Completed query: id=7bafe638-5b71-4447-8c05-6c249c4cbcf3
- Next I found a pod that was in a "Running" state in the cluster but for which the Pixie UI had the "No data available" message. When I repeated the same steps for this pod, the above log lines were not printed.
- Finally I selected a different pod on the same node (as in step 2 above) and repeated the same steps. This time, the Pixie UI displayed the data and the vizier-pem pod printed the 3 log lines about the request.
Based on 2 and 3, it seems that the same vizier-pem pod responds differently for different pods in the cluster.
When checking vizier-query-broker, I noticed that logs were printed both for pods for which data is displayed in Pixie and for pods for which I get the "No data available" error. The lines that were logged were as follows:
time="2022-08-10T10:58:44Z" level=info msg="Got passthrough request" type="*cvmsgspb.C2VAPIStreamRequest_ExecReq"
time="2022-08-10T10:58:44Z" level=info msg="Launched query: ddaf6be6-8494-4841-8194-d602745154ae"
time="2022-08-10T10:58:44Z" level=info msg="Running script" query_id=ddaf6be6-8494-4841-8194-d602745154ae
@aimichelle when I tried the px.display(px._DebugMDState()) script today, it worked, and I checked the data in the kelvin row like you asked. The LastUpdateTS field was not old either; it corresponded to when I ran the debug script. I compared the resource UIDs of the resources currently in the cluster with the UIDs listed in the kelvin debug data.
- Pods listed in the Pixie UI with Phase: Running and ready: true (the ones with the green tick) were found in the kelvin debug data.
- Pods listed in the Pixie UI with Phase: Unknown and ready: false (the ones with the grey circle), as well as pods that were running in the cluster but not listed in the Pixie UI, could not be found in the kelvin debug data.
- There were pods listed in the kelvin debug data which were not in a running state in the cluster. (These are probably old pods that got replaced over time.)
- Services shown in the Pixie UI were found in the kelvin debug data. At the same time, there were services that were not listed in Pixie but were found in the kelvin debug data.
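The UID comparison described above amounts to a set difference. A minimal sketch, where the UID strings are hypothetical placeholders (in practice cluster_uids would come from the kubectl pod metadata and kelvin_uids from the px._DebugMDState() output):

```python
# Sketch: diff cluster pod UIDs against the UIDs seen in the Kelvin
# debug data. All UID values below are made-up placeholders.
cluster_uids = {"uid-a", "uid-b", "uid-c"}  # pods currently in the cluster
kelvin_uids = {"uid-a", "uid-d"}            # pods Kelvin's metadata has seen

missing_from_kelvin = cluster_uids - kelvin_uids  # never reached Pixie's metadata
stale_in_kelvin = kelvin_uids - cluster_uids      # old pods replaced over time

print(sorted(missing_from_kelvin))  # → ['uid-b', 'uid-c']
print(sorted(stale_in_kelvin))      # → ['uid-d']
```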
Hi @nilushancosta ! Thanks for sending that debug information over. To confirm, for the pods in "Running" state in your cluster, but for which the Pixie UI had "No data available", did they show up in the Kelvin debug data? If so, do they show up in their corresponding PEM debug data?
I'll reach out directly to schedule a Zoom call so we can debug some more.
Hi @aimichelle ! I restarted the vizier-pem pods today, and because of that all the pods in the cluster are now shown in Pixie (but with an Unknown phase). So I don't see the "No data available" error at the moment. I will monitor it for some time and get back to you.