
Pixie namespace hard down after GKE node pool upgrade

Open wburnett opened this issue 3 years ago • 1 comments

Describe the bug After a GKE node pool upgrade, the vizier pem pods, the query broker and metadata pods, and the kelvin pod are stuck in the init container phase. This can be fixed by uninstalling and reinstalling px on the cluster in question, but it adds some extra management overhead.

To Reproduce Steps to reproduce the behavior:

  1. Create kube cluster in GKE
  2. Wait for automatic node pool upgrade for your channel
  3. Watch for pod failures as the pods restart

Expected behavior I would expect/hope the pods would come back up gracefully on their own after an upgrade.

Screenshots Screen Shot 2021-06-28 at 11 28 17 AM

Logs I already reinstalled Pixie on the cluster (px delete and px deploy), so the current state won't have any helpful information. I can collect logs the next time this happens.

App information (please complete the following information):

  • Pixie version: 0.5.12+Distribution.a945b68.20210525150215.1

  • K8s cluster version: 1.18.17-gke.1901


wburnett, Jun 28 '21 15:06

Yaxiong from the Pixie team:

Did you by any chance collect the description (kubectl describe pod ...) of the pods, and the logs (kubectl logs ...) when you noticed this issue?
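For the next occurrence, the two commands above could be wrapped in a small helper so the output is saved before reinstalling. This is only a sketch, not an official Pixie tool: the function name `collect_pixie_diagnostics` and the output directory are made up for illustration, and the default namespace `pl` is an assumption about where Vizier is deployed, so pass your own namespace if it differs.

```shell
#!/bin/sh
# Sketch: save `kubectl describe` and `kubectl logs` output for every pod in
# one namespace so it can be attached to the issue.
# Assumptions (not from the thread): Vizier runs in the "pl" namespace, and
# the function name and output layout are hypothetical.

# The body runs in a subshell "( ... )" so its variables do not leak into
# the caller's shell.
collect_pixie_diagnostics() (
  ns="${1:-pl}"
  out_dir="${2:-pixie-diagnostics}"
  mkdir -p "$out_dir"

  # Overview of pod status and node placement.
  kubectl get pods -n "$ns" -o wide > "$out_dir/pods.txt"

  # `-o name` prints one "pod/<name>" entry per line.
  for pod in $(kubectl get pods -n "$ns" -o name); do
    name="${pod#pod/}"
    kubectl describe -n "$ns" "$pod" > "$out_dir/$name.describe.txt"
    # Logs can be unavailable while containers are still waiting, so keep
    # going on failure and store the error text in the same file.
    kubectl logs -n "$ns" "$pod" --all-containers \
      > "$out_dir/$name.logs.txt" 2>&1 || true
  done
)
```

After sourcing the file, run `collect_pixie_diagnostics pl ./pixie-diagnostics` while the pods are still stuck, and before running `px delete`, so the describe output still shows the init container state and events.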

We've been using GKE clusters with node pool scaling and have not noticed this issue before. Granted, we usually have very small clusters (2 nodes), and we often redeploy the cluster (px delete && skaffold ...) to verify the changes we make during development. These differences might explain the difference in behavior.

yzhao1012, Jul 02 '21 16:07