
Support for Anthos

Open zackb opened this issue 3 years ago • 0 comments

Describe the bug When installing Pixie on an Anthos Google cluster, the PEM pods fail to initialize because of a DNS resolution failure.

To Reproduce Steps to reproduce the behavior:

  1. Use an Anthos cluster at version v1.19.10-gke.1600, asmv: 1-9-5-asm-2
  2. Run px deploy on the cluster
  3. Notice that all of the PEM pods stay stuck in the Init:0/1 state indefinitely and the deployment times out

Expected behavior Pixie installs like it does on vanilla GKE


Logs

time="2021-07-01T16:56:43Z" level=info msg="Starting service" service=query-broker version=0.7.17+Distribution.801d7bf.20210628182125.1
time="2021-07-01T16:56:43Z" level=info msg="Loading HTTP TLS certs" tlsCA=/certs/ca.crt tlsCertFile=/certs/client.crt tlsKeyFile=/certs/client.key
time="2021-07-01T16:56:43Z" level=info msg="[core] parsed scheme: \"\"" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] scheme \"\" not registered, fallback to default scheme" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] ccResolverWrapper: sending update to cc: {[{vizier-metadata.pl.svc:50400  <nil> 0 <nil>}] <nil> <nil>}" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] ClientConn switching balancer to \"round_robin\"" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Channel switches to new LB policy \"round_robin\"" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Subchannel Connectivity change to CONNECTING" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Subchannel picks a new address \"vizier-metadata.pl.svc:50400\" to connect" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Channel Connectivity change to CONNECTING" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Subchannel Connectivity change to READY" system=system
time="2021-07-01T16:56:43Z" level=info msg="[roundrobin] roundrobinPicker: newPicker called with info: {map[0xc0002257b0:{{vizier-metadata.pl.svc:50400  <nil> 0 <nil>}}]}" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Channel Connectivity change to READY" system=system
time="2021-07-01T16:56:43Z" level=info msg="Loading HTTP TLS certs" tlsCA=/certs/ca.crt tlsCertFile=/certs/client.crt tlsKeyFile=/certs/client.key
time="2021-07-01T16:56:43Z" level=info msg="[core] parsed scheme: \"\"" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] scheme \"\" not registered, fallback to default scheme" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] ccResolverWrapper: sending update to cc: {[{localhost:50300  <nil> 0 <nil>}] <nil> <nil>}" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] ClientConn switching balancer to \"round_robin\"" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Channel switches to new LB policy \"round_robin\"" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Subchannel Connectivity change to CONNECTING" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Subchannel picks a new address \"localhost:50300\" to connect" system=system
time="2021-07-01T16:56:43Z" level=info msg="Loading HTTP TLS certs" tlsCA=/certs/ca.crt tlsCertFile=/certs/server.crt tlsKeyFile=/certs/server.key
time="2021-07-01T16:56:43Z" level=info msg="[core] Channel Connectivity change to CONNECTING" system=system
time="2021-07-01T16:56:43Z" level=warning msg="[core] grpc: addrConn.createTransport failed to connect to {localhost:50300 localhost:50300 <nil> 0 <nil>}. Err: connection error: desc = \"transport: Error while dialing dial tcp [::1]:50300: connect: connection refused\". Reconnecting..." system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Subchannel Connectivity change to TRANSIENT_FAILURE" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Channel Connectivity change to TRANSIENT_FAILURE" system=system
time="2021-07-01T16:56:43Z" level=info msg="Starting HTTP/2 server" addr=":50300"
time="2021-07-01T16:56:44Z" level=info msg="[core] Subchannel Connectivity change to CONNECTING" system=system
time="2021-07-01T16:56:44Z" level=info msg="[core] Subchannel picks a new address \"localhost:50300\" to connect" system=system
time="2021-07-01T16:56:44Z" level=info msg="[core] Subchannel Connectivity change to READY" system=system
time="2021-07-01T16:56:44Z" level=info msg="[roundrobin] roundrobinPicker: newPicker called with info: {map[0xc000390cc0:{{localhost:50300  <nil> 0 <nil>}}]}" system=system
time="2021-07-01T16:56:44Z" level=info msg="[core] Channel Connectivity change to READY" system=system
time="2021-07-01T16:56:46Z" level=info msg="Running script" query_id=cf2d358f-d94e-41b9-8cd0-6fbcedf8fc2c
time="2021-07-01T16:56:46Z" level=info msg="Executed query" duration=3.515362ms query_id=cf2d358f-d94e-41b9-8cd0-6fbcedf8fc2c
time="2021-07-01T16:56:46Z" level=info msg="Received unhealthy heath check result: results not returned on health check for query ID cf2d358f-d94e-41b9-8cd0-6fbcedf8fc2c"
time="2021-07-01T16:56:48Z" level=info msg="Clearing distributed state"

App information:

  • Pixie version gcr.io/pixie-oss/pixie-prod/vizier/pem_image:0.7.17
  • Anthos cluster version v1.19.10-gke.1600, asmv: 1-9-5-asm-2

Additional context While debugging the issue in Slack, we found that the qb-wait container is failing its DNS lookup:

curl: (6) Could not resolve host: vizier-query-broker

After setting hostNetwork: false and dnsPolicy: ClusterFirst in the vizier-pem DaemonSet, everything started and seemed to work fine. It was suggested that this is a hack and probably breaks some Pixie functionality. We are also engaging with Google on this problem and will keep this issue updated with what we learn.
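For anyone trying to reproduce the workaround, the change amounts to roughly the fragment below. This is only a sketch of the two fields named above, placed where they sit in a standard DaemonSet pod template. The pl namespace is an assumption taken from the vizier-metadata.pl.svc address in the logs, and the rest of the vizier-pem spec is assumed to stay as deployed.

```yaml
# Sketch of the workaround described in this issue, not a recommended fix.
# Only the two changed fields are shown; the rest of the vizier-pem
# DaemonSet (containers, volumes, selector, ...) is assumed unchanged.
# Namespace "pl" is an assumption based on vizier-metadata.pl.svc in the logs.
spec:
  template:
    spec:
      hostNetwork: false        # pods get their own network namespace instead of the node's
      dnsPolicy: ClusterFirst   # lookups go to the cluster DNS service first
```

As noted above, this was described as a hack that probably breaks some Pixie functionality, so treat it as a diagnostic aid rather than a supported configuration.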

zackb Jul 02 '21 17:07