Support for Anthos
Describe the bug
When installing Pixie on a Google Anthos cluster, the PEM pods fail to initialize because of a DNS problem.
To Reproduce
Steps to reproduce the behavior:
- Anthos cluster version v1.19.10-gke.1600 asmv : 1-9-5-asm-2
- Run px deploy on the cluster
- Notice all of the PEM pods are stuck in the Init:0/1 state forever and the deployment times out
Expected behavior
Pixie installs like it does on vanilla GKE.
Logs
time="2021-07-01T16:56:43Z" level=info msg="Starting service" service=query-broker version=0.7.17+Distribution.801d7bf.20210628182125.1
time="2021-07-01T16:56:43Z" level=info msg="Loading HTTP TLS certs" tlsCA=/certs/ca.crt tlsCertFile=/certs/client.crt tlsKeyFile=/certs/client.key
time="2021-07-01T16:56:43Z" level=info msg="[core] parsed scheme: \"\"" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] scheme \"\" not registered, fallback to default scheme" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] ccResolverWrapper: sending update to cc: {[{vizier-metadata.pl.svc:50400 <nil> 0 <nil>}] <nil> <nil>}" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] ClientConn switching balancer to \"round_robin\"" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Channel switches to new LB policy \"round_robin\"" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Subchannel Connectivity change to CONNECTING" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Subchannel picks a new address \"vizier-metadata.pl.svc:50400\" to connect" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Channel Connectivity change to CONNECTING" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Subchannel Connectivity change to READY" system=system
time="2021-07-01T16:56:43Z" level=info msg="[roundrobin] roundrobinPicker: newPicker called with info: {map[0xc0002257b0:{{vizier-metadata.pl.svc:50400 <nil> 0 <nil>}}]}" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Channel Connectivity change to READY" system=system
time="2021-07-01T16:56:43Z" level=info msg="Loading HTTP TLS certs" tlsCA=/certs/ca.crt tlsCertFile=/certs/client.crt tlsKeyFile=/certs/client.key
time="2021-07-01T16:56:43Z" level=info msg="[core] parsed scheme: \"\"" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] scheme \"\" not registered, fallback to default scheme" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] ccResolverWrapper: sending update to cc: {[{localhost:50300 <nil> 0 <nil>}] <nil> <nil>}" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] ClientConn switching balancer to \"round_robin\"" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Channel switches to new LB policy \"round_robin\"" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Subchannel Connectivity change to CONNECTING" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Subchannel picks a new address \"localhost:50300\" to connect" system=system
time="2021-07-01T16:56:43Z" level=info msg="Loading HTTP TLS certs" tlsCA=/certs/ca.crt tlsCertFile=/certs/server.crt tlsKeyFile=/certs/server.key
time="2021-07-01T16:56:43Z" level=info msg="[core] Channel Connectivity change to CONNECTING" system=system
time="2021-07-01T16:56:43Z" level=warning msg="[core] grpc: addrConn.createTransport failed to connect to {localhost:50300 localhost:50300 <nil> 0 <nil>}. Err: connection error: desc = \"transport: Error while dialing dial tcp [::1]:50300: connect: connection refused\". Reconnecting..." system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Subchannel Connectivity change to TRANSIENT_FAILURE" system=system
time="2021-07-01T16:56:43Z" level=info msg="[core] Channel Connectivity change to TRANSIENT_FAILURE" system=system
time="2021-07-01T16:56:43Z" level=info msg="Starting HTTP/2 server" addr=":50300"
time="2021-07-01T16:56:44Z" level=info msg="[core] Subchannel Connectivity change to CONNECTING" system=system
time="2021-07-01T16:56:44Z" level=info msg="[core] Subchannel picks a new address \"localhost:50300\" to connect" system=system
time="2021-07-01T16:56:44Z" level=info msg="[core] Subchannel Connectivity change to READY" system=system
time="2021-07-01T16:56:44Z" level=info msg="[roundrobin] roundrobinPicker: newPicker called with info: {map[0xc000390cc0:{{localhost:50300 <nil> 0 <nil>}}]}" system=system
time="2021-07-01T16:56:44Z" level=info msg="[core] Channel Connectivity change to READY" system=system
time="2021-07-01T16:56:46Z" level=info msg="Running script" query_id=cf2d358f-d94e-41b9-8cd0-6fbcedf8fc2c
time="2021-07-01T16:56:46Z" level=info msg="Executed query" duration=3.515362ms query_id=cf2d358f-d94e-41b9-8cd0-6fbcedf8fc2c
time="2021-07-01T16:56:46Z" level=info msg="Received unhealthy heath check result: results not returned on health check for query ID cf2d358f-d94e-41b9-8cd0-6fbcedf8fc2c"
time="2021-07-01T16:56:48Z" level=info msg="Clearing distributed state"
App information:
- Pixie version gcr.io/pixie-oss/pixie-prod/vizier/pem_image:0.7.17
- Anthos cluster version v1.19.10-gke.1600 asmv : 1-9-5-asm-2
Additional context
While debugging the issue in Slack, we found that the qb-wait container is failing its DNS lookup:
curl: (6) Could not resolve host: vizier-query-broker
After we changed the vizier-pem daemon set to use hostNetwork: false and dnsPolicy: ClusterFirst, everything started and seemed to work fine. It was suggested that this is a hack and probably breaks some Pixie functionality.
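For reference, the workaround can be sketched as the following excerpt of the vizier-pem DaemonSet. This is a minimal illustration, not the full manifest that px deploy generates: only the two fields named in this issue are shown, and the pl namespace is an assumption inferred from the vizier-metadata.pl.svc address in the logs.

```yaml
# Hypothetical excerpt of the vizier-pem DaemonSet after the workaround.
# All other fields (containers, volumes, selectors, ...) are omitted here.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vizier-pem
  namespace: pl            # assumed from "vizier-metadata.pl.svc" in the logs
spec:
  template:
    spec:
      hostNetwork: false   # changed as described in this issue
      dnsPolicy: ClusterFirst
```

The Kubernetes documentation also describes a ClusterFirstWithHostNet DNS policy intended for pods that run with hostNetwork: true. Whether that setting would avoid the lookup failure on Anthos while keeping host networking was not tested here.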
We are also engaging with Google on this problem and will keep this issue updated with that information.