jaeger-operator
Pod jaeger-query is not running
Describe the bug: The jaeger-query Pod is not running.
To Reproduce: Steps to reproduce the behavior:
- I used the Jaeger Operator to deploy Jaeger.
Expected behavior: The Pod runs and stays healthy.
Screenshots
Here are the Pod logs:
2022/04/07 14:49:57 maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
{"level":"info","ts":1649342997.482264,"caller":"flags/service.go:117","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1649342997.482329,"caller":"flags/service.go:123","msg":"Mounting expvar handler on admin server","route":"/debug/vars"}
{"level":"info","ts":1649342997.4824343,"caller":"flags/admin.go:104","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1649342997.48248,"caller":"flags/admin.go:115","msg":"Starting admin HTTP server","http-addr":":16687"}
{"level":"info","ts":1649342997.4824934,"caller":"flags/admin.go:96","msg":"Admin server started","http.host-port":"[::]:16687","health-status":"unavailable"}
{"level":"info","ts":1649342997.5168622,"caller":"config/config.go:205","msg":"Elasticsearch detected","version":7}
{"level":"info","ts":1649342997.5171723,"caller":"channelz/logging.go:50","msg":"[core]original dial target is: \":16685\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.51721,"caller":"channelz/logging.go:50","msg":"[core]dial target \":16685\" parse failed: parse \":16685\": missing protocol scheme","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5172195,"caller":"channelz/logging.go:50","msg":"[core]fallback to scheme \"passthrough\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5172565,"caller":"channelz/logging.go:50","msg":"[core]parsed dial target is: {Scheme:passthrough Authority: Endpoint::16685 URL:{Scheme:passthrough Opaque: User: Host: Path:/:16685 RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5172687,"caller":"channelz/logging.go:50","msg":"[core]Channel authority set to \"localhost:16685\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5173368,"caller":"channelz/logging.go:50","msg":"[core]ccResolverWrapper: sending update to cc: {[{:16685 <nil> <nil> 0 <nil>}] <nil> <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5173564,"caller":"channelz/logging.go:50","msg":"[core]ClientConn switching balancer to \"pick_first\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5173945,"caller":"channelz/logging.go:50","msg":"[core]Channel switches to new LB policy \"pick_first\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5175223,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5176039,"caller":"channelz/logging.go:50","msg":"[core]Subchannel picks a new address \":16685\" to connect","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5178304,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0003e3390, {CONNECTING <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5179038,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5179644,"caller":"app/static_handler.go:182","msg":"UI config path not provided, config file will not be watched"}
{"level":"warn","ts":1649342997.5180821,"caller":"channelz/logging.go:75","msg":"[core]grpc: addrConn.createTransport failed to connect to {:16685 localhost:16685 <nil> <nil> 0 <nil>}. Err: connection error: desc = \"transport: Error while dialing dial tcp :16685: connect: connection refused\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.518182,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.518183,"caller":"app/server.go:200","msg":"Query server started","http_addr":"[::]:16686","grpc_addr":"[::]:16685"}
{"level":"info","ts":1649342997.5182555,"caller":"healthcheck/handler.go:129","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1649342997.5182922,"caller":"app/server.go:283","msg":"Starting GRPC server","port":16685,"addr":":16685"}
{"level":"info","ts":1649342997.5183485,"caller":"app/server.go:264","msg":"Starting HTTP server","port":16686,"addr":":16686"}
{"level":"info","ts":1649342997.518425,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0003e3390, {TRANSIENT_FAILURE connection error: desc = \"transport: Error while dialing dial tcp :16685: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.518441,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342998.5184693,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342998.5185344,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0003e3390, {IDLE connection error: desc = \"transport: Error while dialing dial tcp :16685: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342998.5185556,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1649343000.410391,"caller":"flags/service.go:154","msg":"Shutting down"}
{"level":"info","ts":1649343000.4104307,"caller":"healthcheck/handler.go:129","msg":"Health Check state change","status":"unavailable"}
{"level":"info","ts":1649343000.4105563,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to SHUTDOWN","system":"grpc","grpc_log":true}
{"level":"info","ts":1649343000.4106004,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to SHUTDOWN","system":"grpc","grpc_log":true}
{"level":"info","ts":1649343000.4116342,"caller":"flags/service.go:162","msg":"Shutdown complete"}
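When logs like these get pasted as one long run of concatenated JSON records, it can help to machine-filter them by level. A minimal sketch of doing that (split_records and non_info are helper names I made up, not part of Jaeger):

```python
import json

def split_records(blob: str):
    """Decode a run of concatenated JSON log records into a list of dicts."""
    decoder = json.JSONDecoder()
    records, i = [], 0
    while True:
        i = blob.find("{", i)
        if i == -1:
            return records
        try:
            obj, end = decoder.raw_decode(blob, i)
        except json.JSONDecodeError:
            i += 1          # not a record boundary (e.g. a "{" inside prose); keep scanning
            continue
        records.append(obj)
        i = end

def non_info(records):
    """Keep only records above info level (warn, error, ...)."""
    return [r for r in records if r.get("level") not in (None, "info")]
```

Run over the paste above, this surfaces a single warn record: the query service dialing its own gRPC port (localhost:16685) before that server is listening, which is transient startup noise. The notable event is the clean "Shutting down" about three seconds after startup.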
Version (please complete the following information):
- OS: Kubernetes v1.18.8-aliyun.1, Docker 19.3.5
- Jaeger version: jaeger-operator v1.32.0
- Deployment: Kubernetes
What troubleshooting steps did you try? Try to follow https://www.jaegertracing.io/docs/latest/troubleshooting/ and describe how far you were able to progress and/or which steps did not work.
Additional context: Here is the Jaeger deployment config file:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-prod
spec:
  strategy: production
  storage:
    type: elasticsearch # <1>
    options:
      es:
        server-urls: http://es-cn-09k24d95u00723adh.elasticsearch.aliyuncs.com:9200 # <2>
        index-prefix: jaeger
        use-aliases: true
        # tls: # <3>
        #   ca: /es/certificates/ca.crt
    secretName: jaeger-secret # <4>
    esIndexCleaner:
      enabled: true
      numberOfDays: 7
      schedule: "55 23 * * *"
    esRollover:
      conditions: "{\"max_age\": \"1d\"}"
      readTTL: 120h
      schedule: "55 23 * * *"
    dependencies:
      enabled: true            # turn the job deployment on and off
      schedule: "55 23 * * *"  # cron expression for it to run
      sparkMaster:             # spark master connection string; when empty, Spark runs in embedded local mode
      resources:
        requests:
          memory: 4096Mi
        limits:
          memory: 4096Mi
  collector:
    maxReplicas: 3
    resources:
      limits:
        cpu: 200m
        memory: 256Mi
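The log record "Elasticsearch detected","version":7 suggests the query service could reach the cluster at startup, but when in doubt it is easy to probe the Elasticsearch root endpoint directly from a pod in the same network. A minimal sketch, assuming the server-urls value from the CR above; es_version and probe are hypothetical helper names, not Jaeger APIs:

```python
import json
import urllib.request

def es_version(body: str) -> str:
    """Pull the version number out of an Elasticsearch root-endpoint response."""
    return json.loads(body)["version"]["number"]

def probe(url: str) -> str:
    # GET / on an Elasticsearch node returns a small JSON document
    # describing the node, including the cluster version.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return es_version(resp.read().decode())

# e.g. probe("http://es-cn-09k24d95u00723adh.elasticsearch.aliyuncs.com:9200")
```

If this succeeds, storage connectivity can be ruled out as the reason the Pod keeps exiting.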
Here is the pod information viewed with the kubectl describe pod command:
Controlled By:  ReplicaSet/jaeger-prod-query-5c8d869ff8
Containers:
  jaeger-query:
    Container ID:   docker://4aa2239faa597680eb2490045da76052d31cf82ba689deded3d5d060d8080d6a
    Image:          jaegertracing/jaeger-query:1.32.0
    Image ID:       docker-pullable://jaegertracing/jaeger-query@sha256:075e2e99c2bf1c868994be4973390bf29d0d81db0a8a2673e3de3fad3ff989e4
    Ports:          16686/TCP, 16687/TCP
    Host Ports:     0/TCP, 0/TCP
    Args:
      --es.index-prefix=jaeger
      --es.server-urls=http://es-cn-09k24d95u00723adh.elasticsearch.aliyuncs.com:9200
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 08 Apr 2022 14:37:45 +0800
      Finished:     Fri, 08 Apr 2022 14:37:48 +0800
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:16687/ delay=5s timeout=1s period=15s #success=1 #failure=5
    Readiness:      http-get http://:16687/ delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      jaeger-secret  Secret  Optional: false
    Environment:
      SPAN_STORAGE_TYPE:     elasticsearch
      METRICS_STORAGE_TYPE:
      JAEGER_DISABLED:       false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from jaeger-prod-token-62v7t (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  jaeger-prod-token-62v7t:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  jaeger-prod-token-62v7t
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  4s    default-scheduler  Successfully assigned default/jaeger-prod-query-5c8d869ff8-5nbfl to cn-hangzhou.172.16.159.179
  Normal  Pulled     4s    kubelet            Container image "jaegertracing/jaeger-query:1.32.0" already present on machine
  Normal  Created    4s    kubelet            Created container jaeger-query
  Normal  Started    4s    kubelet            Started container jaeger-query
  Normal  Killing    1s    kubelet            Stopping container jaeger-query
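One reading of this output: the container started at 14:37:45 and finished at 14:37:48 with exit code 0, i.e. it lived about three seconds and exited cleanly. That is far too fast for the liveness probe (5s initial delay, 15s period, 5 failures) to have killed it, which lines up with the "Killing ... Stopping container" event, i.e. the kubelet stopped the container rather than the process crashing. A small sketch of that arithmetic, with the timestamps copied from above:

```python
from datetime import datetime

FMT = "%a, %d %b %Y %H:%M:%S %z"
started  = datetime.strptime("Fri, 08 Apr 2022 14:37:45 +0800", FMT)
finished = datetime.strptime("Fri, 08 Apr 2022 14:37:48 +0800", FMT)
lifetime = (finished - started).total_seconds()

# Liveness probe: delay=5s, period=15s, #failure=5 -- the earliest
# probe-driven kill would come after roughly 5 + 4 * 15 = 65 seconds.
earliest_probe_kill = 5 + 4 * 15

print(lifetime)                        # 3.0
print(lifetime < earliest_probe_kill)  # True: the probes did not kill it
```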
Hello,
Could you attach your CRD?
Hi, I've added the configuration content. After a night, the Pod would occasionally stay running, but after a while the process would exit and the Pod would restart again.
After downgrading step by step from the latest version back to 1.24, it finally runs stably.
I hit this issue as well. I am running all-in-one Jaeger on GKE Autopilot. It works in a local Docker Desktop Kubernetes environment.
Debug logs from short-lived pod:
2022-04-09T17:44:41.338740516Z | info | Mounting metrics handler on admin server
2022-04-09T17:44:41.338896606Z | info | Mounting expvar handler on admin server
2022-04-09T17:44:41.339399976Z | info | Mounting health check on admin server
2022-04-09T17:44:41.339600726Z | info | Starting admin HTTP server
2022-04-09T17:44:41.339692019Z | info | Admin server started
2022-04-09T17:44:41.341467906Z | info | Memory storage initialized
2022-04-09T17:44:41.341792899Z | info | Loading sampling strategies
2022-04-09T17:44:41.353948085Z | info | Starting jaeger-collector gRPC server
2022-04-09T17:44:41.354133106Z | info | Starting jaeger-collector HTTP server
2022-04-09T17:44:41.354490409Z | info | Listening for Zipkin HTTP traffic
2022-04-09T17:44:41.433798474Z | info | Agent requested insecure grpc connection to collector(s)
2022-04-09T17:44:41.433846613Z | info | [core]original dial target is: ":14250"
2022-04-09T17:44:41.433943841Z | info | [core]dial target ":14250" parse failed: parse ":14250": missing protocol scheme
2022-04-09T17:44:41.433969048Z | info | [core]fallback to scheme "passthrough"
2022-04-09T17:44:41.433977083Z | info | [core]parsed dial target is: {Scheme:passthrough Authority: Endpoint::14250 URL:{Scheme:passthrough Opaque: User: Host: Path:/:14250 RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022-04-09T17:44:41.433984656Z | info | [core]Channel authority set to "localhost:14250"
2022-04-09T17:44:41.433998974Z | info | [core]ccResolverWrapper: sending update to cc: {[{:14250 <nil> <nil> 0 <nil>}] <nil> <nil>}
2022-04-09T17:44:41.434098024Z | info | [core]ClientConn switching balancer to "round_robin"
2022-04-09T17:44:41.434111125Z | info | [core]Channel switches to new LB policy "round_robin"
2022-04-09T17:44:41.434187884Z | info | [balancer]base.baseBalancer: got new ClientConn state: {{[{:14250 <nil> <nil> 0 <nil>}] <nil> <nil>} <nil>}
2022-04-09T17:44:41.435096116Z | info | Starting agent
2022-04-09T17:44:41.435183639Z | info | Archive storage not created
2022-04-09T17:44:41.435194740Z | info | Archive storage not initialized
2022-04-09T17:44:41.436670659Z | info | [core]original dial target is: ":16685"
2022-04-09T17:44:41.436743137Z | info | [core]dial target ":16685" parse failed: parse ":16685": missing protocol scheme
2022-04-09T17:44:41.436755457Z | info | [core]fallback to scheme "passthrough"
2022-04-09T17:44:41.436762415Z | info | [core]parsed dial target is: {Scheme:passthrough Authority: Endpoint::16685 URL:{Scheme:passthrough Opaque: User: Host: Path:/:16685 RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022-04-09T17:44:41.436770417Z | info | [core]Channel authority set to "localhost:16685"
2022-04-09T17:44:41.436787323Z | info | [core]ccResolverWrapper: sending update to cc: {[{:16685 <nil> <nil> 0 <nil>}] <nil> <nil>}
2022-04-09T17:44:41.436794551Z | info | [core]ClientConn switching balancer to "pick_first"
2022-04-09T17:44:41.436800440Z | info | [core]Channel switches to new LB policy "pick_first"
2022-04-09T17:44:41.436816849Z | info | UI config path not provided, config file will not be watched
2022-04-09T17:44:41.436824065Z | info | Query server started
2022-04-09T17:44:41.436831497Z | info | Health Check state change
2022-04-09T17:44:41.436838347Z | info | Starting GRPC server
2022-04-09T17:44:41.436954273Z | info | [core]Subchannel Connectivity change to CONNECTING
2022-04-09T17:44:41.436966942Z | info | [core]Subchannel picks a new address ":14250" to connect
2022-04-09T17:44:41.436974953Z | info | Checking connection to collector
2022-04-09T17:44:41.436981043Z | info | Agent collector connection state change
2022-04-09T17:44:41.437067350Z | info | Starting jaeger-agent HTTP server
2022-04-09T17:44:41.437216740Z | info | [core]Subchannel Connectivity change to CONNECTING
2022-04-09T17:44:41.437281457Z | info | [core]Subchannel picks a new address ":16685" to connect
2022-04-09T17:44:41.437415561Z | info | Starting HTTP server
2022-04-09T17:44:41.437446617Z | info | [balancer]base.baseBalancer: handle SubConn state change: 0xc0002fd530, CONNECTING
2022-04-09T17:44:41.437492715Z | info | [core]Channel Connectivity change to CONNECTING
2022-04-09T17:44:41.437556417Z | info | Agent collector connection state change
2022-04-09T17:44:41.437692674Z | info | [core]pickfirstBalancer: UpdateSubConnState: 0xc00069d370, {CONNECTING <nil>}
2022-04-09T17:44:41.437712392Z | info | [core]Channel Connectivity change to CONNECTING
2022-04-09T17:44:41.438099601Z | info | [core]Subchannel Connectivity change to READY
2022-04-09T17:44:41.438118438Z | info | [balancer]base.baseBalancer: handle SubConn state change: 0xc0002fd530, READY
2022-04-09T17:44:41.438208619Z | info | [roundrobin]roundrobinPicker: Build called with info: {map[0xc0002fd530:{{:14250 <nil> <nil> 0 <nil>}}]}
2022-04-09T17:44:41.438223493Z | info | [core]Channel Connectivity change to READY
2022-04-09T17:44:41.438230712Z | info | Agent collector connection state change
2022-04-09T17:44:41.438695906Z | info | [core]Subchannel Connectivity change to READY
2022-04-09T17:44:41.438714964Z | info | [core]pickfirstBalancer: UpdateSubConnState: 0xc00069d370, {READY <nil>}
2022-04-09T17:44:41.438731255Z | info | [core]Channel Connectivity change to READY
2022-04-09T17:44:42.914397341Z | info | Shutting down
2022-04-09T17:44:42.914474372Z | info | Health Check state change
2022-04-09T17:44:42.914488672Z | info | shutting down agent's HTTP server
2022-04-09T17:44:42.914496768Z | info | agent's http server exiting
2022-04-09T17:44:42.916492294Z | info | [core]Channel Connectivity change to SHUTDOWN
2022-04-09T17:44:42.916572400Z | info | [core]Subchannel Connectivity change to SHUTDOWN
2022-04-09T17:44:42.916581529Z | info | Agent collector connection state change
2022-04-09T17:44:42.916589135Z | info | [transport]transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022-04-09T17:44:42.916600769Z | info | Health Check state change
2022-04-09T17:44:42.916611352Z | info | [transport]transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022-04-09T17:44:42.917655090Z | info | Health Check state change
2022-04-09T17:44:42.919437631Z | warn | [transport]transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:16685->127.0.0.1:41504: use of closed network connection
2022-04-09T17:44:42.919477942Z | debug | closing tracer
2022-04-09T17:44:42.919486974Z | debug | closing reporter
2022-04-09T17:44:42.919495595Z | info | [core]Channel Connectivity change to SHUTDOWN
2022-04-09T17:44:42.919514005Z | info | [core]Subchannel Connectivity change to SHUTDOWN
2022-04-09T17:44:42.919521825Z | info | [transport]transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022-04-09T17:44:42.919531024Z | info | [transport]transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022-04-09T17:44:42.923884407Z | info | Shutdown complete
This seems to be an issue with the all-in-one image in general, not specific to jaeger-operator.
Try adding port 16685:
docker run -d --name jaeger \
-e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
-e COLLECTOR_OTLP_ENABLED=true \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 16685:16685 \
-p 4317:4317 \
-p 4318:4318 \
-p 14250:14250 \
-p 14268:14268 \
-p 14269:14269 \
-p 9411:9411 \
jaegertracing/all-in-one:1.38
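After exposing 16685, a quick TCP check confirms the gRPC port actually accepts connections (a minimal sketch; port_open is a name I made up):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("localhost", 16685) after the docker run above
```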
Works for me (limited testing).
The 16685 port is now exposed properly when using the operator. Closing this since it should not be a problem anymore.