
Pod jaeger-query is not running

csh995426531 opened this issue 2 years ago · 4 comments

Describe the bug: the jaeger-query Pod is not running.

To Reproduce. Steps to reproduce the behavior:

  1. Deploy Jaeger with the Jaeger Operator (the configuration is under "Additional context" below).

Expected behavior: the jaeger-query Pod starts and keeps running.

Screenshots: here are the Pod logs.

2022/04/07 14:49:57 maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
{"level":"info","ts":1649342997.482264,"caller":"flags/service.go:117","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1649342997.482329,"caller":"flags/service.go:123","msg":"Mounting expvar handler on admin server","route":"/debug/vars"}
{"level":"info","ts":1649342997.4824343,"caller":"flags/admin.go:104","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1649342997.48248,"caller":"flags/admin.go:115","msg":"Starting admin HTTP server","http-addr":":16687"}
{"level":"info","ts":1649342997.4824934,"caller":"flags/admin.go:96","msg":"Admin server started","http.host-port":"[::]:16687","health-status":"unavailable"}
{"level":"info","ts":1649342997.5168622,"caller":"config/config.go:205","msg":"Elasticsearch detected","version":7}
{"level":"info","ts":1649342997.5171723,"caller":"channelz/logging.go:50","msg":"[core]original dial target is: \":16685\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.51721,"caller":"channelz/logging.go:50","msg":"[core]dial target \":16685\" parse failed: parse \":16685\": missing protocol scheme","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5172195,"caller":"channelz/logging.go:50","msg":"[core]fallback to scheme \"passthrough\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5172565,"caller":"channelz/logging.go:50","msg":"[core]parsed dial target is: {Scheme:passthrough Authority: Endpoint::16685 URL:{Scheme:passthrough Opaque: User: Host: Path:/:16685 RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5172687,"caller":"channelz/logging.go:50","msg":"[core]Channel authority set to \"localhost:16685\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5173368,"caller":"channelz/logging.go:50","msg":"[core]ccResolverWrapper: sending update to cc: {[{:16685 <nil> <nil> 0 <nil>}] <nil> <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5173564,"caller":"channelz/logging.go:50","msg":"[core]ClientConn switching balancer to \"pick_first\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5173945,"caller":"channelz/logging.go:50","msg":"[core]Channel switches to new LB policy \"pick_first\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5175223,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5176039,"caller":"channelz/logging.go:50","msg":"[core]Subchannel picks a new address \":16685\" to connect","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5178304,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0003e3390, {CONNECTING <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5179038,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.5179644,"caller":"app/static_handler.go:182","msg":"UI config path not provided, config file will not be watched"}
{"level":"warn","ts":1649342997.5180821,"caller":"channelz/logging.go:75","msg":"[core]grpc: addrConn.createTransport failed to connect to {:16685 localhost:16685 <nil> <nil> 0 <nil>}. Err: connection error: desc = \"transport: Error while dialing dial tcp :16685: connect: connection refused\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.518182,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.518183,"caller":"app/server.go:200","msg":"Query server started","http_addr":"[::]:16686","grpc_addr":"[::]:16685"}
{"level":"info","ts":1649342997.5182555,"caller":"healthcheck/handler.go:129","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1649342997.5182922,"caller":"app/server.go:283","msg":"Starting GRPC server","port":16685,"addr":":16685"}
{"level":"info","ts":1649342997.5183485,"caller":"app/server.go:264","msg":"Starting HTTP server","port":16686,"addr":":16686"}
{"level":"info","ts":1649342997.518425,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0003e3390, {TRANSIENT_FAILURE connection error: desc = \"transport: Error while dialing dial tcp :16685: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342997.518441,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342998.5184693,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342998.5185344,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0003e3390, {IDLE connection error: desc = \"transport: Error while dialing dial tcp :16685: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1649342998.5185556,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1649343000.410391,"caller":"flags/service.go:154","msg":"Shutting down"}
{"level":"info","ts":1649343000.4104307,"caller":"healthcheck/handler.go:129","msg":"Health Check state change","status":"unavailable"}
{"level":"info","ts":1649343000.4105563,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to SHUTDOWN","system":"grpc","grpc_log":true}
{"level":"info","ts":1649343000.4106004,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to SHUTDOWN","system":"grpc","grpc_log":true}
{"level":"info","ts":1649343000.4116342,"caller":"flags/service.go:162","msg":"Shutdown complete"}
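
Two things stand out in this log. The connection-refused errors on :16685 are the query service dialing its own gRPC port before the listener is up, a transient startup race that settles on its own. The decisive line is "Shutting down" about three seconds after startup: the process exits cleanly on request, which points to something outside the container stopping it rather than a crash. A minimal diagnostic sketch, reusing the pod name from this issue; the operator deployment name and namespace are assumptions, substitute your own:

# Confirm what killed the container (look for Killing events on the query Pod):
kubectl get events --sort-by=.lastTimestamp | grep jaeger-prod-query

# Logs from the previous, already-terminated container instance:
kubectl logs jaeger-prod-query-5c8d869ff8-5nbfl --previous

# Check whether the operator keeps re-reconciling this Jaeger instance
# (deployment name and namespace are placeholders -- adjust to your install):
kubectl logs deployment/jaeger-operator -n observability | grep -i jaeger-prod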

Version (please complete the following information):

  • OS: Kubernetes v1.18.8-aliyun.1, Docker 19.3.5
  • Jaeger version: jaeger-operator v1.32.0
  • Deployment: Kubernetes

What troubleshooting steps did you try? Try to follow https://www.jaegertracing.io/docs/latest/troubleshooting/ and describe how far you were able to progress and/or which steps did not work.

Additional context: this is the Jaeger deployment config file.

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-prod
spec:
  strategy: production
  storage:
    type: elasticsearch # <1>
    options:
      es:
        server-urls: http://es-cn-09k24d95u00723adh.elasticsearch.aliyuncs.com:9200 # <2>
        index-prefix: jaeger
        use-aliases: true
        # tls: # <3>
        #   ca: /es/certificates/ca.crt
    secretName: jaeger-secret # <4>
    esIndexCleaner:
      enabled: true 
      numberOfDays: 7
      schedule: "55 23 * * *"
    esRollover:
      conditions: "{\"max_age\": \"1d\"}"
      readTTL: 120h
      schedule: "55 23 * * *"
    dependencies:
      enabled: true                                 # turn the job deployment on and off
      schedule: "55 23 * * *"                       # cron expression for it to run
      sparkMaster:                                  # spark master connection string, when empty spark runs in embedded local mode
      resources:
        requests:
          memory: 4096Mi
        limits:
          memory: 4096Mi
  collector:
    maxReplicas: 3
    resources:
      limits:
        cpu: 200m
        memory: 256Mi
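
The CR above references secretName: jaeger-secret; the operator injects that Secret into the query and collector Pods as environment variables (visible in the describe output further down), and for an external Elasticsearch the documented keys are ES_USERNAME and ES_PASSWORD. A minimal sketch of creating it; the credential values are placeholders:

# Secret referenced by spec.storage.secretName; values are placeholders:
kubectl create secret generic jaeger-secret \
  --from-literal=ES_USERNAME=elastic \
  --from-literal=ES_PASSWORD=changeme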

Here is the Pod information, viewed with the kubectl describe pod command:

Controlled By:  ReplicaSet/jaeger-prod-query-5c8d869ff8
Containers:
  jaeger-query:
    Container ID:  docker://4aa2239faa597680eb2490045da76052d31cf82ba689deded3d5d060d8080d6a
    Image:         jaegertracing/jaeger-query:1.32.0
    Image ID:      docker-pullable://jaegertracing/jaeger-query@sha256:075e2e99c2bf1c868994be4973390bf29d0d81db0a8a2673e3de3fad3ff989e4
    Ports:         16686/TCP, 16687/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --es.index-prefix=jaeger
      --es.server-urls=http://es-cn-09k24d95u00723adh.elasticsearch.aliyuncs.com:9200
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 08 Apr 2022 14:37:45 +0800
      Finished:     Fri, 08 Apr 2022 14:37:48 +0800
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:16687/ delay=5s timeout=1s period=15s #success=1 #failure=5
    Readiness:      http-get http://:16687/ delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      jaeger-secret  Secret  Optional: false
    Environment:
      SPAN_STORAGE_TYPE:     elasticsearch
      METRICS_STORAGE_TYPE:
      JAEGER_DISABLED:       false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from jaeger-prod-token-62v7t (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  jaeger-prod-token-62v7t:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  jaeger-prod-token-62v7t
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  4s    default-scheduler  Successfully assigned default/jaeger-prod-query-5c8d869ff8-5nbfl to cn-hangzhou.172.16.159.179
  Normal  Pulled     4s    kubelet            Container image "jaegertracing/jaeger-query:1.32.0" already present on machine
  Normal  Created    4s    kubelet            Created container jaeger-query
  Normal  Started    4s    kubelet            Started container jaeger-query
  Normal  Killing    1s    kubelet            Stopping container jaeger-query
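
Note what this describe output says: the container exited with code 0 ("Completed") and the kubelet reports Killing one second after Started, so the process did not crash; something deleted or replaced the Pod almost immediately, which usually means the owning Deployment or ReplicaSet is being rewritten, for example by the operator re-reconciling the CR in a loop. A sketch for spotting that churn; the app.kubernetes.io/instance label is an assumption, adjust to the labels on your resources:

# Watch for ReplicaSet churn: a new pod-template hash every few seconds means
# something keeps rewriting the Deployment (label selector is an assumption):
kubectl get replicasets -l app.kubernetes.io/instance=jaeger-prod -w

# A rising generation confirms the Deployment spec itself is being updated:
kubectl get deployment jaeger-prod-query -o jsonpath='{.metadata.generation}'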

csh995426531 commented Apr 07 '22 15:04

Hello,

Could you attach your CRD?

rubenvp8510 commented Apr 08 '22 00:04

> Hello,
>
> Could you attach your CRD?

Hi, I've added the configuration content above. Overnight the Pod would occasionally reach Running, but after a while the process would exit and the Pod would restart again.

csh995426531 commented Apr 08 '22 02:04

After downgrading release by release from the latest version, it finally runs stably on 1.24.

csh995426531 commented Apr 09 '22 13:04

I hit this issue as well. I am running all-in-one Jaeger on GKE Autopilot; the same setup works in a local Docker Desktop Kubernetes environment.

Debug logs from the short-lived pod:

2022-04-09T17:44:41.338740516Z | info | Mounting metrics handler on admin server
2022-04-09T17:44:41.338896606Z | info | Mounting expvar handler on admin server
2022-04-09T17:44:41.339399976Z | info | Mounting health check on admin server
2022-04-09T17:44:41.339600726Z | info | Starting admin HTTP server
2022-04-09T17:44:41.339692019Z | info | Admin server started
2022-04-09T17:44:41.341467906Z | info | Memory storage initialized
2022-04-09T17:44:41.341792899Z | info | Loading sampling strategies
2022-04-09T17:44:41.353948085Z | info | Starting jaeger-collector gRPC server
2022-04-09T17:44:41.354133106Z | info | Starting jaeger-collector HTTP server
2022-04-09T17:44:41.354490409Z | info | Listening for Zipkin HTTP traffic
2022-04-09T17:44:41.433798474Z | info | Agent requested insecure grpc connection to collector(s)
2022-04-09T17:44:41.433846613Z | info | [core]original dial target is: ":14250"
2022-04-09T17:44:41.433943841Z | info | [core]dial target ":14250" parse failed: parse ":14250": missing protocol scheme
2022-04-09T17:44:41.433969048Z | info | [core]fallback to scheme "passthrough"
2022-04-09T17:44:41.433977083Z | info | [core]parsed dial target is: {Scheme:passthrough Authority: Endpoint::14250 URL:{Scheme:passthrough Opaque: User: Host: Path:/:14250 RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022-04-09T17:44:41.433984656Z | info | [core]Channel authority set to "localhost:14250"
2022-04-09T17:44:41.433998974Z | info | [core]ccResolverWrapper: sending update to cc: {[{:14250  <nil> <nil> 0 <nil>}] <nil> <nil>}
2022-04-09T17:44:41.434098024Z | info | [core]ClientConn switching balancer to "round_robin"
2022-04-09T17:44:41.434111125Z | info | [core]Channel switches to new LB policy "round_robin"
2022-04-09T17:44:41.434187884Z | info | [balancer]base.baseBalancer: got new ClientConn state: {{[{:14250  <nil> <nil> 0 <nil>}] <nil> <nil>} <nil>}
2022-04-09T17:44:41.435096116Z | info | Starting agent
2022-04-09T17:44:41.435183639Z | info | Archive storage not created
2022-04-09T17:44:41.435194740Z | info | Archive storage not initialized
2022-04-09T17:44:41.436670659Z | info | [core]original dial target is: ":16685"
2022-04-09T17:44:41.436743137Z | info | [core]dial target ":16685" parse failed: parse ":16685": missing protocol scheme
2022-04-09T17:44:41.436755457Z | info | [core]fallback to scheme "passthrough"
2022-04-09T17:44:41.436762415Z | info | [core]parsed dial target is: {Scheme:passthrough Authority: Endpoint::16685 URL:{Scheme:passthrough Opaque: User: Host: Path:/:16685 RawPath: ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022-04-09T17:44:41.436770417Z | info | [core]Channel authority set to "localhost:16685"
2022-04-09T17:44:41.436787323Z | info | [core]ccResolverWrapper: sending update to cc: {[{:16685  <nil> <nil> 0 <nil>}] <nil> <nil>}
2022-04-09T17:44:41.436794551Z | info | [core]ClientConn switching balancer to "pick_first"
2022-04-09T17:44:41.436800440Z | info | [core]Channel switches to new LB policy "pick_first"
2022-04-09T17:44:41.436816849Z | info | UI config path not provided, config file will not be watched
2022-04-09T17:44:41.436824065Z | info | Query server started
2022-04-09T17:44:41.436831497Z | info | Health Check state change
2022-04-09T17:44:41.436838347Z | info | Starting GRPC server
2022-04-09T17:44:41.436954273Z | info | [core]Subchannel Connectivity change to CONNECTING
2022-04-09T17:44:41.436966942Z | info | [core]Subchannel picks a new address ":14250" to connect
2022-04-09T17:44:41.436974953Z | info | Checking connection to collector
2022-04-09T17:44:41.436981043Z | info | Agent collector connection state change
2022-04-09T17:44:41.437067350Z | info | Starting jaeger-agent HTTP server
2022-04-09T17:44:41.437216740Z | info | [core]Subchannel Connectivity change to CONNECTING
2022-04-09T17:44:41.437281457Z | info | [core]Subchannel picks a new address ":16685" to connect
2022-04-09T17:44:41.437415561Z | info | Starting HTTP server
2022-04-09T17:44:41.437446617Z | info | [balancer]base.baseBalancer: handle SubConn state change: 0xc0002fd530, CONNECTING
2022-04-09T17:44:41.437492715Z | info | [core]Channel Connectivity change to CONNECTING
2022-04-09T17:44:41.437556417Z | info | Agent collector connection state change
2022-04-09T17:44:41.437692674Z | info | [core]pickfirstBalancer: UpdateSubConnState: 0xc00069d370, {CONNECTING <nil>}
2022-04-09T17:44:41.437712392Z | info | [core]Channel Connectivity change to CONNECTING
2022-04-09T17:44:41.438099601Z | info | [core]Subchannel Connectivity change to READY
2022-04-09T17:44:41.438118438Z | info | [balancer]base.baseBalancer: handle SubConn state change: 0xc0002fd530, READY
2022-04-09T17:44:41.438208619Z | info | [roundrobin]roundrobinPicker: Build called with info: {map[0xc0002fd530:{{:14250  <nil> <nil> 0 <nil>}}]}
2022-04-09T17:44:41.438223493Z | info | [core]Channel Connectivity change to READY
2022-04-09T17:44:41.438230712Z | info | Agent collector connection state change
2022-04-09T17:44:41.438695906Z | info | [core]Subchannel Connectivity change to READY
2022-04-09T17:44:41.438714964Z | info | [core]pickfirstBalancer: UpdateSubConnState: 0xc00069d370, {READY <nil>}
2022-04-09T17:44:41.438731255Z | info | [core]Channel Connectivity change to READY
2022-04-09T17:44:42.914397341Z | info | Shutting down
2022-04-09T17:44:42.914474372Z | info | Health Check state change
2022-04-09T17:44:42.914488672Z | info | shutting down agent's HTTP server
2022-04-09T17:44:42.914496768Z | info | agent's http server exiting
2022-04-09T17:44:42.916492294Z | info | [core]Channel Connectivity change to SHUTDOWN
2022-04-09T17:44:42.916572400Z | info | [core]Subchannel Connectivity change to SHUTDOWN
2022-04-09T17:44:42.916581529Z | info | Agent collector connection state change
2022-04-09T17:44:42.916589135Z | info | [transport]transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022-04-09T17:44:42.916600769Z | info | Health Check state change
2022-04-09T17:44:42.916611352Z | info | [transport]transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022-04-09T17:44:42.917655090Z | info | Health Check state change
2022-04-09T17:44:42.919437631Z | warn | [transport]transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:16685->127.0.0.1:41504: use of closed network connection
2022-04-09T17:44:42.919477942Z | debug | closing tracer
2022-04-09T17:44:42.919486974Z | debug | closing reporter
2022-04-09T17:44:42.919495595Z | info | [core]Channel Connectivity change to SHUTDOWN
2022-04-09T17:44:42.919514005Z | info | [core]Subchannel Connectivity change to SHUTDOWN
2022-04-09T17:44:42.919521825Z | info | [transport]transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022-04-09T17:44:42.919531024Z | info | [transport]transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022-04-09T17:44:42.923884407Z | info | Shutdown complete

cathei commented Apr 09 '22 17:04

This seems to be an issue with the all-in-one image in general, not specific to jaeger-operator.

SOF3 commented Oct 06 '22 08:10

Try adding port 16685 (the query service's gRPC port). The -p 16685:16685 mapping below is the addition:

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 16685:16685 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 14250:14250 \
  -p 14268:14268 \
  -p 14269:14269 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.38
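
To sanity-check the fix, verify that both the UI port and the newly mapped gRPC port answer on the host. A quick sketch, assuming curl and nc are available locally:

# UI should answer over HTTP:
curl -sf http://localhost:16686/ > /dev/null && echo "UI ok"
# The newly mapped query gRPC port should accept TCP connections:
nc -z localhost 16685 && echo "gRPC port 16685 open"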

Works for me, with limited testing.

willief commented Oct 09 '22 06:10

The 16685 port is now exposed properly when using the operator. Closing this since it should not be a problem anymore.
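
To confirm this on an operator-managed install, listing the query Service's ports should now show 16685 alongside 16686. A sketch, assuming the <jaeger-name>-query Service naming used in this issue:

# List name and port of every port on the query Service:
kubectl get service jaeger-prod-query \
  -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.port}{"\n"}{end}'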

iblancasa commented Aug 08 '23 11:08