
Got plugin error "transport: error while dialing: dial unix /tmp/plugin"

Open bocharovf opened this issue 2 years ago • 6 comments

Describe the bug Got an error in the Jaeger UI with the ClickHouse gRPC plugin when searching for traces: HTTP Error: plugin error: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin2381205323: connect: connection refused"

It seems to happen:

  • either after several hours of Jaeger Query inactivity,
  • or after jaeger_index_local exceeds ~70 million rows.

ClickHouse is up and running. Restarting Jaeger Query fixes the problem temporarily (until the next long search).

To Reproduce Steps to reproduce the behavior:

  1. Ingest ~70 million rows into jaeger_index_local (a row-count check is sketched after these steps)
  2. Search for traces
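
For step 1, a minimal sketch of how to check how many rows jaeger_index_local currently holds, assuming the clickhouse-go driver; the DSN (host, database, credentials) is a placeholder, not a value from this deployment:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/ClickHouse/clickhouse-go/v2" // assumed driver; registers the "clickhouse" database/sql driver
)

func main() {
	// Illustrative DSN; adjust host, database and credentials to your deployment.
	db, err := sql.Open("clickhouse", "clickhouse://localhost:9000/default")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var rowCount uint64
	// jaeger_index_local is the span index table mentioned in the report.
	if err := db.QueryRow("SELECT count() FROM jaeger_index_local").Scan(&rowCount); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("jaeger_index_local rows: %d\n", rowCount)
}
```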

Expected behavior Traces are found

Screenshots: (screenshot of the Jaeger UI error attached)

Version (please complete the following information):

  • OS: Linux
  • Jaeger Query version: 1.25
  • ClickHouse plugin version: 0.8
  • ClickHouse version: 21.8.10.19
  • Deployment: Kubernetes

What troubleshooting steps did you try?

Additional context: jaeger-query-clickhouse-5ff64c9dbc-h7jr4.log (attached log file)

bocharovf avatar Nov 26 '21 21:11 bocharovf

Just measured: the error occurs after 16.7 s of trace searching.

In the ClickHouse error log:

2021.11.27 11:10:51.904396 [ 239 ] {5407e9ea-5b0a-4307-b4ba-645e44462eeb} <Error> TCPHandler: Code: 210, e.displayText() = DB::Exception: Connection reset by peer, while writing to socket (_10.X.X.X_:61097), Stack trace: ...
2021.11.27 11:10:51.904562 [ 239 ] {5407e9ea-5b0a-4307-b4ba-645e44462eeb} <Warning> TCPHandler: Client has gone away.
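
Those two server-side messages are what ClickHouse logs when the client hangs up mid-query, e.g. because a client-side deadline fired. A minimal sketch of that pattern, assuming the clickhouse-go driver; the 15 s deadline, DSN and query text are illustrative, not values taken from the plugin:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/ClickHouse/clickhouse-go/v2" // assumed driver; registers "clickhouse"
)

// Sketch: a client-side deadline that expires mid-query makes the driver close
// the TCP connection while ClickHouse is still writing the result set, which the
// server then reports as "Connection reset by peer" / "Client has gone away".
func main() {
	db, err := sql.Open("clickhouse", "clickhouse://localhost:9000/default") // illustrative DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) // hypothetical deadline
	defer cancel()

	rows, err := db.QueryContext(ctx, "SELECT traceID FROM jaeger_index_local") // illustrative query
	if err != nil {
		// If the deadline expired first, the error is context.DeadlineExceeded and
		// ClickHouse sees the socket drop instead of a completed read.
		log.Fatalf("query failed: %v (ctx err: %v)", err, ctx.Err())
	}
	defer rows.Close()

	for rows.Next() {
		// A slow scan loop over a large result can also trip the deadline mid-stream.
		var id string
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
	}
	if err := rows.Err(); err != nil {
		log.Fatalf("row iteration failed: %v (ctx err: %v)", err, ctx.Err())
	}
}
```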

bocharovf avatar Nov 27 '21 11:11 bocharovf

Related to #100. It seems that huge traces (100k+ spans) lead to a timeout. The ability to limit the number of fetched spans could partially help (a sketch of that idea follows).
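
A minimal sketch of what such a limit could look like on the query side, assuming the clickhouse-go driver; the table and column names (jaeger_spans_local, model, traceID), the DSN and the maxSpans value are illustrative, not the plugin's actual schema or configuration:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/ClickHouse/clickhouse-go/v2" // assumed driver; registers "clickhouse"
)

// fetchSpansCapped caps the number of span rows read for a single trace so a
// 100k+ span trace cannot keep the query running into the timeout.
func fetchSpansCapped(db *sql.DB, traceID string, maxSpans int) ([][]byte, error) {
	// maxSpans is an int under our control, so formatting it into LIMIT is safe here.
	query := fmt.Sprintf("SELECT model FROM jaeger_spans_local WHERE traceID = ? LIMIT %d", maxSpans)
	rows, err := db.Query(query, traceID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var spans [][]byte
	for rows.Next() {
		var span []byte
		if err := rows.Scan(&span); err != nil {
			return nil, err
		}
		spans = append(spans, span)
	}
	if len(spans) == maxSpans {
		log.Printf("trace %s truncated to %d spans", traceID, maxSpans)
	}
	return spans, rows.Err()
}

func main() {
	db, err := sql.Open("clickhouse", "clickhouse://localhost:9000/default") // illustrative DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	spans, err := fetchSpansCapped(db, "0123456789abcdef", 10000) // hypothetical trace ID and cap
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("fetched %d span rows", len(spans))
}
```

Putting the cap in the SQL itself means ClickHouse stops reading early too, instead of only trimming the result on the plugin side.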

bocharovf avatar Nov 28 '21 19:11 bocharovf

From the comments it sounds like this is caused by (and resolved by) #100.

nickbp avatar Jan 26 '22 01:01 nickbp

Unfortunately #100 helped only partially. There are still errors on our test server. I guess that we incorrectly handle query timeouts or connection drops with ClickHouse. After the error occurs, the plugin stops working entirely and all further requests fail. So this is a severe problem for us.
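
A sketch of the kind of handling described here: treat a timeout or a dropped ClickHouse connection as a per-request failure, verify the pool recovers, and keep serving subsequent requests instead of leaving the plugin wedged. The function name, retry policy and DSN are hypothetical, not taken from the plugin:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/ClickHouse/clickhouse-go/v2" // assumed driver; registers "clickhouse"
)

// queryWithRecovery runs a query and, on failure, checks whether ClickHouse is
// reachable again before one retry, so a single dropped connection does not
// poison every later request.
func queryWithRecovery(ctx context.Context, db *sql.DB, query string) (*sql.Rows, error) {
	rows, err := db.QueryContext(ctx, query)
	if err == nil {
		return rows, nil
	}
	log.Printf("query failed: %v; checking connection health", err)

	// database/sql drops broken connections from the pool on error, so a short
	// ping tells us whether ClickHouse is reachable again.
	pingCtx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	if pingErr := db.PingContext(pingCtx); pingErr != nil {
		return nil, pingErr // ClickHouse genuinely unreachable; surface the error
	}
	// One retry on a fresh connection. If ctx itself already expired, this fails
	// fast as well; backoff and retry limits are omitted from the sketch.
	return db.QueryContext(ctx, query)
}

func main() {
	db, err := sql.Open("clickhouse", "clickhouse://localhost:9000/default") // illustrative DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	rows, err := queryWithRecovery(ctx, db, "SELECT count() FROM jaeger_index_local")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
}
```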

bocharovf avatar Jan 27 '22 10:01 bocharovf

I've faced this issue as well. I believe it's related to resources. When querying a huge bunch of data, Jaeger starts to consume memory avalanche-like (see the attached memory graph). Sometimes an OOMKill doesn't occur: when the pod almost hits the memory request value, the plugin process is simply killed and the memory is flushed. As a result, the pod is not restarted and the plugin doesn't work. In the attached memory graph you can clearly see the pod almost hit the 1 GB request, then memory was flushed, the error transport: error while dialing: dial unix /tmp/plugin... connection refused appeared, and 2022-02-11T23:05:17.578Z [DEBUG] plugin process exited: path=/plugin/jaeger-clickhouse pid=13 error="signal: killed" showed up in the pod log. I reproduced this error in several minutes (blue graph). Full log excerpt:

2022-02-11T23:05:17.578Z [DEBUG] plugin process exited: path=/plugin/jaeger-clickhouse pid=13 error="signal: killed"
{"level":"info","ts":1644609917.5783212,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to IDLE","system":"grpc","grpc_log":true}
2022-02-11T23:05:17.592Z [DEBUG] stdio: received EOF, stopping recv loop: err="rpc error: code = Canceled desc = context canceled"
{"level":"info","ts":1644609917.5928702,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {IDLE <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1644609917.5929146,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1644609917.5929787,"caller":"grpclog/component.go:71","msg":"[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3780131,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.378047,"caller":"channelz/logging.go:50","msg":"[core]Subchannel picks a new address \"unused\" to connect","system":"grpc","grpc_log":true}
{"level":"warn","ts":1644610105.3781164,"caller":"channelz/logging.go:75","msg":"[core]grpc: addrConn.createTransport failed to connect to {unused unused <nil>  0 }. Err: connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781302,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781457,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {CONNECTING <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781548,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781884,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {TRANSIENT_FAILURE connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610105.3781965,"caller":"channelz/logging.go:50","msg":"[core]Channel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"error","ts":1644610105.378254,"caller":"app/http_handler.go:487","msg":"HTTP handler, Internal Server Error","error":"plugin error: rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"","stacktrace":"github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:487\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).getOperationsLegacy\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:180\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngithub.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5\n\tgithub.com/opentracing-contrib/[email protected]/nethttp/server.go:154\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/[email protected]/mux.go:210\ngithub.com/jaegertracing/jaeger/cmd/query/app.additionalHeadersHandler.func1\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/additional_headers_handler.go:28\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngithub.com/gorilla/handlers.CompressHandlerLevel.func1\n\tgithub.com/gorilla/[email protected]/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\ngithub.com/gorilla/handlers.recoveryHandler.ServeHTTP\n\tgithub.com/gorilla/[email protected]/recovery.go:78\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2879\nnet/http.(*conn).serve\n\tnet/http/server.go:1930"}
{"level":"info","ts":1644610106.3785865,"caller":"channelz/logging.go:50","msg":"[core]Subchannel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1644610106.3786347,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc0004da840, {IDLE connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin2834833615: connect: connection refused\"}","system":"grpc","grpc_log":true}``` 

vosmax avatar Feb 11 '22 20:02 vosmax

Starting from v1.33.0, Jaeger Query exits when the plugin crashes. This can serve as a workaround until the reason for the plugin crash is found.

leizhag avatar Jun 03 '22 03:06 leizhag