opentelemetry-collector
Graceful shutdown when a component panics
Is your feature request related to a problem? Please describe.
When a component panics, the Collector does not gracefully handle the panic and abruptly stops.
Describe the solution you'd like
The Collector would recover from the panic, log it correctly, and gracefully shut down. This could be done with a catch-all recover call in the service.Collector code, or at the level of the component helpers, which would recover and report a fatal error.
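For illustration, here is a minimal sketch of what the component-helper option could look like; `protect` and `reportFatal` are hypothetical names, not existing collector APIs.

```go
// Hypothetical component-helper wrapper: if the wrapped call panics,
// convert the panic into an error and report it so the service can
// shut down gracefully instead of crashing.
package panicguard

import "fmt"

// protect runs f and turns any panic into a reported fatal error.
// reportFatal is a placeholder for however the service would surface
// the error (e.g. to trigger a graceful shutdown).
func protect(reportFatal func(error), f func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("component panicked: %v", r)
			reportFatal(err)
		}
	}()
	return f()
}
```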
Describe alternatives you've considered
Continue as is without graceful shutdown.
@alolita please assign this issue to me.
@PaurushGarg @mx-psi I'm wondering if this is still being worked on. In our distribution of the collector it seems like panics are becoming more common as the number of components in contrib grows. Right now there's not a great way to gracefully handle panics from components and pass them along to the caller of the collector.
I am not actively working on this, so feel free to take it if you want. My main concern here was graceful shutdown; re-reading this, maybe we want to re-raise the panic after the shutdown. Feel free to open a PR and we can discuss there.
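A rough sketch of the "re-raise after shutdown" idea, with `run` and `shutdown` as placeholder hooks rather than actual collector functions:

```go
// Hypothetical wrapper illustrating the idea above: recover the panic,
// attempt a graceful shutdown, then panic again so the original panic
// value is still reported and the process still exits abnormally.
package panicguard

func runProtected(run func(), shutdown func()) {
	defer func() {
		if r := recover(); r != nil {
			shutdown() // best-effort graceful shutdown
			panic(r)   // re-raise so the panic is not silently swallowed
		}
	}()
	run()
}
```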
Got it. I don't have capacity currently but I do think my team is interested in solving this issue. We may take this on in the next few weeks.
Note that I may have misunderstood something obvious here, so please let me know if that's the case.
Summary
I don't think a central recover is possible given Go's panic and recover design. recover can only catch panics occurring on its own goroutine (from Go's spec on panics). In other words, a goroutine can't recover from another goroutine's panic, even if it's a child goroutine (another reference).
Practically, this means that to enforce graceful shutdown on panics, each component would need to implement recover functionality for each goroutine it spawns. Even this may not be entirely possible, as components may rely on a dependency that starts its own goroutines, which could in turn panic.
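A minimal standalone demonstration (not collector code) of the limitation described above: the deferred recover in main never runs for the child goroutine's panic, and the whole process crashes.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	defer func() {
		// Never reached for the child goroutine's panic; recover only
		// applies to panics on the same goroutine.
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()

	go func() {
		panic("panic on a child goroutine") // terminates the whole program
	}()

	time.Sleep(time.Second)
}
```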
Practical example
I tested this using the OTLP receiver ingesting traces over the gRPC protocol.
receivers:
  otlp/receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
I added a panic here for testing.
The following stack traces show the different goroutines spawned by the OTLP receiver on the way to the introduced panic.
Stack traces
trace.(*Receiver).Export (otlp.go:45) go.opentelemetry.io/collector/receiver/otlpreceiver/internal/trace
ptraceotlp.rawTracesServer.Export (grpc.go:89) go.opentelemetry.io/collector/pdata/ptrace/ptraceotlp
<autogenerated>:2
v1._TraceService_Export_Handler.func1 (trace_service.pb.go:310) go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1
configgrpc.enhanceWithClientInformation.func1 (configgrpc.go:396) go.opentelemetry.io/collector/config/configgrpc
v1._TraceService_Export_Handler (trace_service.pb.go:312) go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1
grpc.(*Server).processUnaryRPC (server.go:1369) google.golang.org/grpc
grpc.(*Server).handleStream (server.go:1780) google.golang.org/grpc
grpc.(*Server).serveStreams.func2.1 (server.go:1019) google.golang.org/grpc
runtime.goexit (asm_amd64.s:1695) runtime
:1
- Async Stack Trace
grpc.(*Server).serveStreams.func2 (server.go:1030) google.golang.org/grpc
Created by:
grpc.(*Server).serveStreams.func2 (server.go:1030) google.golang.org/grpc
transport.(*http2Server).operateHeaders (http2_server.go:631) google.golang.org/grpc/internal/transport
transport.(*http2Server).HandleStreams (http2_server.go:672) google.golang.org/grpc/internal/transport
grpc.(*Server).serveStreams (server.go:1013) google.golang.org/grpc
grpc.(*Server).handleRawConn.func1 (server.go:949) google.golang.org/grpc
runtime.goexit (asm_amd64.s:1695) runtime
- Async Stack Trace
grpc.(*Server).handleRawConn (server.go:948) google.golang.org/grpc
Created by:
grpc.(*Server).handleRawConn (server.go:926) google.golang.org/grpc
grpc.(*Server).Serve.func3 (server.go:917) google.golang.org/grpc
runtime.goexit (asm_amd64.s:1695) runtime
:1
- Async Stack Trace
grpc.(*Server).Serve (server.go:916) google.golang.org/grpc
Created by:
grpc.(*Server).Serve (server.go:916) google.golang.org/grpc
otlpreceiver.(*otlpReceiver).startGRPCServer.func1 (otlp.go:117) go.opentelemetry.io/collector/receiver/otlpreceiver
runtime.goexit (asm_amd64.s:1695) runtime
:1
- Async Stack Trace
otlpreceiver.(*otlpReceiver).startGRPCServer (otlp.go:109) go.opentelemetry.io/collector/receiver/otlpreceiver
The result of this is that even the receiver's main goroutine, the one calling startGRPCServer, can't recover from the panic. This rules out a uniform way to handle this in the Collector, from what I can tell. The panic in the export goroutine becomes a fatal panic once all of its deferred functions have run; it never gets a chance to bubble all the way back up to the goroutine running the Collector's Run method. Once a goroutine exits from a panic that has not been recovered, the whole program terminates.
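For what it's worth, one place a recover could be added in this particular path is a gRPC unary server interceptor. This is only a sketch, not an existing collector feature, and it would only cover panics raised inside RPC handler goroutines, not panics on goroutines started elsewhere by components or their dependencies.

```go
package grpcpanic

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// recoveringUnaryInterceptor converts a handler panic into an Internal
// gRPC error instead of letting it crash the process.
func recoveringUnaryInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (resp interface{}, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = status.Errorf(codes.Internal, "handler panicked: %v", r)
		}
	}()
	return handler(ctx, req)
}

// Registration would look roughly like:
//   grpc.NewServer(grpc.UnaryInterceptor(recoveringUnaryInterceptor))
```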
@crobert-1 I think the main factor in deciding here is whether the number of panics we would be able to catch with one or more recovers is significant enough to justify the effort of implementing this. We won't be able to catch all panics, but we would be able to catch some.
May have some dependency on https://github.com/open-telemetry/opentelemetry-collector/issues/9324
I am not sure this should be a target for service 1.0 (and therefore Collector 1.0); it's something we can do after 1.0 as an enhancement. Any thoughts?
I would vote for not including in 1.0.
Agreed, I removed this