
Graceful shutdown when a component panics

Open mx-psi opened this issue 2 years ago • 7 comments

Is your feature request related to a problem? Please describe.

When a component panics, the Collector does not gracefully handle the panic and abruptly stops.

Describe the solution you'd like

The Collector would recover from the panic, log it correctly, and shut down gracefully. This could be done with a catch-all recover call in the service.Collector code, or at the level of the component helpers, which would recover and report a fatal error.
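A minimal sketch of the catch-all recover idea, assuming hypothetical runWithRecovery, collectorRun, and shutdown stand-ins rather than actual collector APIs (note that a recover only catches panics raised on its own goroutine; see the later comments):

package main

import (
	"fmt"
	"log"
	"runtime/debug"
)

// runWithRecovery wraps the collector's run loop with a deferred recover that
// logs the panic, attempts a graceful shutdown, and reports the failure as an
// error. collectorRun, logger, and shutdown are hypothetical stand-ins.
func runWithRecovery(collectorRun func() error, logger *log.Logger, shutdown func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			logger.Printf("component panicked: %v\n%s", r, debug.Stack())
			shutdown() // attempt graceful shutdown before reporting the failure
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return collectorRun()
}

func main() {
	logger := log.Default()
	err := runWithRecovery(
		func() error { panic("component failure") }, // simulated panicking component
		logger,
		func() { logger.Println("shutting down pipelines") },
	)
	logger.Println("run returned:", err)
}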

Describe alternatives you've considered

Continue as is without graceful shutdown.

mx-psi avatar Sep 01 '21 11:09 mx-psi

@alolita please assign this issue to me.

PaurushGarg avatar Sep 29 '21 17:09 PaurushGarg

@PaurushGarg @mx-psi I'm wondering if this is still being worked on. In our distribution of the collector it seems like panics are becoming more common as the number of components in contrib grows. Right now there's not a great way to gracefully handle panics from components and pass them along to the caller of the collector.

cpheps avatar Nov 08 '22 13:11 cpheps

I am not actively working on this; feel free to take it if you want. My main concern with this was graceful shutdown; re-reading this, maybe we want to raise the panic again after the shutdown. Feel free to open a PR and we can discuss there.
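For illustration, a rough sketch of that "raise the panic again after the shutdown" idea; runComponent and shutdown are hypothetical stand-ins, not collector APIs:

package main

import "log"

// runComponent is a hypothetical wrapper: on panic it performs a graceful
// shutdown and then re-raises the panic so the process still exits abnormally.
func runComponent(run func(), shutdown func()) {
	defer func() {
		if r := recover(); r != nil {
			shutdown() // graceful shutdown first
			panic(r)   // then re-raise the original panic value
		}
	}()
	run()
}

func main() {
	runComponent(
		func() { panic("component failure") },
		func() { log.Println("graceful shutdown complete") },
	)
}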

mx-psi avatar Nov 08 '22 13:11 mx-psi

Got it. I don't have capacity currently but I do think my team is interested in solving this issue. We may take this on in the next few weeks.

cpheps avatar Nov 08 '22 14:11 cpheps

Note that I may have misunderstood something obvious here, so please let me know if that's the case.

Summary

I don't think a central recover is possible given Go's panic and recover design. recover can only catch a panic raised on its own goroutine (from Go's spec on panics). In other words, a goroutine can't recover from another goroutine's panic, even if it's a child goroutine (another reference).

Practically, what this means is that to enforce graceful shutdown on panics, each component would need to implement recover functionality for each goroutine it spawns. Even that may not be fully possible, as components may rely on a dependency that starts its own goroutines, which could in turn panic.
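A minimal, self-contained demo of that limitation (not collector code): the deferred recover below runs on the parent goroutine, so the child goroutine's panic still crashes the program.

package main

import (
	"fmt"
	"time"
)

func main() {
	// This recover is on main's goroutine; it cannot see the child's panic.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // never reached
		}
	}()

	go func() {
		panic("panic in a child goroutine") // crashes the whole program
	}()

	time.Sleep(100 * time.Millisecond)
}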

Practical example

I tested with the OTLP receiver ingesting traces over the gRPC protocol.

receivers:
  otlp/receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

I added a panic here for testing.

The following shows the stack traces of the different goroutines spawned by the OTLP receiver on the way to the introduced panic.

Stack traces
trace.(*Receiver).Export (otlp.go:45) go.opentelemetry.io/collector/receiver/otlpreceiver/internal/trace
ptraceotlp.rawTracesServer.Export (grpc.go:89) go.opentelemetry.io/collector/pdata/ptrace/ptraceotlp
<autogenerated>:2
v1._TraceService_Export_Handler.func1 (trace_service.pb.go:310) go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1
configgrpc.enhanceWithClientInformation.func1 (configgrpc.go:396) go.opentelemetry.io/collector/config/configgrpc
v1._TraceService_Export_Handler (trace_service.pb.go:312) go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1
grpc.(*Server).processUnaryRPC (server.go:1369) google.golang.org/grpc
grpc.(*Server).handleStream (server.go:1780) google.golang.org/grpc
grpc.(*Server).serveStreams.func2.1 (server.go:1019) google.golang.org/grpc
runtime.goexit (asm_amd64.s:1695) runtime
:1
 - Async Stack Trace
grpc.(*Server).serveStreams.func2 (server.go:1030) google.golang.org/grpc

Created by:

grpc.(*Server).serveStreams.func2 (server.go:1030) google.golang.org/grpc
transport.(*http2Server).operateHeaders (http2_server.go:631) google.golang.org/grpc/internal/transport
transport.(*http2Server).HandleStreams (http2_server.go:672) google.golang.org/grpc/internal/transport
grpc.(*Server).serveStreams (server.go:1013) google.golang.org/grpc
grpc.(*Server).handleRawConn.func1 (server.go:949) google.golang.org/grpc
runtime.goexit (asm_amd64.s:1695) runtime
 - Async Stack Trace
grpc.(*Server).handleRawConn (server.go:948) google.golang.org/grpc

Created by:

grpc.(*Server).handleRawConn (server.go:926) google.golang.org/grpc
grpc.(*Server).Serve.func3 (server.go:917) google.golang.org/grpc
runtime.goexit (asm_amd64.s:1695) runtime
:1
 - Async Stack Trace
grpc.(*Server).Serve (server.go:916) google.golang.org/grpc

Created by:

grpc.(*Server).Serve (server.go:916) google.golang.org/grpc
otlpreceiver.(*otlpReceiver).startGRPCServer.func1 (otlp.go:117) go.opentelemetry.io/collector/receiver/otlpreceiver
runtime.goexit (asm_amd64.s:1695) runtime
:1
 - Async Stack Trace
otlpreceiver.(*otlpReceiver).startGRPCServer (otlp.go:109) go.opentelemetry.io/collector/receiver/otlpreceiver

The result of this is that even the receiver's main goroutine, which calls startGRPCServer, can't recover from the panic. This rules out a uniform way to handle this in the collector, from what I can tell. The panic in the export goroutine becomes a fatal panic once all of its deferred functions have run; it never gets a chance to bubble all the way back up to the goroutine running the collector's Run method. Once a goroutine exits from a panic that has not been recovered, the whole program terminates.
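For completeness, a deferred recover placed on the same goroutine as the handler does still work. As a sketch (assumptions: this is plain gRPC wiring, not how the collector's configgrpc is actually set up, and onPanic is a hypothetical hook), a unary server interceptor could turn a handler panic like the injected one into an error:

package main

import (
	"context"
	"fmt"
	"log"

	"google.golang.org/grpc"
)

// panicRecoveryInterceptor recovers from panics raised inside unary RPC
// handlers. Its deferred recover runs on the same goroutine as the handler,
// so a panic in something like the OTLP Export handler becomes an error
// instead of crashing the process.
func panicRecoveryInterceptor(onPanic func(any)) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (resp any, err error) {
		defer func() {
			if r := recover(); r != nil {
				onPanic(r)
				err = fmt.Errorf("%s panicked: %v", info.FullMethod, r)
			}
		}()
		return handler(ctx, req)
	}
}

func main() {
	// Hypothetical wiring: the callback could notify whatever coordinates a
	// graceful shutdown; register services and call Serve() as usual.
	srv := grpc.NewServer(grpc.ChainUnaryInterceptor(panicRecoveryInterceptor(func(p any) {
		log.Printf("recovered panic: %v; requesting graceful shutdown", p)
	})))
	_ = srv
}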

crobert-1 avatar Apr 18 '24 22:04 crobert-1

@crobert-1 I think the main factor in deciding here is whether the number of panics we would be able to catch with one or more recovers is significant enough to justify the effort of implementing this. We won't be able to catch all panics, but we would be able to catch some.

mx-psi avatar Apr 19 '24 10:04 mx-psi

May have some dependency on https://github.com/open-telemetry/opentelemetry-collector/issues/9324

TylerHelmuth avatar Apr 22 '24 16:04 TylerHelmuth