Taking 30 seconds to shut down
swift-otel has a bug that causes it to take 30 seconds to shut down.
This is particularly troublesome when running commands during Vapor development: commands that should finish in a few seconds instead stall for 30 seconds before exiting. It is also confusing, because it is not obvious that OTel is what's causing the hang.
I have created a project that easily reproduces the issue; please check it out. Detailed steps are in that project's README.
https://github.com/omochi/otel-shutdown-hang-repro/
I will also post the reproduction code here. The issue occurs simply by setting up Vapor and OTel with a fairly typical configuration.
import Vapor
import Logging
import ServiceLifecycle
import Tracing
import OTel
import OTLPGRPC
struct Cmd: AsyncCommand {
    struct Signature: CommandSignature {}

    var help: String { "help" }

    func run(using context: CommandContext, signature: Signature) async throws {
        withSpan("hello cmd") { (span) in
            print("hello cmd")
        }
    }
}
public struct App {
    private static func makeTracer() async throws -> any Service {
        let environment = OTelEnvironment.detected()

        // Detect resource attributes from the process and environment,
        // then add the service name manually.
        var resourceDetection = OTelResourceDetection(
            detectors: [
                OTelProcessResourceDetector(),
                OTelEnvironmentResourceDetector(environment: environment)
            ]
        )
        let serviceName = "otel-app"
        resourceDetection.detectors.append(
            .manual(
                OTelResource(attributes: [
                    "service.name": .string(serviceName)
                ])
            )
        )
        let resource = await resourceDetection.resource(environment: environment)

        // Export spans over OTLP/gRPC.
        let exporter = try OTLPGRPCSpanExporter(
            configuration: .init(environment: environment),
            group: .singletonMultiThreadedEventLoopGroup,
            requestLogger: Logger(label: "SpanExporter"),
            backgroundActivityLogger: Logger(label: "SpanExporterBG")
        )

        let idGenerator = OTelRandomIDGenerator()
        let sampler = OTelConstantSampler(isOn: true)
        let propagator = OTelW3CPropagator()

        // Batch spans before export, using the environment-based default configuration.
        let processor = OTelBatchSpanProcessor(
            exporter: exporter,
            configuration: .init(
                environment: environment
            )
        )

        let tracer = OTelTracer(
            idGenerator: idGenerator,
            sampler: sampler,
            propagator: propagator,
            processor: processor,
            environment: environment,
            resource: resource
        )

        // Bootstrap the global tracer and return it so it can run as a service.
        InstrumentationSystem.bootstrap(tracer)
        return tracer
    }

    public static func main() async throws {
        var env = try Environment.detect()
        try LoggingSystem.bootstrap(from: &env)

        let tracer = try await makeTracer()
        let app = Application(env)

        // Run the OTel tracer as a swift-service-lifecycle service alongside the Vapor app.
        let serviceGroup = ServiceGroup(services: [tracer], logger: app.logger)
        let serviceGroupTask = Task {
            try await serviceGroup.run()
        }

        app.routes.get("") { (request) in
            withSpan("GET /") { (span) in
                return "hello"
            }
        }
        app.asyncCommands.use(Cmd(), as: "cmd")

        try await app.execute()
        app.shutdown()

        // Shut down the tracer after the app / command finishes.
        await serviceGroup.triggerGracefulShutdown()
        try await serviceGroupTask.result.get()
    }
}
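With this setup, running the command (swift run run cmd in the demo project) prints hello cmd right away, but the process then hangs for about 30 seconds before exiting.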
Hey @omochi, thanks for the detailed bug report and demo project 🙏
Default export timeout
What you've been running into is the default export timeout, which is set to 30 seconds: https://github.com/slashmo/swift-otel/blob/dc0b208a3d7774d2a0ea7cb8ea657cdce6ca08cb/Sources/OTel/Tracing/Processing/Batch/OTelBatchSpanProcessorConfiguration.swift#L39
You can see this in your demo by running the example with debug logging enabled (swift run run cmd --log debug), which logs:
[ DEBUG ] Force flush timed out. (OTel/OTelBatchSpanProcessor.swift:91)
[ DEBUG ] Failed to export batch. [batch_id: 0, error: GRPCStatus, error_description: cancelled (1): RPC was cancelled by the client, cause: RPC was cancelled by the client] (OTel/OTelBatchSpanProcessor.swift:133)
Changing the export timeout
You can change the export timeout by passing an override when creating the OTelBatchSpanProcessor:
let processor = OTelBatchSpanProcessor(
    exporter: exporter,
    configuration: .init(
        environment: environment,
        exportTimeout: .seconds(10)
    )
)
Alternatively, if you prefer to change this timeout at runtime, you can set it via the environment variable OTEL_BSP_EXPORT_TIMEOUT (the value is in milliseconds):
OTEL_BSP_EXPORT_TIMEOUT=10000 swift run run cmd
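If you also want spans handed to the exporter sooner during quick command runs, the same configuration has a knob mirroring OTEL_BSP_SCHEDULE_DELAY. As a minimal sketch, reusing the exporter and environment from the setup above and assuming the initializer exposes a scheduleDelay parameter alongside exportTimeout (the values here are just illustrative):

// Development-oriented tuning (illustrative values): export batches quickly
// and give up fast if no collector is reachable.
let processor = OTelBatchSpanProcessor(
    exporter: exporter,
    configuration: .init(
        environment: environment,
        scheduleDelay: .seconds(1),  // assumed parameter, mirrors OTEL_BSP_SCHEDULE_DELAY
        exportTimeout: .seconds(5)   // mirrors OTEL_BSP_EXPORT_TIMEOUT, shown above
    )
)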
Perhaps the documentation could be improved to highlight this more clearly, but I wouldn't consider this a bug in swift-otel.
super useful detail @slashmo - thanks for the info. I just saw this myself, couldn't quite understand why things appeared to hang a bit, but this makes perfect sense.
@heckj Thanks for chiming in 🙏 It definitely seems like the visibility of this default should be improved. Out of curiosity, did you try to find an answer in the documentation first, or did you jump straight to the GitHub issues? I'm trying to figure out where best to document this.
I'd set up tracing in a test intentionally, noticed the delay, and thought "huh, okay" - but didn't investigate it as an issue directly. Then I happened across this issue while perusing the pull requests that expand the example setups, specifically to get something set up for the Hummingbird 2.0 project I'd just established. Once I read through it I thought, "Oh, yeah - that makes sense", and then followed up by looking at the other services that bootstrap, all of which have very tunable export parameters that basically determine how much a fast shutdown gets delayed. The same applies across the whole suite of "services that export stuff on the side", and in my case it came up while setting up for testing.
Granted, that's a pretty niche use case (at least I don't hear of others pushing on that front too often), but if there were an article or similar on setting up testing with Vapor and/or Hummingbird and these services, I'd definitely go for that. Sounds like @omochi was more focused on the development "quick iteration cycle", which I can see just as easily.
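For the testing / quick-iteration case, a sketch of what that tuning could look like, reusing only the knobs already shown in this thread (the exportTimeout override plus the exporter and environment from the repro); the values are arbitrary:

// Test-oriented setup: cap the export timeout so shutdown waits at most a
// couple of seconds for the exporter instead of the 30-second default.
let processor = OTelBatchSpanProcessor(
    exporter: exporter,
    configuration: .init(
        environment: environment,
        exportTimeout: .seconds(2)
    )
)

Or, without touching code, set OTEL_BSP_EXPORT_TIMEOUT=2000 in the test runner's environment.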