grpc-swift
grpc-swift copied to clipboard
Make exceptions visible to the developer and potentially terminate the GRCPServer
Improve the handling of exceptions in GRPCServer such that fatal errors terminate the server and are reported to the developer.
Is your feature request related to a problem? Please describe it.
In my application, when GRPCServer.serve() is executed, an error occurs while setting up TLS in the transport layer causing an exception to be thrown. This exception is caught and rethrown by GRPCServer.serve() but it does not ever appear to be printed or otherwise made available to the developer; instead the exception appears to be silently swallowed. The GRPC server appears to be running fine when, in fact, it has failed.
Describe the solution you'd like
Fatal errors should cause the server to terminate, gracefully when possible. The developer should be informed about the exception to facilitate debugging and, in production, application monitoring. Silent failures are very bad.
Describe alternatives you've considered
If necessary, a Logger could be passed to GRPCServer.serve, but I presume there are good reasons for not doing so or it would already be the case.
Additional context
In my application, the TransportServices flavor of HTTP2ServerTransport is being used, but the error handling appears to be more general and within GRPCCore. Additionally, I'm using the ServiceLifetime run() code from grpc-swift-extras.
And, finally, this is my first grpc-swift 2.x server so it is possible that an RTFM answer is in order. Should that be the case, I apologize in advance.
In my application, when GRPCServer.serve() is executed, an error occurs while setting up TLS in the transport layer causing an exception to be thrown. This exception is caught and rethrown by GRPCServer.serve() but it does not ever appear to be printed or otherwise made available to the developer; instead the exception appears to be silently swallowed. The GRPC server appears to be running fine when, in fact, it has failed.
That's surprising. If GRPCServer.serve() throws then that error should be propagated into your program and handled there.
Are you sure the error is actually thrown from serve()?
I know the http2 transports do swallow some errors (certainly at the stream layer) so my hunch is that actually it's being incorrectly swallowed in the transport rather than in GRPCServer.
Fatal errors should cause the server to terminate, gracefully when possible. The developer should be informed about the exception to facilitate debugging and, in production, application monitoring. Silent failures are very bad.
I agree wholeheartedly that silent errors are usually very bad!
If necessary, a Logger could be passed to GRPCServer.serve, but I presume there are good reasons for not doing so or it would already be the case.
The reason is that there's no one-true-logger in Swift: there's swift-log and OSLog. This is an issue we want to address though. I think in all likelihood we'll end up with a different abstraction and offer an integration with swift-log and OSLog, possibly behind a package trait.
As always, glbrntt, thank you for your insight.
Bypassing for a second the question of whether this report belongs against grcp-swift, let me summarize the results of my investigation. Your instincts are good and the exception from the transport layer is indeed reaching GRPCServer.serve(). Unfortunately, the error description is lost when the exception is rethrown: "Server transport threw an error." This is suboptimal but I can appreciate that there are higher immediate priorities. I can file a new issue against grpc-swift-2 for potential resolution at a future date.
Here's where things get interesting and where my ignorance shows. Once the error is rethrown, the exception propagates to the run() function from grpc-swift-extras (Sources/GRPCServiceLifecycle/GRPCServer+Service.swift) and then into ServiceGroup from swift-service-lifecycle where the exception appears to vanish.
do {
try await serviceGroup.run()
} catch {
print("EXCEPTION: \(error.localizedDescription)")
}
The catch is never triggered and my server code happily proceeds as if nothing is wrong. Is the problem here with grpc-swift-extras as I thought earlier in the week, with swift-service-lifecycle, or (perhaps most likely) with how I'm using ServiceLifecycle? I don't currently know.
In any event, I propose closing this issue against grpc-swift. I can then write a narrowly focused issue against grpc-swift-2 about rethrowing transport exceptions. And I can continue to investigate.
Unfortunately, the error description is lost when the exception is rethrown: "Server transport threw an error." This is suboptimal but I can appreciate that there are higher immediate priorities. I can file a new issue against grpc-swift-2 for potential resolution at a future date.
The error should include a description of the underlying error too. See here and here.
Here's where things get interesting and where my ignorance shows. Once the error is rethrown, the exception propagates to the run() function from grpc-swift-extras (Sources/GRPCServiceLifecycle/GRPCServer+Service.swift) and then into ServiceGroup from swift-service-lifecycle where the exception appears to vanish.
do { try await serviceGroup.run() } catch { print("EXCEPTION: \(error.localizedDescription)") }The catch is never triggered and my server code happily proceeds as if nothing is wrong. Is the problem here with grpc-swift-extras as I thought earlier in the week, with swift-service-lifecycle, or (perhaps most likely) with how I'm using ServiceLifecycle? I don't currently know.
I'm a bit surprised that the error isn't thrown by the run() from the service group. Can you reproduce this? (I tried with a basic example but it did throw the error from the service.)
TLDR; Closing this issue because the error exists outside grpc-swift. Error reporting in grpc-swift could be better -- but I can't throw rocks because my own code reflects the same criticism no doubt. 😜
First, a high level review of the problem. On execution, our grpc-swift-2 server ran but did not open the port. No error messages were produced. No exceptions reached our code. And the server executable continued to run.
What really happened?
- In package grpc-swif-nio-transport, in file HTTP2ServerTransport+TransportServices.swift, at line 269, the SecIdentity provided by our code is passed to sec_identity_create which fails. This generates an exception that "There was an issue creating the SecIdentity required to set up TLS. Please check your TLS configuration."
- Back in
main(), the GRCP server and several other Swift ServiceLifecycle compliant services were run in a ServiceGroup. The ServiceGroup "uses task cancellation to tear down everything when a service' run() method returns early of throws an error". - As ServiceGroup uses graceful termination, each service can take time to perform cleanup. Unfortunately, one of the services (not GRPC Swift) chose to ignore the termination request.
- ServiceGroup continued to wait for all of the services to finish and held the exception. As a result, the developer code was never alerted to the error and no other log messages were generated.
With the badly performing service corrected, a GRPCCore.RuntimeError now reaches our code with { message: "Server transport threw an error.", cause: "unable to read data" }. While less clear than the original error, it's much better than nothing.
I'd like to apologize for the delay in updating this issue. After correcting the SecIdentity, the problem vanished and the team focused on positive functionality. I'm happy to report that, so far, everything has gone smoothly. Only recently were we able to returning to the negative cases and reproducing this behavior. Secondly, my inexperience with Swift ServiceLifecycle is clear. I never imagined that a completely unrelated service would block errors from grpc-swift -- but that does appear to be the case. I imagine that a feature request against swift-service-lifecycle will follow in the near future.
Thank you so much for the detailed follow-up on this issue. We really appreciate it, and it's enormously useful for those who follow in your footsteps.
Please do also file the feature request against ServiceLifecycle. This seems like a major need.
NOTE: If is possible to specify a maximum time to wait for graceful shutdown / cancellation before treating the situation as a fatalError().
var conf = ServiceGroupConfiguration(services: [pg, errorLoggerService, server], gracefulShutdownSignals: [.sigterm], logger: logger)
conf.maximumCancellationDuration = .seconds(15)
conf.maximumGracefulShutdownDuration = .seconds(15)
let serviceGroup = ServiceGroup(configuration: conf)
// Various initialization operations go here.
do {
logger.info("Starting Server on port \(port) with pid \(pid ?? "unspecified")")
try await serviceGroup.run()
} catch {
logger.error("Exception while running: \(error.localizedDescription)")
}
Issue filed against swift-service-lifecycle. #206
Thank you so much for the detailed follow-up on this issue. We really appreciate it, and it's enormously useful for those who follow in your footsteps.
I'd like to echo this sentiment, it's really helpful, so thank you.