grpc-go
grpc-go copied to clipboard
Fix close in use certificate providers after double `Close()` method call on wrapper object
wrappedProvider add refcounter to certprovider.Provider, but there no guards to prevent multiple Close() call after single Build() call.
The situation is complicated by:
- the error does not appear where there was a double closure;
- before the rotation of certificates, the problem may not manifest itself (that is, the effect is greatly delayed in time);
certprovider.Providerimplementation don't panic/error on multipleClose()calls.
Fix errors like:
[core][Channel #39 SubChannel #941] grpc: addrConn.createTransport failed to connect to ...
Err: connection error: desc = "transport: authentication handshake failed:
xds: fetching trusted roots from CertificateProvider failed: provider instance is closed"
RELEASE NOTES:
- Fix
xds: fetching trusted roots from CertificateProvider failed: provider instance is closederror
@bozaro -- Calling Close() feels more like a User error to me and wouldnt categorize this as a bug. I agree that the API could have been better by not providing a Close() method which can be called multiple times, which would cause incorrect accounting wrt wp.refCount.
That being said, inorder to fix this, I would rather make Provider.Close a grpcsync.OnceFunc. That way we can cleanly assert that a reference can only decrement their reference when calling Close().
Calling Close() feels more like a User error to me and wouldnt categorize this as a bug.
I think that even if the double Close() call is considered a user error, it is useful to do a check here to localize the consequences of it and not play the whack-a-mole game.
The situation when all neighboring uses must be written correctly for your code to work is very fragile.
I have found at least two places through which double Close() is possible:
- https://github.com/grpc/grpc-go/blob/a4afd4d995b0e60b9beb7b54923fa74ef97a5098/xds/internal/balancer/cdsbalancer/cdsbalancer.go#L374-L376
- https://github.com/grpc/grpc-go/blob/a4afd4d995b0e60b9beb7b54923fa74ef97a5098/xds/internal/server/conn_wrapper.go#L161-L163
That being said, inorder to fix this, I would rather make
Provider.Closeagrpcsync.OnceFunc.
I can also add a wrapper in grpcsync.OnceFunc, but this will not make the code simpler: I still need to check the closing status to return an error from provider.KeyMaterial of the closed wrapper.
@bozaro : Thank you for figuring this out and sending us a PR.
There are a few options available to fix this issue:
- Make breaking API changes to this package
- Remove the
Closemethod from theProviderinterface - Change the signature of the
starterargument passed toNewBuildableConfigto return a tupleProvider, func(). The second return value is aclosefunction to be invoked when the provider is to be closed. - Change
BuildableConfig.Build()to return three values:(Provider, func(), error). The second return value would the close function returned by the provider wrapped in agrpcsync.OnceFunc. - With this approach, the user of a provider will be able to close it how many ever times they want, and things will still continue to work cleanly. But this approach could hide potential bugs in the user's code that never show up in their tests, but only happen in production. Therefore we want to avoid this approach
- Remove the
- The approach taken by you, where we return a separate per-instantiation wrapper that wraps the current
wrappedProvider.- With this approach, the user of a provider instance that has called
Closewill start seeingerrProviderClosedon subsequent invocations ofKeyMaterial() - While this approach is a significant improvement over the existing code, this still will not help the user easily debug when and where they are double closing.
- With this approach, the user of a provider instance that has called
Instead the approach that we like (which is quite similar to your approach) is as follows:
- Have a per-instantiation wrapper that wraps the current
wrappedProvider. Call it something else,wrappedProviderCloseris too close towrappedProvider. Maybe something likesingleCloseWrappedProviderorsingleRefWrappedProvideror something better.
type singleCloseWrappedProvider struct {
mu sync.Mutex
provider Provider
}
func (s *singleCloseWrappedProvider) KeyMaterial(ctx context.Context) (*KeyMaterial, error) {
s.mu.Lock()
p := s.provider
s.mu.Unlock()
return p.KeyMaterial(ctx)
}
func (s *singleCloseWrappedProvider) Close() {
s.mu.Lock()
s.provider.Close()
s.provider = panickingProvider{}
s.mu.Unlock()
}
- Have a provider implementation that panics in its
Close()method.
type panickingProvider struct{}
func (panickingProvider) Close() {
panic("Close called more than once on certificate provider")
}
func (panickingProvider) KeyMaterial(context.Context) (*KeyMaterial, error) {
return nil, errProviderClosed
}
- Change
BuildableConfig.Build()to always return an instance ofsingleCloseWrappedProviderwrapping a newly createdwrappedProvideror an existing one (in which case, the ref count on it would have to be incremented, as is done today)
Please let me know your thoughts on this approach.
Make breaking API changes to this package...
This approach has the opposite problem - you can use Provider after closing if it is used by someone else.
The approach taken by you, where we return a separate per-instantiation wrapper that wraps the current wrappedProvider...
I don't think double closure is a problem that needs to be dealt with:
- Go often uses a pattern like:
func foo() error {
x, err := os.Open(fileName)
if err != nil {
return err
}
defer x.Close()
// ...
if err := x.Close(); err != nil {
return err
}
// ...
return nil
}
- Errors will be localized:
- Close of one wrapper will not break neighboring ones;
- An unexpected Close wrapper will break code that continues to use the same wrapper;
- The current Provider implementation ignores double
Closecalls.
Instead the approach that we like (which is quite similar to your approach) is as follows:
I have corrected the PR, except that I do not panic on double Close call.
@easwars, as a result, I'm trying to make the wrappedProvider have the same behavior as the certprovider.Provider itself.
I don't think double closure is a problem that needs to be dealt with:
I agree that there are things like files and conns, where Go supports multiple calls to close. But there are things like channels, which panic when closed more than once.
But making the double close lead to panic in our case, we can ensure that our users are notified right at the moment where they are doing something wrong (i.e closing the same provider more than once). And this also ensures that this can be caught in their testing.
I agree that both approaches (whether we panic or not on double close) result in KeyMaterial returning errProviderClosed when a closed provider is used by the user. And this also possibly be caught in their testing.
@easwars, I tried adding panic when closing twice: https://github.com/bozaro/grpc-go/pull/2/files (all unit tests green).
But:
- This is a breaking change.
- I don't really understand what this
panicgives in this case from a practical point of view. - Unfortunately, I am not able to reproduce problems with double
Close()in a test environment, only in production (requirexds:///to reproduce issue). - When using the server in production with
xds:///+ mTLS, there is guaranteed (or very often) to be a doubleClose()call at the shutdown stage.
When using the server in production with xds:/// + mTLS, there is guaranteed (or very often) to be a double Close() call at the shutdown stage.
Hmm ... That is interesting. Were you able to narrow down to why there is a good chance of a double close in this scenario?
@dfawley for second set of eyes.
Can you provide any more information about how this happens?
The most common example of double Close() (I don't think it's the only one):
First Close() call:
google.golang.org/grpc/xds/internal/server.(*connWrapper).Close(0x4008382d80)
go/pkg/mod/google.golang.org/[email protected]/xds/internal/server/conn_wrapper.go:162 +0x50
crypto/tls.(*Conn).Close(0x40031f6e08?)
GOROOT/src/crypto/tls/conn.go:1429 +0xf4
google.golang.org/grpc/credentials/xds.(*credsImpl).ServerHandshake(0x40048b0030, {0x36bf288, 0x4008382d80})
go/pkg/mod/google.golang.org/[email protected]/credentials/xds/xds.go:250 +0x3f8
google.golang.org/grpc/internal/transport.NewServerTransport({0x36bf288, 0x4008382d80}, 0x400170be58)
go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:146 +0x84
google.golang.org/grpc.(*Server).newHTTP2Transport(0x40019f2a00, {0x36bf288, 0x4008382d80})
go/pkg/mod/google.golang.org/[email protected]/server.go:974 +0xfc
google.golang.org/grpc.(*Server).handleRawConn(0x40019f2a00, {0x4001db7f80, 0x9}, {0x36bf288, 0x4008382d80})
go/pkg/mod/google.golang.org/[email protected]/server.go:933 +0x94
google.golang.org/grpc.(*Server).Serve.func3()
go/pkg/mod/google.golang.org/[email protected]/server.go:917 +0x64
created by google.golang.org/grpc.(*Server).Serve in goroutine 429
go/pkg/mod/google.golang.org/[email protected]/server.go:916 +0x4f4
Second Close() call:
google.golang.org/grpc/xds/internal/server.(*connWrapper).Close(0x4008382d80)
go/pkg/mod/google.golang.org/[email protected]/xds/internal/server/conn_wrapper.go:162 +0x50
google.golang.org/grpc.(*Server).newHTTP2Transport(0x40019f2a00, {0x36bf288, 0x4008382d80})
go/pkg/mod/google.golang.org/[email protected]/server.go:986 +0x37c
google.golang.org/grpc.(*Server).handleRawConn(0x40019f2a00, {0x4001db7f80, 0x9}, {0x36bf288, 0x4008382d80})
go/pkg/mod/google.golang.org/[email protected]/server.go:933 +0x94
google.golang.org/grpc.(*Server).Serve.func3()
go/pkg/mod/google.golang.org/[email protected]/server.go:917 +0x64
created by google.golang.org/grpc.(*Server).Serve in goroutine 429
go/pkg/mod/google.golang.org/[email protected]/server.go:916 +0x4f4
@bozaro : Thank you very much for the stack traces.
I think this is what is happening:
- xDS enabled gRPC server gets a connection
- This server uses a custom
net.Listenerwhich accepts the connection and returns aconnWrapperhere: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/xds/internal/server/listener_wrapper.go#L343 - The gRPC server then spawns a goroutine to handle this connection here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/server.go#L927
- As part of handling the new connection, an HTTP/2 server transport is created here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/server.go#L984
- If the server transport creation fails and the returned error value is not
credentials.ErrConnDispatched, the rawConn is closed, which ends up callingCloseon theconnWrapper(this will be the second close)
- If the server transport creation fails and the returned error value is not
- As part of creating the HTTP/2 server transport, the server TLS handshake is initiated here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/internal/transport/http2_server.go#L146
- This calls to
ServerHandshakeis handled by the xDS credentials implementation here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/credentials/xds/xds.go#L207 - It wraps the rawConn in a
tls.Connhere: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/credentials/xds/xds.go#L248 - It then performs the TLS handshake by calling the
Handshakemethod on thetls.Connhere: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/credentials/xds/xds.go#L249- And if the handshake fails, then it calls
Closeon thetls.Conn, which ends up callingCloseon the wrapped conn, which essentially lands inconnWrapper(this will be the first close)
- And if the handshake fails, then it calls
Also, I see that we have defined this error sentinel ErrConnDispatched here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/credentials/credentials.go#L125, but none of the credentials implementations are returning this error.
I feel that the xDS credentials implementation should return this error in the case where the server handshake failed, and it ended up calling Close on the tls.Conn.
@dfawley
Do you know some history behind this error sentinel? What do you think about returning this value from the xDS creds implementation in the above case? Thanks.
Dispatched is used to allow another protocol to handle the connection. That's not intended for this kind of use case.
Interesting sequence of events. It seems if we use TLS to wrap a connection, we need to allow for double-close (either in connWrapper or certprovider), because:
- we can't start requiring that if any custom
ServerHandshakeerrors, that it closes the connection (as that would be a behavior change), and - we probably shouldn't remove the
conn.Closefrom our TLS handshaker, because not calling theClosemethod on atls.Connis probably a bad idea (even if it doesn't leak anything today).
Do I need to do anything with this PR (e.g. rebase)?
@dfawley : Based on our offline discussion, the changes here reflect mostly what we want. We can track the double close in the TLS code path separately (if required, and if we can do anything actionable there).
This is waiting for your second review, apart from the minor nit that I have above.