grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

Fix close in use certificate providers after double `Close()` method call on wrapper object

Open bozaro opened this issue 1 year ago • 4 comments
trafficstars

wrappedProvider add refcounter to certprovider.Provider, but there no guards to prevent multiple Close() call after single Build() call.

The situation is complicated by:

  • the error does not appear where there was a double closure;
  • before the rotation of certificates, the problem may not manifest itself (that is, the effect is greatly delayed in time);
  • certprovider.Provider implementation don't panic/error on multiple Close() calls.

Fix errors like:

[core][Channel #39 SubChannel #941] grpc: addrConn.createTransport failed to connect to ...
Err: connection error: desc = "transport: authentication handshake failed:
xds: fetching trusted roots from CertificateProvider failed: provider instance is closed"

RELEASE NOTES:

  • Fix xds: fetching trusted roots from CertificateProvider failed: provider instance is closed error

bozaro avatar Apr 16 '24 07:04 bozaro

@bozaro -- Calling Close() feels more like a User error to me and wouldnt categorize this as a bug. I agree that the API could have been better by not providing a Close() method which can be called multiple times, which would cause incorrect accounting wrt wp.refCount.

That being said, inorder to fix this, I would rather make Provider.Close a grpcsync.OnceFunc. That way we can cleanly assert that a reference can only decrement their reference when calling Close().

arvindbr8 avatar Apr 17 '24 21:04 arvindbr8

Calling Close() feels more like a User error to me and wouldnt categorize this as a bug.

I think that even if the double Close() call is considered a user error, it is useful to do a check here to localize the consequences of it and not play the whack-a-mole game.

The situation when all neighboring uses must be written correctly for your code to work is very fragile.

I have found at least two places through which double Close() is possible:

  • https://github.com/grpc/grpc-go/blob/a4afd4d995b0e60b9beb7b54923fa74ef97a5098/xds/internal/balancer/cdsbalancer/cdsbalancer.go#L374-L376
  • https://github.com/grpc/grpc-go/blob/a4afd4d995b0e60b9beb7b54923fa74ef97a5098/xds/internal/server/conn_wrapper.go#L161-L163

That being said, inorder to fix this, I would rather make Provider.Close a grpcsync.OnceFunc.

I can also add a wrapper in grpcsync.OnceFunc, but this will not make the code simpler: I still need to check the closing status to return an error from provider.KeyMaterial of the closed wrapper.

bozaro avatar Apr 18 '24 05:04 bozaro

@bozaro : Thank you for figuring this out and sending us a PR.

There are a few options available to fix this issue:

  • Make breaking API changes to this package
    • Remove the Close method from the Provider interface
    • Change the signature of the starter argument passed to NewBuildableConfig to return a tuple Provider, func(). The second return value is a close function to be invoked when the provider is to be closed.
    • Change BuildableConfig.Build() to return three values: (Provider, func(), error) . The second return value would the close function returned by the provider wrapped in a grpcsync.OnceFunc.
    • With this approach, the user of a provider will be able to close it how many ever times they want, and things will still continue to work cleanly. But this approach could hide potential bugs in the user's code that never show up in their tests, but only happen in production. Therefore we want to avoid this approach
  • The approach taken by you, where we return a separate per-instantiation wrapper that wraps the current wrappedProvider.
    • With this approach, the user of a provider instance that has called Close will start seeing errProviderClosed on subsequent invocations of KeyMaterial()
    • While this approach is a significant improvement over the existing code, this still will not help the user easily debug when and where they are double closing.

Instead the approach that we like (which is quite similar to your approach) is as follows:

  • Have a per-instantiation wrapper that wraps the current wrappedProvider. Call it something else, wrappedProviderCloser is too close to wrappedProvider. Maybe something like singleCloseWrappedProvider or singleRefWrappedProvider or something better.
type singleCloseWrappedProvider struct {
       mu       sync.Mutex
       provider Provider
}

func (s *singleCloseWrappedProvider) KeyMaterial(ctx context.Context) (*KeyMaterial, error) {
       s.mu.Lock()
       p := s.provider
       s.mu.Unlock()
       return p.KeyMaterial(ctx)
}
func (s *singleCloseWrappedProvider) Close() {
       s.mu.Lock()
       s.provider.Close()
       s.provider = panickingProvider{}
       s.mu.Unlock()
}
  • Have a provider implementation that panics in its Close() method.
type panickingProvider struct{}

func (panickingProvider) Close() {
       panic("Close called more than once on certificate provider")
}
func (panickingProvider) KeyMaterial(context.Context) (*KeyMaterial, error) {
       return nil, errProviderClosed
}
  • Change BuildableConfig.Build() to always return an instance of singleCloseWrappedProvider wrapping a newly created wrappedProvider or an existing one (in which case, the ref count on it would have to be incremented, as is done today)

Please let me know your thoughts on this approach.

easwars avatar May 07 '24 21:05 easwars

Make breaking API changes to this package...

This approach has the opposite problem - you can use Provider after closing if it is used by someone else.

The approach taken by you, where we return a separate per-instantiation wrapper that wraps the current wrappedProvider...

I don't think double closure is a problem that needs to be dealt with:

  1. Go often uses a pattern like:
func foo() error {
	x, err := os.Open(fileName)
	if err != nil {
		return err
	}
	defer x.Close()
	// ...
	if err := x.Close(); err != nil {
		return err
	}
	// ...
	return nil
}
  1. Errors will be localized:
    • Close of one wrapper will not break neighboring ones;
    • An unexpected Close wrapper will break code that continues to use the same wrapper;
  2. The current Provider implementation ignores double Close calls.

Instead the approach that we like (which is quite similar to your approach) is as follows:

I have corrected the PR, except that I do not panic on double Close call.

bozaro avatar May 11 '24 07:05 bozaro

@easwars, as a result, I'm trying to make the wrappedProvider have the same behavior as the certprovider.Provider itself.

bozaro avatar May 15 '24 18:05 bozaro

I don't think double closure is a problem that needs to be dealt with:

I agree that there are things like files and conns, where Go supports multiple calls to close. But there are things like channels, which panic when closed more than once.

But making the double close lead to panic in our case, we can ensure that our users are notified right at the moment where they are doing something wrong (i.e closing the same provider more than once). And this also ensures that this can be caught in their testing.

I agree that both approaches (whether we panic or not on double close) result in KeyMaterial returning errProviderClosed when a closed provider is used by the user. And this also possibly be caught in their testing.

easwars avatar May 15 '24 19:05 easwars

@easwars, I tried adding panic when closing twice: https://github.com/bozaro/grpc-go/pull/2/files (all unit tests green).

But:

  • This is a breaking change.
  • I don't really understand what this panic gives in this case from a practical point of view.
  • Unfortunately, I am not able to reproduce problems with double Close() in a test environment, only in production (require xds:/// to reproduce issue).
  • When using the server in production with xds:/// + mTLS, there is guaranteed (or very often) to be a double Close() call at the shutdown stage.

bozaro avatar May 16 '24 11:05 bozaro

When using the server in production with xds:/// + mTLS, there is guaranteed (or very often) to be a double Close() call at the shutdown stage.

Hmm ... That is interesting. Were you able to narrow down to why there is a good chance of a double close in this scenario?

easwars avatar May 16 '24 19:05 easwars

@dfawley for second set of eyes.

easwars avatar May 16 '24 19:05 easwars

Can you provide any more information about how this happens?

The most common example of double Close() (I don't think it's the only one):

First Close() call:

google.golang.org/grpc/xds/internal/server.(*connWrapper).Close(0x4008382d80)
	go/pkg/mod/google.golang.org/[email protected]/xds/internal/server/conn_wrapper.go:162 +0x50
crypto/tls.(*Conn).Close(0x40031f6e08?)
	GOROOT/src/crypto/tls/conn.go:1429 +0xf4
google.golang.org/grpc/credentials/xds.(*credsImpl).ServerHandshake(0x40048b0030, {0x36bf288, 0x4008382d80})
	go/pkg/mod/google.golang.org/[email protected]/credentials/xds/xds.go:250 +0x3f8
google.golang.org/grpc/internal/transport.NewServerTransport({0x36bf288, 0x4008382d80}, 0x400170be58)
	go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:146 +0x84
google.golang.org/grpc.(*Server).newHTTP2Transport(0x40019f2a00, {0x36bf288, 0x4008382d80})
	go/pkg/mod/google.golang.org/[email protected]/server.go:974 +0xfc
google.golang.org/grpc.(*Server).handleRawConn(0x40019f2a00, {0x4001db7f80, 0x9}, {0x36bf288, 0x4008382d80})
	go/pkg/mod/google.golang.org/[email protected]/server.go:933 +0x94
google.golang.org/grpc.(*Server).Serve.func3()
	go/pkg/mod/google.golang.org/[email protected]/server.go:917 +0x64
created by google.golang.org/grpc.(*Server).Serve in goroutine 429
	go/pkg/mod/google.golang.org/[email protected]/server.go:916 +0x4f4

Second Close() call:

google.golang.org/grpc/xds/internal/server.(*connWrapper).Close(0x4008382d80)
	go/pkg/mod/google.golang.org/[email protected]/xds/internal/server/conn_wrapper.go:162 +0x50
google.golang.org/grpc.(*Server).newHTTP2Transport(0x40019f2a00, {0x36bf288, 0x4008382d80})
	go/pkg/mod/google.golang.org/[email protected]/server.go:986 +0x37c
google.golang.org/grpc.(*Server).handleRawConn(0x40019f2a00, {0x4001db7f80, 0x9}, {0x36bf288, 0x4008382d80})
	go/pkg/mod/google.golang.org/[email protected]/server.go:933 +0x94
google.golang.org/grpc.(*Server).Serve.func3()
	go/pkg/mod/google.golang.org/[email protected]/server.go:917 +0x64
created by google.golang.org/grpc.(*Server).Serve in goroutine 429
	go/pkg/mod/google.golang.org/[email protected]/server.go:916 +0x4f4

bozaro avatar May 17 '24 07:05 bozaro

@bozaro : Thank you very much for the stack traces.

I think this is what is happening:

  • xDS enabled gRPC server gets a connection
  • This server uses a custom net.Listener which accepts the connection and returns a connWrapper here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/xds/internal/server/listener_wrapper.go#L343
  • The gRPC server then spawns a goroutine to handle this connection here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/server.go#L927
  • As part of handling the new connection, an HTTP/2 server transport is created here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/server.go#L984
    • If the server transport creation fails and the returned error value is not credentials.ErrConnDispatched, the rawConn is closed, which ends up calling Close on the connWrapper (this will be the second close)
  • As part of creating the HTTP/2 server transport, the server TLS handshake is initiated here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/internal/transport/http2_server.go#L146
  • This calls to ServerHandshake is handled by the xDS credentials implementation here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/credentials/xds/xds.go#L207
  • It wraps the rawConn in a tls.Conn here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/credentials/xds/xds.go#L248
  • It then performs the TLS handshake by calling the Handshake method on the tls.Conn here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/credentials/xds/xds.go#L249
    • And if the handshake fails, then it calls Close on the tls.Conn, which ends up calling Close on the wrapped conn, which essentially lands in connWrapper (this will be the first close)

Also, I see that we have defined this error sentinel ErrConnDispatched here: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/credentials/credentials.go#L125, but none of the credentials implementations are returning this error.

I feel that the xDS credentials implementation should return this error in the case where the server handshake failed, and it ended up calling Close on the tls.Conn.

@dfawley
Do you know some history behind this error sentinel? What do you think about returning this value from the xDS creds implementation in the above case? Thanks.

easwars avatar May 17 '24 21:05 easwars

Dispatched is used to allow another protocol to handle the connection. That's not intended for this kind of use case.

Interesting sequence of events. It seems if we use TLS to wrap a connection, we need to allow for double-close (either in connWrapper or certprovider), because:

  1. we can't start requiring that if any custom ServerHandshake errors, that it closes the connection (as that would be a behavior change), and
  2. we probably shouldn't remove the conn.Close from our TLS handshaker, because not calling the Close method on a tls.Conn is probably a bad idea (even if it doesn't leak anything today).

dfawley avatar May 20 '24 15:05 dfawley

Do I need to do anything with this PR (e.g. rebase)?

bozaro avatar May 21 '24 08:05 bozaro

@dfawley : Based on our offline discussion, the changes here reflect mostly what we want. We can track the double close in the TLS code path separately (if required, and if we can do anything actionable there).

This is waiting for your second review, apart from the minor nit that I have above.

easwars avatar May 28 '24 17:05 easwars