
go/pkg/balancer uses experimental gRPC APIs

Open dfawley opened this issue 1 year ago • 15 comments

These APIs are labeled as experimental, and they will be changed in the near-ish future. Details are in https://github.com/grpc/grpc-go/issues/6472. Note that any package that uses gRPC's experimental APIs should itself be considered experimental, as a change to them would break compilation for the entire package that uses them. So we do NOT recommend the use of these features in any other libraries or applications that aren't tolerant to breakages like this.

https://github.com/googleapis/google-cloud-go may have another solution to the same type of problem, which is that it creates multiple channels and load balances between them. See https://github.com/googleapis/google-api-go-client/blob/caea95689f82049822552cd649765335123831e0/transport/grpc/pool.go#L30 for details.
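The core of that pool.go approach is simple: open N channels and pick the next one per call. A minimal stdlib-only sketch of the pattern (the real code uses `*grpc.ClientConn`; here `conn`, `fakeConn`, and `roundRobinPool` are illustrative stand-ins):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// conn stands in for *grpc.ClientConn in this sketch.
type conn interface {
	Target() string
}

type fakeConn struct{ target string }

func (f *fakeConn) Target() string { return f.target }

// roundRobinPool mirrors the pattern in google-api-go-client's
// transport/grpc/pool.go: N channels, pick the next one per call.
type roundRobinPool struct {
	conns []conn
	idx   uint32 // incremented atomically on each pick
}

func (p *roundRobinPool) pick() conn {
	n := atomic.AddUint32(&p.idx, 1)
	return p.conns[n%uint32(len(p.conns))]
}

func main() {
	p := &roundRobinPool{conns: []conn{
		&fakeConn{"ch0"}, &fakeConn{"ch1"}, &fakeConn{"ch2"},
	}}
	for i := 0; i < 4; i++ {
		fmt.Println(p.pick().Target()) // ch1, ch2, ch0, ch1
	}
}
```

Because each pick is an atomic increment, the pool needs no locking and no experimental gRPC APIs.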

dfawley avatar Aug 14 '23 20:08 dfawley

Thanks Doug. This has been working just fine so far and I'm worried about regressions if I change the implementation. I'm considering copying the relevant code over under an "internal" directory here to remove the direct dependency on the experimental API.

Would that be sufficient to remove the direct dependency on the experimental API? Do you see any licensing issues? CL: https://github.com/mrahs/remote-apis-sdks/commit/4cafa22144b4c05d93bf3406fa3462f7bc06b240

mrahs avatar Sep 09 '23 01:09 mrahs

@mrahs,

I'm considering copying the relevant code over under an "internal" directory here to remove the direct dependency on the experimental API.

Maybe I'm misunderstanding, but I don't know how this could actually work unless you fork all of grpc. You're depending on grpc's behavior as much as the APIs themselves, so just copying the APIs won't let you keep using the functionality in grpc that is behind those APIs in the client implementation.

Worst case, I can help to just update the deprecated things to the newer versions. You'd still be using a currently-experimental API, which isn't great but is no worse than the current state at least.

Are you actually running into GFE max streams limits in practice? If so you might want to adopt something like I linked in the original comment, which seems to be working for the google cloud client APIs. That avoids using any experimental APIs of gRPC's. You may even be able to use the package directly, since it's not in internal. @codyoss might be able to help with this.

Feel free to put something on my calendar any time you see an opening to discuss if desired.

dfawley avatar Sep 13 '23 16:09 dfawley

Since the grpc-go version is pinned here, I was considering the simple case of not wanting this module to depend on the upstream one. My reading of the issue is that this module will break (either at compile or runtime) when grpc-go is updated to a version where the experimental APIs have changed (in behaviour or interface). Otherwise, it should continue to work.

I'll look into addressing this properly as you suggested.

mrahs avatar Sep 15 '23 18:09 mrahs

Since the grpc-go version is pinned here

If you are a binary and not a library, then this will work for you in theory until you need to update grpc-go, or until you need to update a dependency that updates grpc-go. But in that case, no change is needed here at all.

However, since we both import to google3, this will be making my life more challenging, as we'll need to get you updated before gRPC-Go can delete the old APIs.

dfawley avatar Sep 15 '23 18:09 dfawley

Right... this forces SDK users to stay on the same grpc-go version until this module is updated. I also forgot about the google3 copy... The urgency of this makes more sense now.

mrahs avatar Sep 15 '23 18:09 mrahs

@codyoss I'm wondering how bad connections are handled in the pool implementation in cloud libraries: https://github.com/googleapis/google-api-go-client/blob/5e4c19e2fded180e57163208fcf264d26166a99e/transport/grpc/pool.go#L30 Is the client expected to close the entire pool if one connection fails?

IIUC, the experimental balancer package defines an API for managing the lifecycle of a connection: https://github.com/grpc/grpc-go/blob/e88e8498c6df39703d1992bc4b677e7ef477145d/balancer/balancer.go#L384 which allows regenerating a connection without affecting others.

mrahs avatar Oct 19 '23 22:10 mrahs

Is the client expected to close the entire pool if one connection fails?

If a ClientConn loses its connection, it will re-establish it on its own. But I'm not sure whether the connection pool would skip it when dispatching RPCs - without something like that, some RPCs might error or be queued even when other connections could be used successfully instead.

the experimental balancer package defines an API for

The main goal is to avoid using that outside of gRPC until the API is stable, though.

dfawley avatar Oct 20 '23 15:10 dfawley

Thanks Doug. I think that answers my question about cloud libraries, but I'm still not sure how best to exclude connections in a bad state from the pool. The balancer API helps keep track of good SubConns that the picker can choose from.

I'm considering an approach similar to the one in the cloud libraries: overriding Invoke and NewStream to delegate to a dispatcher before forwarding the call to the underlying ClientConn. However, I'm not yet sure how to exclude connections in a bad state.

mrahs avatar Oct 20 '23 17:10 mrahs

I'm still not sure how best to exclude connections in bad state from the pool

We do have an API that shows the connectivity state of the channel as a whole. But I wonder if you can ignore that potential problem and just re-use what the client libraries have directly? You'll always have to deal with retrying failed RPCs anyway, right? For example, if you have only one channel and an RPC on it errors.

dfawley avatar Oct 23 '23 23:10 dfawley

The method I found GetState is also experimental: https://github.com/grpc/grpc-go/blob/e88e8498c6df39703d1992bc4b677e7ef477145d/clientconn.go#L738 Is there a stable alternative?

Without reacting to connection state changes, there is potential to increase retry rates. I don't know how bad that might be in the wild, but I'm considering my options to mitigate it. My best bet so far is to spread retries across the pool to minimize the chance of a retry landing on the same bad connection. Round-robin seems like the simplest policy that achieves that, but I don't know who is relying on the current affinity-based-and-least-busy policy here, and I'd rather not find out by pulling the plug on them.
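The "skip bad connections" idea can be sketched without gRPC at all. In this stand-alone sketch, `state` mimics grpc-go's `connectivity.State`, and `pickReady` scans forward from a starting index to find a READY connection; in real code the check would use `(*grpc.ClientConn).GetState`, which is itself experimental:

```go
package main

import "fmt"

// state mimics grpc's connectivity.State for this sketch.
type state int

const (
	ready state = iota
	connecting
	transientFailure
)

type conn struct {
	name string
	st   state
}

// pickReady returns the first READY connection at or after start,
// falling back to the starting one if none is READY.
func pickReady(conns []*conn, start int) *conn {
	for i := 0; i < len(conns); i++ {
		c := conns[(start+i)%len(conns)]
		if c.st == ready {
			return c
		}
	}
	return conns[start%len(conns)]
}

func main() {
	pool := []*conn{
		{"c0", transientFailure},
		{"c1", ready},
		{"c2", connecting},
	}
	fmt.Println(pickReady(pool, 0).name) // skips c0, picks c1
}
```

Note the caveat from earlier in the thread: a state check like this is lossy, so it reduces, but does not eliminate, the chance of dispatching to a connection that is about to fail.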

mrahs avatar Oct 23 '23 23:10 mrahs

The method I found GetState is also experimental: grpc/grpc-go@e88e849/clientconn.go#L738

Ah, right. I don't think we particularly like that API since it's lossy (i.e. you can miss a temporary state), which isn't important for most use cases but also isn't really ideal. I suspect that one would be difficult to ever remove or change, unlike the balancer APIs, and there are no current plans to do so.

Without reacting to connection status changes there is potential to increase retry rates. I don't know how bad might that be in the wild

I honestly don't think it would be any worse than using a single channel with the current LB policy you are using.

RPC failures will occur for RPCs currently in flight if a backend goes down. At that point, new RPCs to that channel will not fail. The channel will go into CONNECTING, so they would be queued while the new connection is established. If no backend can be reached, then the RPCs queued on it will fail eventually, but most likely all the other channels will be down as well.

I don't know who is relying on the current affinity-based-and-least-busy policy here and I'd rather not find out by pulling the plug on them.

Interesting. Who uses the gRPC clients created with this LB policy? I had assumed it was your code. I thought you just copied this out of https://github.com/GoogleCloudPlatform/grpc-gcp-go for the connection pooling aspect, but for your own use, and because of the max concurrent streams issue you were running into on GFE.

dfawley avatar Oct 25 '23 18:10 dfawley

It turns out the code was actually copied from the gcp repo. I can test changes with bazelbuild/reclient, but I'm not sure about other potential clients. I'll simplify the policy to get rid of the experimental API usage. I'll also try to make it pluggable so that anyone who really wants a more sophisticated policy can plug in their own (unless that turns into a deep rabbit hole).

mrahs avatar Apr 19 '24 20:04 mrahs

All custom LB policies use experimental code, as the balancer package itself is experimental. :)

I still recommend starting with what I mentioned in the initial comment; let me know if that solution isn't working for any reason, and we can go from there. I'm happy to meet and discuss details in person if it helps.

dfawley avatar Apr 19 '24 21:04 dfawley

Posted #554 as an initial step. Will post another one to remove the dependent code once the new balancer is fully rolled out.

I'm not 100% sure about using grpc.ClientConnInterface based on its godocs. I also had to resort to a type assertion to close the underlying grpc.ClientConn. Any tips?

mrahs avatar Apr 24 '24 15:04 mrahs

I don't think I would use that interface and then type-assert back to *grpc.ClientConn in order to close it.

Instead, I'd define my own interface like:

type grpcClientConn interface {
	grpc.ClientConnInterface
	Close() error
}

Now you know (and can verify in code) that *grpc.ClientConn implements it, as does the RRBalancer:

var _ grpcClientConn = (*grpc.ClientConn)(nil)
var _ grpcClientConn = (*balancer.RoundRobinConnPool)(nil)

Also, prefer returning concrete types over interfaces whenever possible: i.e. export RoundRobinConnPool and return it instead of grpcClientConn.
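Putting those two suggestions together, a self-contained sketch might look like this. The `Ping` method and the `RoundRobinConnPool` body here are placeholders standing in for the real `grpc.ClientConnInterface` methods and pool implementation:

```go
package main

import "fmt"

// grpcClientConn mirrors the interface suggested above; Ping stands in
// for grpc.ClientConnInterface's Invoke/NewStream in this sketch.
type grpcClientConn interface {
	Ping() string
	Close() error
}

// RoundRobinConnPool is a placeholder for the real pool type.
type RoundRobinConnPool struct{}

func (p *RoundRobinConnPool) Ping() string { return "pong" }
func (p *RoundRobinConnPool) Close() error { return nil }

// Compile-time check that the concrete type satisfies the interface.
var _ grpcClientConn = (*RoundRobinConnPool)(nil)

// NewRoundRobinConnPool returns the concrete type, not the interface,
// so callers keep access to any extra methods without type assertions.
func NewRoundRobinConnPool() *RoundRobinConnPool { return &RoundRobinConnPool{} }

func main() {
	p := NewRoundRobinConnPool()
	fmt.Println(p.Ping()) // pong
	_ = p.Close()
}
```

Returning the concrete type costs nothing (callers can still assign it to the interface) and removes the need for the Close type assertion mentioned above.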

dfawley avatar Apr 24 '24 17:04 dfawley