Remove the dependency on grpc-go's experimental API (e.g. resolver & balancer)
What would you like to be added?
etcd is depending on some experimental APIs of grpc-go, such as resolver. Once these APIs change in future, it might break etcd. So we should try to remove the dependency on any experimental APIs; instead, let's try to use stable APIs.
References:
- https://github.com/grpc/grpc-go/pull/5748#issuecomment-1385964654
- https://github.com/grpc/grpc-go/pull/5748#issuecomment-1387426476
- https://github.com/etcd-io/etcd/issues/19700#issuecomment-2779240534
We should have
- a summary on existing design/implementation first
- proposal(s) to replace them, and with pros & cons. invite grpc's experts to review.
- evaluate the impact on etcd itself and users of etcd client sdk.
Mitigations:
- https://github.com/etcd-io/etcd/pull/19780
- https://github.com/etcd-io/etcd/issues/19706
- https://github.com/etcd-io/etcd/pull/19785
- https://github.com/etcd-io/etcd/pull/19825
- mark the gRPC naming and discovery feature as experimental, see also https://github.com/etcd-io/etcd/issues/19700#issuecomment-2821703505
- https://github.com/etcd-io/website/issues/990
Long-term solution?
- use the GCP gRPC connection pool ?
- https://github.com/etcd-io/etcd/issues/19700#issuecomment-2815923497
Related followup:
-
An endpoint may have multiple addresses: i.e. IPv4 and IPv6. Currently we only assume each endpoint only has one address
- https://github.com/etcd-io/etcd/issues/19700#issuecomment-2819173767
- https://github.com/etcd-io/etcd/pull/19780
-
https://github.com/etcd-io/etcd/issues/19772
Why is this needed?
Improve stability
grpc-go TL here. I'm happy to work with you to find / provide a stable solution, but I would first need to know more about your use cases & requirements.
just to randomly add what I've found in https://github.com/etcd-io/etcd/issues/14501 - once we reconsider the balancer/resolver, we should also add some circuit breaking. There were some ideas by Easwar in https://github.com/grpc/grpc-go/issues/5672.
Sorry for the late response.
The use cases & requirement seems to be simple. Users may specify multiple endpoints on the client side (see example below), and etcd clientv3 needs to support client side load balancing.
$ ./etcdctl --endpoints=http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379 get k1
Currently the target used when calling grpc.DialContext is a string in format etcd-endpoints://<pointer>/<authority>, e.g. etcd-endpoints://0xc00036e1e0/127.0.0.1:2379. etcd implements a customized resolver , which is passed to gRPC via grpc.WithResolvers. The loadBalancingPolicy is round_robin.
Some historical background:
- Previously etcd 3.4 implemented a custom load balancer, and it was changed to gRPC's Resolver and load balancing in 3.5.0 in https://github.com/etcd-io/etcd/pull/12671.
- But unfortunately https://github.com/etcd-io/etcd/pull/12671 introduced a critical issue https://github.com/etcd-io/etcd/issues/13192 due to invalid
:authorityheader, which was fixed in https://github.com/etcd-io/etcd/pull/13359
Based on https://github.com/grpc/grpc/blob/master/doc/naming.md, it seems the gRPC resolver plugin (although the API is still experimental) is still the right direction? If not, could you please advise the correct approach? cc @dfawley
@dfawley could you advise how to replace the experimental API mentioned above?
One more dependency on experimental API: grpcServer.ServeHTTP
Also sorry for the late rely
Based on grpc/grpc@master/doc/naming.md, it seems the gRPC resolver plugin (although the API is still experimental) is still the right direction?
For this you could instead create (N) gRPC clients, one for each address, and do round robin dispatch to them by wrapping them in another object that implements its own grpc.ClientConnInterface and has a pool of the grpc.ClientConns created. This is what is done by Google's cloud client libraries to spread load across multiple connections.
What would be your concerns with such an approach?
What would be your concerns with such an approach?
Thanks for the response. The only concern is what (etcd or gRPC?) is the best place to implement it. It looks like a generic client side load balancer solution, does it make more sense to get it wrapped in grpc/grpc-go instead of etcd?
Also some APIs (such as resolver.Resolve, balancer.Balancer , etc) were added as experimental APIs about 6 years ago, they are still experimental APIs today. Have you considered to graduate them to stable APIs or deprecate/remove them and provide alternative solution/APIs?
cc @mitake @ptabor @serathius @spzala @liggitt @wojtek-t @lavalamp for opinions.
I was just digging around and saw that C-core has the following name resolver (https://github.com/grpc/grpc/blob/master/doc/naming.md#name-syntax):
ipv4:address[:port][,address[:port],...]-- IPv4 addresses
- Can specify multiple comma-delimited addresses of the form
address[:port]:
addressis the IPv4 address to use.portis the port to use. If not specified, 443 is used.
We could also explore making this a first-class name resolver. Would that take care of all of your need for resolver and balancer APIs?
We could also explore making this a first-class name resolver. Would that take care of all of your need for
resolverandbalancerAPIs?
My immediate feeling is it looks good. Could you please provide a rough draft API and usage, so that I can evaluate the change & effort in etcd side?
One aspect to be aware of is also support for DNS resolution:
I think that currently such connection works:
./etcdctl --endpoints=http://etcd0.domain:2379,http://etcd1.domain:2379,http://etcd2.domain:2379 get k1
And in case of problems the DNS name get's re-resolved to IP.
Fallback to ipv4:address[:port][,address[:port],...] sound like doing the DNS binding once. In some dynamically scheduled environment the etcd instance can be rescheduled and name be rebound. This would be use-case that we are explicitly dropping.
Alternative is to build in within an etcd client a local xDS policy supplier.
Yeah I was digging a little deeper and it seems you need more than multiple IP address support; you also need DNS names and multiple unix sockets, and mixed versions of all these things.
We have one project on the horizon that is going to require changes to the resolver API, and then I think we will be able to stabilize it. Unfortunately, that will first require a breaking change, which is unavoidable no matter what. In the meantime, I think the only real option would be to maintain a pool of channels, which I agree isn't trivial but also isn't something we probably would put in the main grpc-go repo, and we don't have time to implement for now. You can find another minimal implementation here: https://github.com/googleapis/google-api-go-client/blob/3f4d679d4dae5de7ce124a7db68ef5147c9a3fa2/transport/grpc/pool.go. Ideally it would keep track of the state of the ClientConns and skip the non-READY ones, but that may not be a strict requirement for you either.
Thanks @dfawley for the feedback.
We have one project on the horizon that is going to require changes to the resolver API, and then I think we will be able to stabilize it.
Could you provide the rough timeline on when the resolve API can be stabilized, and which grpc-go version?
I think the only real option would be to maintain a pool of channels, which I agree isn't trivial but also isn't something we probably would put in the main grpc-go repo, and we don't have time to implement for now.
I think we need to do some investigation & PoC firstly. I do not have time to dig into it for now. So labelled this issue with "help wanted".
Hi @ahrtr I am interested in this and will provide a minimal working example/POC soon. After that, I think we can go through some any necessary processes like design, collect must to have / nice to have requirements, test verification as success criteria before the implementation, production readiness check etc before the actual migration.
For example:
Must to have
- [ ] Supports Unix & DNS & IPV4 & IPV6 with round robin similar load balancing client, initially it is insecure.
- [ ] grpc-go supports DNS resolver (by default), unix socket resolver (only for testing purpose) and accept ipv4 and ipv6 targets with passthrough resolver.
- [ ] Keep track of the state of the ClientConns and skip the non-READY ones.
- [ ] E2E test the above scenarios.
- [ ] Supports secure connection with TLS
- [ ] The new etcd client should have at least the same performance as the old one. Measuring by etcdctl check perf command, etc.
- [ ] Existing e2e test cases like TestAuthority won’t fail unexpectedly.
- [ ] Adding new test cases that current test suite missed.
Nice to have
- [ ] Replace all of clientv3/retry.go with RetryPolicy
- [ ] circuit breaking on slowest endpoint, not just not READY connections https://github.com/etcd-io/etcd/issues/15145#issuecomment-1413656019
- [ ] lowest latency affinity load balancing algorithm feature request https://github.com/etcd-io/etcd/issues/15918
Thanks @chaochn47 . Assigned to you.
2023-06-08 update:
I took a little deeper look today and found out that the generated gRPC API is outdated most likely because of https://github.com/etcd-io/etcd/issues/14533 https://github.com/etcd-io/etcd/blob/b4f49a55a5aba85a30cae5742a87989132a0f750/api/etcdserverpb/rpc.pb.go#L6509-L6520 NewKVClient(cc grpc.ClientConn) could have been written with NewKVClient(cc grpc.ClientConnInterface) so that the proposed ConnPool interface and its implementation roundRobinConnPool can be passed into it for customization.
This latter NewKVClient(cc grpc.ClientConnInterface) has been widely used in google-cloud-go repository, for example the spanner client.
So I can start looking into rpc.proto and use latest version of protoc go plugin to re-generate rpc.pb.go
Just discovered https://github.com/etcd-io/etcd/blob/9e1abbab6e4ba2886238f49b0d48fc19a546c7cf/client/v3/credentials/credentials.go#L37 is another gRPC experimental API. However, I don't see why we need this API. etcd does not support mode switching from transport security to per RPC credentials (auth) or backwards from my understanding. It should be a easy clean up.
Could you provide the rough timeline on when the resolve API can be stabilized, and which grpc-go version?
https://github.com/grpc/grpc-go/issues/6472
The breaking changes are upcoming (within ~6 months?), but I'm pretty sure we can find a way to enable a migration path that doesn't break suddenly in a single version, and provide a couple releases of overlap. Full details haven't been planned, but some changes are already underway (pending PRs: https://github.com/grpc/grpc-go/pull/6471 and https://github.com/grpc/grpc-go/pull/6481)
I took a little deeper look today and found out that the generated gRPC API is outdated most likely because of #14533
We need to bump both protobuf and grpc go plugin eventually. But is this task blocked by it?
EDIT: just noticed that your following comment(pasted below). Indeed, the outdated grpc go plugin has some impact on this task. But it isn't a big problem, we can also change grpc.ClientConnInterface to grpc.ClientConn for now to workaround it.
NewKVClient(cc grpc.ClientConn) could have been written with NewKVClient(cc grpc.ClientConnInterface) so that the proposed ConnPool interface and its implementation roundRobinConnPool can be passed into it for customization.
The breaking changes are upcoming (within ~6 months?), but I'm pretty sure we can find a way to enable a migration path that doesn't break suddenly in a single version, and provide a couple releases of overlap. Full details haven't been planned, but some changes are already underway (pending PRs: grpc/grpc-go#6471 and grpc/grpc-go#6481)
Thanks @dfawley for the feedback. It seems that the already planed items in https://github.com/grpc/grpc-go/issues/6472 will not impact etcd at all. But as you mentioned Full details haven't been planned, so in theory, there is still some potential risks.
But etcd's use of the grpc-go's resolver package is really very simple and basic, the golang source file has just dozens of lines of code. Please also see https://github.com/etcd-io/etcd/issues/15145#issuecomment-1436554664.
Each time when there is any change (e.g. adding or removing server, or update server's address) on the backend servers, etcd client will construct a new resolver.State object which contains the latest list of server addresses, and then call (*Resolver) UpdateState. Do you still think https://github.com/grpc/grpc-go/issues/6472 might break it in future?
You can find another minimal implementation here: https://github.com/googleapis/google-api-go-client/blob/3f4d679d4dae5de7ce124a7db68ef5147c9a3fa2/transport/grpc/pool.go. Ideally it would keep track of the state of the ClientConns and skip the non-READY ones, but that may not be a strict requirement for you either.
It's non-trivial effort. If we follow this approach, then we might want to create a separate small project something like github.com/etcd-io/grpc-client-load-balancer or github.com/etcd-io/go-client-load-balancer, so as not to complicate or pollute etcd client's implementation. But as I mentioned in https://github.com/etcd-io/etcd/issues/15145#issuecomment-1491160332, I really think it's unfortunate if we have to spend too much effort to do it ourselves.
Also reference:
- https://grpc.io/blog/grpc-load-balancing/
Just discovered
https://github.com/etcd-io/etcd/blob/9e1abbab6e4ba2886238f49b0d48fc19a546c7cf/client/v3/credentials/credentials.go#L37
is another gRPC experimental API. However, I don't see why we need this API. etcd does not support mode switching from transport security to per RPC credentials (auth) or backwards from my understanding. It should be a easy clean up.
Good catch. Just raised a PR to remove it. https://github.com/etcd-io/etcd/pull/16358
Agreed with @ahrtr's point. I think the essential etcd usage of this experimental resolver API is built on top of manual resolver. It's quite simple. gRPC-go should be able to support in my humble opinion. (Even though it is marked as experimental indicating it's subject to change in the future).
There might be other experimental usages like WithResolvers() but etcd should also be able to migrate to register the resolver builder with resolver.Register just like the default DNS resolver.
We need to bump both protobuf and grpc go plugin eventually. But is this task blocked by it?
No, I am just hesitant if we go down in this route proposed in https://github.com/etcd-io/etcd/issues/15145#issuecomment-1499550450 as I am implementing, it is essentially re-invent what gRPC-go implements between ClientConn and SubConn about state synchronization, health checking, etc..
Endpoints was added into State in https://github.com/grpc/grpc-go/pull/6471, and Addresses will be deprecated and replaced fully by Endpoints, so etcd should also make change to our customized resolver accordingly.
Since the gRPC resolver package is still experimental, so we don't need to rush to make the change for now.
Hi @ahrtr Based on recent PoC of gRPC LB health check and outlier detection balancer integration with etcd client, I propose not to remove the usage of gRPC resolver & balancer.
just to randomly add what I've found in https://github.com/etcd-io/etcd/issues/14501 - once we reconsider the balancer/resolver, we should also add some circuit breaking. There were some ideas by Easwar in https://github.com/grpc/grpc-go/issues/5672.
I favor to keep the opportunity of improving etcd client with better availability while keeping an eye on the gRPC API changes.
Note there is an ongoing project in gRPC-go to Promote Resolver/Balancer API to Stable
I propose not to remove the usage of gRPC resolver & balancer.
Yes, just as I mentioned in https://github.com/etcd-io/etcd/issues/15145#issuecomment-1662486867, etcd's usege of the grpc-go's resolver package is really very simple and basic; and it's non-trivial effort to re-invent the wheel. So there is no plan to remove the gRPC resolver & balancer.
But we need to followup the grpc's change, and may need to make change something like https://github.com/etcd-io/etcd/issues/15145#issuecomment-1717168948
But we need to followup the grpc's change, and may need to make change something like https://github.com/etcd-io/etcd/issues/15145#issuecomment-1717168948
+100 Totally agree
Another deprecated API usage.
gRPC v1.44.0 change log, no behavior changes. But it reveals that we should replace all grpc.WithInsecure() with grpc.WithTransportCredentials(insecure.NewCredentials()) based on https://github.com/grpc/grpc-go/pull/5068 in 3.4/3.5/main
gRPC v1.44.0 change log, no behavior changes. But it reveals that we should replace all grpc.WithInsecure() with grpc.WithTransportCredentials(insecure.NewCredentials()) based on grpc/grpc-go#5068 in 3.4/3.5/main
Also consider that you shouldn't use "insecure" credentials at all. Tests should be okay, though they should also be OK to use local credentials. Production code should probably also prefer local credentials (or TLS/etc), which verifies that the address is local and not over a network: https://pkg.go.dev/google.golang.org/[email protected]/credentials/local#NewCredentials
Yes, this is unfortunately also experimental. I believe we can promote it to stable, as I'm not aware of any issues.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
This is still very relevant for us and for your users.