Resource is constantly re-sent if NACKed
In gRPC-Go, we use the xDS management server from go-control-plane for a lot of our tests.
The way we set things up is as follows:
- We create a snapshot cache here: https://github.com/grpc/grpc-go/blob/c9caa9ed53d77155648ba8bbf2ae24b6f4fe19db/internal/testutils/xds/e2e/server.go#L139C2-L139C7
- We create the server here (and pass it the above cache): https://github.com/grpc/grpc-go/blob/c9caa9ed53d77155648ba8bbf2ae24b6f4fe19db/internal/testutils/xds/e2e/server.go#L163
- We start serving here: https://github.com/grpc/grpc-go/blob/c9caa9ed53d77155648ba8bbf2ae24b6f4fe19db/internal/testutils/xds/e2e/server.go#L185
From our tests, when we have to configure resources on the management server:
- We create a new snapshot with the resources here: https://github.com/grpc/grpc-go/blob/c9caa9ed53d77155648ba8bbf2ae24b6f4fe19db/internal/testutils/xds/e2e/server.go#L219
- We set the snapshot on the server here: https://github.com/grpc/grpc-go/blob/c9caa9ed53d77155648ba8bbf2ae24b6f4fe19db/internal/testutils/xds/e2e/server.go#L232
We have noticed that when the xDS client on our side NACKs a resource from the management server, it resends it immediately, which is NACKed again by the xDS client, and this loop continues forever.
Is there something wrong with the way we use the xDS management server?
From an xDS transport protocol perspective, I don't think it's useful for the xDS server to re-send a resource when the client NACKs it. If the client NACKs it, that means that the resource is inherently invalid, and re-sending it without modifying its contents is not going to change that -- the client will just NACK again.
I think it's useful for the xDS server to have a mechanism for reporting when clients NACK, since that is useful information. But it should not affect the actual logic of the xDS server code otherwise. The server should not treat a NACK any differently than an ACK: in either case, it should not re-send until it gets new contents for the resource.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Can one of the owners/maintainers please take a look? Thanks.
@markdroth While I agree, this is how the managed control planes work, too (e.g. Istio, TD). It used to handle a legitimate "eventual consistency" use case where parts of config are pushed outside xDS (e.g. some cert file or Wasm module). There's no true rollback. What can help is putting a rate limiter, which you can configure on the client side in Envoy "ads" configuration.
Sorry for the delayed reply. As mentioned by @kyessenov, we have use-cases when NACK do succeed on retry (the main one we've seen is listeners failing to bind at first then succeeding on retry). While those use-cases are for sure not great, this change of behavior can be impactful for users. I'm not against changing to another model, but we'd have to be careful on the user impact
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.