go-control-plane Resource is constantly re-sent if NACKed

In gRPC-Go, we use the xDS management server from go-control-plane for a lot of our tests.

The way we set things up is as follows:

We create a snapshot cache here: https://github.com/grpc/grpc-go/blob/c9caa9ed53d77155648ba8bbf2ae24b6f4fe19db/internal/testutils/xds/e2e/server.go#L139C2-L139C7
We create the server here (and pass it the above cache): https://github.com/grpc/grpc-go/blob/c9caa9ed53d77155648ba8bbf2ae24b6f4fe19db/internal/testutils/xds/e2e/server.go#L163
We start serving here: https://github.com/grpc/grpc-go/blob/c9caa9ed53d77155648ba8bbf2ae24b6f4fe19db/internal/testutils/xds/e2e/server.go#L185

From our tests, when we have to configure resources on the management server:

We create a new snapshot with the resources here: https://github.com/grpc/grpc-go/blob/c9caa9ed53d77155648ba8bbf2ae24b6f4fe19db/internal/testutils/xds/e2e/server.go#L219
We set the snapshot on the server here: https://github.com/grpc/grpc-go/blob/c9caa9ed53d77155648ba8bbf2ae24b6f4fe19db/internal/testutils/xds/e2e/server.go#L232

We have noticed that when the xDS client on our side NACKs a resource from the management server, it resends it immediately, which is NACKed again by the xDS client, and this loop continues forever.

Is there something wrong with the way we use the xDS management server?

Jul 01 '24 21:07 easwars

From an xDS transport protocol perspective, I don't think it's useful for the xDS server to re-send a resource when the client NACKs it. If the client NACKs it, that means that the resource is inherently invalid, and re-sending it without modifying its contents is not going to change that -- the client will just NACK again.

I think it's useful for the xDS server to have a mechanism for reporting when clients NACK, since that is useful information. But it should not affect the actual logic of the xDS server code otherwise. The server should not treat a NACK any differently than an ACK: in either case, it should not re-send until it gets new contents for the resource.

Jul 01 '24 22:07 markdroth

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

Aug 01 '24 00:08 github-actions[bot]

Can one of the owners/maintainers please take a look? Thanks.

Aug 01 '24 18:08 easwars

@markdroth While I agree, this is how the managed control planes work, too (e.g. Istio, TD). It used to handle a legitimate "eventual consistency" use case where parts of config are pushed outside xDS (e.g. some cert file or Wasm module). There's no true rollback. What can help is putting a rate limiter, which you can configure on the client side in Envoy "ads" configuration.

Aug 01 '24 20:08 kyessenov

Sorry for the delayed reply. As mentioned by @kyessenov, we have use-cases when NACK do succeed on retry (the main one we've seen is listeners failing to bind at first then succeeding on retry). While those use-cases are for sure not great, this change of behavior can be impactful for users. I'm not against changing to another model, but we'd have to be careful on the user impact

Aug 01 '24 20:08 valerian-roche

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

Sep 01 '24 00:09 github-actions[bot]

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

Sep 08 '24 04:09 github-actions[bot]