envoy icon indicating copy to clipboard operation
envoy copied to clipboard

delta-xds: perform resource state tracking after resource ingestion

Open adisuissa opened this issue 3 years ago • 3 comments

Commit Message: delta-xds: perform resource state tracking after resource ingestion Additional Description: Fixes delta-xDS protocol resource state tracking. Prior to this PR, the implementation updated the state of a resource (tracked version, TTL), even if the fetched config was NACKed. This PR changes it to only update the tracked state after a resource was successfully ingested, and an ACK will be sent. Note that this PR impacts the TTL of a resource, as now TTL timer starts after the resource was ingested rather than before.

Risk Level: Medium for delta-xDS usage. Testing: Added unit tests. Docs Changes: N/A. Release Notes: Added release note. Platform Specific Features: N/A. Runtime guard: Can be reverted by setting envoy.reloadable_features.delta_xds_subscription_state_tracking_fix to false. Fixes #20790

Signed-off-by: Adi Suissa-Peleg [email protected]

adisuissa avatar Sep 22 '22 19:09 adisuissa

/assign @htuch One thing to consider whether TTL bookkeeping (starting a resource TTL timer when it is ingested) should be done before or after the resource was successfully ingested.

The TTL consideration is similar about SotW, and would like your opinion before changing that as well.

adisuissa avatar Sep 22 '22 20:09 adisuissa

Thanks for working on this.

There is still a risk tho it's much mitigated:

if there is an exception fired in the middle of accepting 1000 resources, say 500 has been updated(v0 --> V1), the 501th RouteConfiguration is rejected. Now the consumers would have v1 for the first 500 customers, but that will not be reflected in the resourceState dictionary.

The root issue is that the resourceState dict is out of sync with each individual resource's "consumer" (RDSConfigProviderImpl for example). I think instead we could possibly update the per-resource-state by calling from each resource consumer's onConfigUpdateFailed/(add the onConfigUpdateSucceeded).

Which requires quite a bit plumbing.

stevenzzzz avatar Sep 22 '22 21:09 stevenzzzz

Thanks for working on this.

There is still a risk tho it's much mitigated:

if there is an exception fired in the middle of accepting 1000 resources, say 500 has been updated(v0 --> V1), the 501th RouteConfiguration is rejected. Now the consumers would have v1 for the first 500 customers, but that will not be reflected in the resourceState dictionary.

The root issue is that the resourceState dict is out of sync with each individual resource's "consumer" (RDSConfigProviderImpl for example). I think instead we could possibly update the per-resource-state by calling from each resource consumer's onConfigUpdateFailed/(add the onConfigUpdateSucceeded).

Which requires quite a bit plumbing.

The right way to fix this is probably to make the update “transactional” or have a per-resource ACK/NACK. Otherwise Envoy may NACK a config, making the server believe that none of the resources are used, but will partially apply the config.

The current fix should address the issue of custom config validators rejecting the config, and the state tracking that is described in #20790.

adisuissa avatar Sep 22 '22 23:09 adisuissa

Ping @htuch for review

yanavlasov avatar Sep 27 '22 14:09 yanavlasov

This makes sense to me. The broader issue that @stevenzzzz points to is not really about TTL tracking but how xDS resource acceptance works in relation to ACK/NACK, so I think let's track that in a separate issue.

htuch avatar Sep 30 '22 01:09 htuch