envoy icon indicating copy to clipboard operation
envoy copied to clipboard

[delta-XDS] delta-XDS state tracking is broken when there is an exception thrown when watchmap loading the config.

Open stevenzzzz opened this issue 3 years ago • 5 comments

Title: state tracking is broken when there is a partial accept

Description: if there is a partial accept in Delta-XDS, i.e. route-table #50's loading throws an exception, out of 100 routeConfigurations in the batch, the RDS state is updated with the other 50 route tables not yet updated/inserted.

Even reconnection won't fix this issue, as the client believes it has the 50 not loaded route tables with some version already.

  {
    const auto scoped_update = ttl_.scopedTtlUpdate();
    if (requested_resource_state_.contains(Wildcard)) {
      for (const auto& resource : message.resources()) {
        addResourceStateFromServer(resource);
      }
    } else {
      // We are not subscribed to wildcard, so we only take resources that we explicitly requested
      // and ignore the others.
      for (const auto& resource : message.resources()) {
        if (requested_resource_state_.contains(resource.name())) {
          addResourceStateFromServer(resource);
        }
      }
    }
  }

  watch_map_.onConfigUpdate(non_heartbeat_resources, message.removed_resources(),
                            message.system_version_info());

Repro steps: See description.

stevenzzzz avatar Apr 12 '22 16:04 stevenzzzz

/cc @yanavlasov Please add a no-stale-bot to this issue.

stevenzzzz avatar Apr 12 '22 16:04 stevenzzzz

One way to resolve this would be tracking state of resources using the actual per-resource subscriptions, with the state dictionary taking resource_name->reference to the source of truth. Requires quite some plumbing tho.

stevenzzzz avatar Apr 12 '22 16:04 stevenzzzz

cc @adisuissa

adisuissa avatar Apr 12 '22 16:04 adisuissa

Envoy should probably delay the update of the tracked resource version (for non-wildcard resource and for wildcard resource) until after the call to the watch_map_.onConfigUpdate() method.

adisuissa avatar Sep 14 '22 19:09 adisuissa

Envoy should probably delay the update of the tracked resource version (for non-wildcard resource and for wildcard resource) until after the call to the watch_map_.onConfigUpdate() method.

Note that this impacts a reconnect scenario, as Envoy will send an initial_resource_version with resources and versions that may have been NACK'ed.

adisuissa avatar Sep 14 '22 19:09 adisuissa