KDS delta sometimes drops resource kinds for a few seconds

Open nicoche opened this issue 1 year ago • 2 comments

What happened?

Sometimes, when a zone's connection to the global CP is destroyed, KDS detects that some resources have disappeared. Once the zone connection is re-established, the resources are seen as existing again. In the meantime, however, KDS tells the other zones that those resources have been deleted, so the zonal CPs delete them from their own databases.

For example:

  • Stream cp-global <-> cp-zone1 is destroyed
  • Stream cp-global <-> cp-zone2: the global CP notices that Secrets X, Y, Z have been destroyed
  • Stream cp-global <-> cp-zone1 is back up
  • Stream cp-global <-> cp-zone2: the global CP notices that Secrets X, Y, Z have been created
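
One way to picture the sequence above is a toy model of delta computation. This is only an illustrative sketch, not Kuma's KDS code: the `delta` function and the resource names are hypothetical, and the assumption that the snapshot computed for cp-zone2 briefly loses the Secrets is exactly the unexplained behavior reported in this issue.

```python
def delta(previous, current):
    """Return (removed, added) resource names between two snapshots."""
    removed = sorted(previous - current)
    added = sorted(current - previous)
    return removed, added


# Hypothetical snapshots of what the global CP would sync to cp-zone2.
full = {"mesh-default", "secret-x", "secret-y", "secret-z"}
# Assumption: the Secrets are briefly missing while cp-zone1 is disconnected.
transient = {"mesh-default"}

# Step 2 of the list: cp-zone2 is told the Secrets were removed.
print(delta(full, transient))
# Step 4 of the list: cp-zone2 is told the Secrets were created again.
print(delta(transient, full))
```

In this model a single incomplete snapshot is enough to produce a delete followed by a create, which matches what the zonal CPs observe.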

Here are some logs. We have 6 zones.

  • was1 disconnects at 13:46:30
  • KDS logs "stream cancelled" at 13:46:31
  • KDS detects changes to Mesh in zones fra1 and sin1 (!) at 13:46:32, although nothing has changed
  • was1 reconnects at 13:46:34
  • Not included in the logs below: KDS later re-detects changes to Mesh for fra1 and sin1

2024-02-22T13:46:30.997Z        INFO    kds-delta-client        ZoneToGlobalSync rpc stream stopped     {"clientID": "was1"}
2024-02-22T13:46:30.997Z        INFO    kds-delta-client        GlobalToZoneSync rpc stream stopped     {"clientID": "was1"}
2024-02-22T13:46:31.000Z        INFO    kds-service     stream cancelled        {"rpc": "Stats", "clientID": "was1"}
2024-02-22T13:46:31.000Z        INFO    kds-service     stream cancelled        {"rpc": "Clusters", "clientID": "was1"}
2024-02-22T13:46:31.000Z        INFO    kds-service     stream cancelled        {"rpc": "XDS Config Dump", "clientID": "was1"}
2024-02-22T13:46:32.552Z        INFO    kds-delta-global        detected changes in the resources. Sending changes to the client.       {"streamID": 12, "nodeID": "fra1", "resourceType": "Mesh", "client": "fra1"}
2024-02-22T13:46:32.556Z        INFO    kds-delta-global        detected changes in the resources. Sending changes to the client.       {"streamID": 9, "nodeID": "sin1", "resourceType": "Mesh", "client": "sin1"}
2024-02-22T13:46:34.713Z        INFO    kds-service     Envoy Admin RPC stream started  {"rpc": "Stats", "clientID": "was1"}
2024-02-22T13:46:34.814Z        INFO    kds-delta-global        Global To Zone new session created      {"peer-id": "was1"}
2024-02-22T13:46:34.814Z        INFO    kds-service     Envoy Admin RPC stream started  {"rpc": "Clusters", "clientID": "was1"}
2024-02-22T13:46:34.917Z        INFO    kds-service     Envoy Admin RPC stream started  {"rpc": "XDS Config Dump", "clientID": "was1"}

logs-secret-destruction.txt
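
To look for the same pattern in a longer log file, a small scanner along these lines might help. It is a sketch under assumptions: the line layout (timestamp, level, logger, message, trailing JSON) and the message strings are taken from the excerpt above, while `parse`, `suspicious_changes` and the 5-second window are hypothetical choices, not part of Kuma.

```python
import json
import re
from datetime import datetime, timedelta

# Layout assumed from the excerpt: timestamp, level, logger, message, JSON fields.
LINE_RE = re.compile(r"^(\S+)\s+(\w+)\s+(\S+)\s+(.*?)\s+(\{.*\})\s*$")


def parse(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    ts, _level, logger, msg, fields = m.groups()
    return {
        "time": datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ"),
        "logger": logger,
        "msg": msg,
        "fields": json.loads(fields),
    }


def suspicious_changes(lines, window_seconds=5.0):
    """Find 'detected changes' events for zones other than one that just disconnected."""
    events = [e for e in map(parse, lines) if e is not None]
    window = timedelta(seconds=window_seconds)
    stops = [
        (e["time"], e["fields"].get("clientID"))
        for e in events
        if e["msg"].endswith("rpc stream stopped")
    ]
    hits = []
    for e in events:
        if not e["msg"].startswith("detected changes in the resources"):
            continue
        zone = e["fields"].get("client")
        for stop_time, stopped_zone in stops:
            elapsed = e["time"] - stop_time
            if zone != stopped_zone and timedelta(0) <= elapsed <= window:
                hits.append((stopped_zone, zone, e["fields"].get("resourceType"), e["time"]))
                break  # report each change event at most once
    return hits


# Two lines copied from the excerpt above.
sample = [
    '2024-02-22T13:46:30.997Z  INFO  kds-delta-client  ZoneToGlobalSync rpc stream stopped  {"clientID": "was1"}',
    '2024-02-22T13:46:32.552Z  INFO  kds-delta-global  detected changes in the resources. Sending changes to the client.  {"streamID": 12, "nodeID": "fra1", "resourceType": "Mesh", "client": "fra1"}',
]
for stopped, zone, kind, when in suspicious_changes(sample):
    print(f"{zone}: {kind} change {when.time()} shortly after {stopped} disconnected")
```

A hit does not prove a spurious deletion on its own, since a real change could land in the same window; it only narrows down where to look.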

More details and logs here: https://kuma-mesh.slack.com/archives/CN2GN4HE1/p1708717249211629

Kuma version: 2.5.x

nicoche, Feb 29 '24 14:02

@jakubdyszkiewicz didn't you mention a recent fix that may fix this?

lahabana, Apr 11 '24 13:04

It may be related, but not necessarily. We had a problem where we retried a NACK only once. Here is the PR: https://github.com/kumahq/kuma/pull/9736

jakubdyszkiewicz, Apr 15 '24 14:04

xref https://github.com/kumahq/kuma/pull/10315

jakubdyszkiewicz, May 27 '24 14:05

Triage: we were not able to reproduce this in 2.6.x. There have been changes in KDS that could help. Please try the newest version and let us know if this still happens with 2.6.x; a minimal repro would help. We can reopen if needed.

jakubdyszkiewicz, Jun 10 '24 14:06