kuma
KDS delta sometimes drops resource kinds for a few seconds
What happened?
Sometimes, when a zone's connection to the global CP is dropped, KDS detects that some resources have disappeared. Once the zone connection is re-established, the resources are seen as existing again. In the meantime, however, KDS tells the other zones that those resources have been deleted, so the zone CPs delete them from their own databases.
For example:
- Stream cp-global <-> cp-zone1 is destroyed
- Stream cp-global <-> cp-zone2: CP global noticed that Secrets X Y Z have been destroyed
- Stream cp-global <-> cp-zone1 is back up
- Stream cp-global <-> cp-zone2: CP global noticed that Secrets X Y Z have been created
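To illustrate why this sequence is harmful under a delta-style sync, here is a minimal sketch. It is not Kuma's KDS code: `Snapshot`, `Delta`, and `Diff` are hypothetical names, and the assumption that the global CP briefly computes its view without some resources is a guess, not something established in this issue. It only shows that a transiently empty view yields a removal followed by a re-creation.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot maps a resource name to its version.
// Hypothetical type for this sketch; not Kuma's actual KDS data structure.
type Snapshot map[string]string

// Delta is what a delta-style sync would send to a zone (hypothetical).
type Delta struct {
	Upserted []string
	Removed  []string
}

// Diff compares the snapshot last sent to a zone with the current one.
func Diff(prev, curr Snapshot) Delta {
	var d Delta
	for name, ver := range curr {
		if old, ok := prev[name]; !ok || old != ver {
			d.Upserted = append(d.Upserted, name)
		}
	}
	for name := range prev {
		if _, ok := curr[name]; !ok {
			d.Removed = append(d.Removed, name)
		}
	}
	// Sort for deterministic output; map iteration order is random in Go.
	sort.Strings(d.Upserted)
	sort.Strings(d.Removed)
	return d
}

func main() {
	stable := Snapshot{"secret-x": "1", "secret-y": "1", "secret-z": "1"}
	transient := Snapshot{} // assumed: resources momentarily missing from the view

	// Second step of the example: removals are sent to cp-zone2.
	fmt.Printf("%+v\n", Diff(stable, transient))
	// Fourth step: the same resources reappear and are sent as upserts.
	fmt.Printf("%+v\n", Diff(transient, stable))
}
```

Between the two diffs, a zone that applied the first delta would have deleted the Secrets locally, which matches the symptom described above.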
Here are some logs. We have 6 zones:
- was1 disconnects at 13:46:30
- KDS logs stream cancelled at 13:46:31
- At 13:46:32, KDS detects changes to Mesh in zones fra1 and sin1 (!) although nothing has changed
- was1 reconnects at 13:46:34
- I didn't include it in the logs below, but KDS then re-detects changes to Mesh for fra1 and sin1
2024-02-22T13:46:30.997Z INFO kds-delta-client ZoneToGlobalSync rpc stream stopped {"clientID": "was1"}
2024-02-22T13:46:30.997Z INFO kds-delta-client GlobalToZoneSync rpc stream stopped {"clientID": "was1"}
2024-02-22T13:46:31.000Z INFO kds-service stream cancelled {"rpc": "Stats", "clientID": "was1"}
2024-02-22T13:46:31.000Z INFO kds-service stream cancelled {"rpc": "Clusters", "clientID": "was1"}
2024-02-22T13:46:31.000Z INFO kds-service stream cancelled {"rpc": "XDS Config Dump", "clientID": "was1"}
2024-02-22T13:46:32.552Z INFO kds-delta-global detected changes in the resources. Sending changes to the client. {"streamID": 12, "nodeID": "fra1", "resourceType": "Mesh", "client": "fra1"}
2024-02-22T13:46:32.556Z INFO kds-delta-global detected changes in the resources. Sending changes to the client. {"streamID": 9, "nodeID": "sin1", "resourceType": "Mesh", "client": "sin1"}
2024-02-22T13:46:34.713Z INFO kds-service Envoy Admin RPC stream started {"rpc": "Stats", "clientID": "was1"}
2024-02-22T13:46:34.814Z INFO kds-delta-global Global To Zone new session created {"peer-id": "was1"}
2024-02-22T13:46:34.814Z INFO kds-service Envoy Admin RPC stream started {"rpc": "Clusters", "clientID": "was1"}
2024-02-22T13:46:34.917Z INFO kds-service Envoy Admin RPC stream started {"rpc": "XDS Config Dump", "clientID": "was1"}
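For anyone trying to confirm the same pattern in their own global CP logs, the correlation above could be checked with a small script. The following is a rough sketch that assumes the log line shape shown above (RFC 3339 timestamp, message text, trailing JSON fields). `FindSuspects`, `Suspect`, and the time window are hypothetical names and choices, not part of Kuma.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// fields holds the JSON keys seen in the log lines above.
type fields struct {
	ClientID     string `json:"clientID"`
	Client       string `json:"client"`
	ResourceType string `json:"resourceType"`
}

// Suspect is a "detected changes" event for one zone that follows
// shortly after a different zone's stream stopped (hypothetical type).
type Suspect struct {
	DisconnectedZone string
	AffectedZone     string
	ResourceType     string
	After            time.Duration
}

// FindSuspects scans log lines in order and reports change detections
// that occur within window after another zone disconnected.
func FindSuspects(lines []string, window time.Duration) []Suspect {
	var out []Suspect
	lastStop := map[string]time.Time{} // zone -> last time its stream stopped
	for _, line := range lines {
		ts, rest, ok := strings.Cut(line, " ")
		if !ok {
			continue
		}
		at, err := time.Parse(time.RFC3339Nano, ts)
		if err != nil {
			continue // not a line in the expected format
		}
		i := strings.Index(rest, "{")
		if i < 0 {
			continue
		}
		var f fields
		if json.Unmarshal([]byte(rest[i:]), &f) != nil {
			continue
		}
		switch {
		case strings.Contains(rest, "stream stopped") || strings.Contains(rest, "stream cancelled"):
			if f.ClientID != "" {
				lastStop[f.ClientID] = at
			}
		case strings.Contains(rest, "detected changes in the resources"):
			for zone, stopped := range lastStop {
				if d := at.Sub(stopped); zone != f.Client && d >= 0 && d <= window {
					out = append(out, Suspect{zone, f.Client, f.ResourceType, d})
				}
			}
		}
	}
	return out
}

func main() {
	logs := []string{
		`2024-02-22T13:46:31.000Z INFO kds-service stream cancelled {"rpc": "Stats", "clientID": "was1"}`,
		`2024-02-22T13:46:32.552Z INFO kds-delta-global detected changes in the resources. Sending changes to the client. {"streamID": 12, "nodeID": "fra1", "resourceType": "Mesh", "client": "fra1"}`,
	}
	for _, s := range FindSuspects(logs, 5*time.Second) {
		fmt.Printf("%s %s changed %v after %s disconnected\n",
			s.AffectedZone, s.ResourceType, s.After, s.DisconnectedZone)
	}
}
```

A match is only a hint: legitimate resource changes that happen to land inside the window will be flagged too, so the window should be kept short.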
More details and logs here: https://kuma-mesh.slack.com/archives/CN2GN4HE1/p1708717249211629
Kuma version: 2.5.x
@jakubdyszkiewicz didn't you mention a recent fix that may fix this?
It may be related, but not necessarily. We had a problem where a NACK was only retried once. Here is the PR: https://github.com/kumahq/kuma/pull/9736
xref https://github.com/kumahq/kuma/pull/10315
Triage: we were not able to reproduce this in 2.6.x. There have been changes in KDS that could potentially help, so please try the newest version. A minimal repro would be useful. Please let us know if this happens with 2.6.x; we can reopen if needed.