redpanda
redpanda copied to clipboard
operator: move centralized-configuration Kuttl tests to main bucket.
Centralized configuration and related tests seem to be stable.
cc: @jcsp
What was the outcome of investigating the most recent failures of these? (https://redpandadata.slack.com/archives/C01H6JRQX1S/p1658740986316359?thread_ts=1658736130.531769&cid=C01H6JRQX1S)
If something is still up with them, let's fix it before we reinstate them.
k8s-unstable-tests had two failures in last 24h: FAIL test: k8s-unstable-tests.k8s-unstable-tests (2/24 runs) failure at 2022-08-02T09:37:39.714Z: None in job https://buildkite.com/redpanda/redpanda/builds/13453#01825dce-2bd3-4300-92a8-483046638bce failure at 2022-08-02T10:20:34.786Z: None in job https://buildkite.com/redpanda/redpanda/builds/13454#01825e11-5854-46fe-8699-4c617c6ce836
@nicolaferraro it would be good to get a signal on whether we have a bug here before we release 22.2
@jcsp the logs highlight a strange behavior of the configuration system in redpanda:
2022-08-02T09:32:34.479154062Z stderr F 2022-08-02T09:32:34.479Z INFO controllers.redpanda.Cluster Applying patch to the cluster configuration {"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration", "patch": "+append_chunk_size"}
2022-08-02T09:32:34.480731898Z stderr F 2022-08-02T09:32:34.480Z INFO controllers.redpanda.Cluster Patch written to the cluster {"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration", "config_version": 4}
2022-08-02T09:32:34.481066838Z stderr F 2022-08-02T09:32:34.481Z INFO controllers.redpanda.Cluster Centralized configuration hash has changed {"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.488032683Z stderr F 2022-08-02T09:32:34.487Z INFO controllers.redpanda.Cluster Node 0 restart status is false {"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.488039338Z stderr F 2022-08-02T09:32:34.487Z INFO controllers.redpanda.Cluster Node 1 restart status is false {"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.488041858Z stderr F 2022-08-02T09:32:34.488Z INFO controllers.redpanda.Cluster Node 0 is using config version 3 {"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.48804441Z stderr F 2022-08-02T09:32:34.488Z INFO controllers.redpanda.Cluster Node 1 is using config version 3 {"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
It's a 2 replicas cluster and the flow can be read as:
- Operator sets a different value for append_chuck_size, the cluster returns configuration version 4
- Operator then asks both nodes which configuration are they using and if they need restart, and they both say 3 and no need for restart
So, there might be some error in the way the query is performed, or redpanda changed the way to handle these cases of configuration changes. Wdyt?
The information about which config is every node using comes from /v1/cluster_config/status
, sent explicitly to the leader.
So, there might be some error in the way the query is performed, or redpanda changed the way to handle these cases of configuration changes. Wdyt?
Update of status is asynchronous: there is no guarantee that the version in the status will reflect the version in the response from the PUT. It's because status updates are themselves persistent writes to the controller log, separate to the write that updates the configuration. We could make the API a bit friendlier by waiting for status updates inside the PUT handler, but that would not be 100% reliable either because it's possible for controller to lose leadership between writing the config update and writing the status update.
For testing on existing code, the solution is to have a retry-wait for the status, rather than expecting it to be updated synchronously.
Making this a bit friendlier in the API is https://github.com/redpanda-data/redpanda/issues/5833
I remember we did some changes in the v22.1 branch to direct calls to the leader since it was supposed to apply the configuration before returning. The current operator code requires that there's some consistency between the two calls to save-config and get-config..
Getting the config on the leader after setting it on the leader is synchronous, it's just the status specifically that's asynchonous.
https://github.com/redpanda-data/redpanda/pull/5835
Needs rebase