
operator: move centralized-configuration Kuttl tests to main bucket.

Open nicolaferraro opened this issue 2 years ago • 8 comments

Centralized configuration and related tests seem to be stable.

cc: @jcsp

nicolaferraro avatar Jul 25 '22 09:07 nicolaferraro

What was the outcome of investigating the most recent failures of these? (https://redpandadata.slack.com/archives/C01H6JRQX1S/p1658740986316359?thread_ts=1658736130.531769&cid=C01H6JRQX1S)

If something is still up with them, let's fix it before we reinstate them.

jcsp avatar Jul 27 '22 14:07 jcsp

k8s-unstable-tests had two failures in last 24h:

FAIL test: k8s-unstable-tests.k8s-unstable-tests (2/24 runs)
  failure at 2022-08-02T09:37:39.714Z: None in job https://buildkite.com/redpanda/redpanda/builds/13453#01825dce-2bd3-4300-92a8-483046638bce
  failure at 2022-08-02T10:20:34.786Z: None in job https://buildkite.com/redpanda/redpanda/builds/13454#01825e11-5854-46fe-8699-4c617c6ce836

@nicolaferraro it would be good to get a signal on whether we have a bug here before we release 22.2

jcsp avatar Aug 03 '22 14:08 jcsp

@jcsp the logs highlight a strange behavior of the configuration system in redpanda:

2022-08-02T09:32:34.479154062Z stderr F 2022-08-02T09:32:34.479Z	INFO	controllers.redpanda.Cluster	Applying patch to the cluster configuration	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration", "patch": "+append_chunk_size"}
2022-08-02T09:32:34.480731898Z stderr F 2022-08-02T09:32:34.480Z	INFO	controllers.redpanda.Cluster	Patch written to the cluster	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration", "config_version": 4}
2022-08-02T09:32:34.481066838Z stderr F 2022-08-02T09:32:34.481Z	INFO	controllers.redpanda.Cluster	Centralized configuration hash has changed	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.488032683Z stderr F 2022-08-02T09:32:34.487Z	INFO	controllers.redpanda.Cluster	Node 0 restart status is false	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.488039338Z stderr F 2022-08-02T09:32:34.487Z	INFO	controllers.redpanda.Cluster	Node 1 restart status is false	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.488041858Z stderr F 2022-08-02T09:32:34.488Z	INFO	controllers.redpanda.Cluster	Node 0 is using config version 3	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.48804441Z stderr F 2022-08-02T09:32:34.488Z	INFO	controllers.redpanda.Cluster	Node 1 is using config version 3	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}

It's a 2-replica cluster and the flow can be read as:

  • Operator sets a different value for append_chunk_size, and the cluster returns configuration version 4
  • Operator then asks both nodes which configuration version they are using and whether they need a restart, and they both say version 3 and no restart needed

So, there might be some error in the way the query is performed, or redpanda changed the way to handle these cases of configuration changes. Wdyt?

nicolaferraro avatar Aug 04 '22 08:08 nicolaferraro

The information about which config version each node is using comes from a request to /v1/cluster_config/status, sent explicitly to the leader.

nicolaferraro avatar Aug 04 '22 08:08 nicolaferraro

So, there might be some error in the way the query is performed, or redpanda changed the way to handle these cases of configuration changes. Wdyt?

The status update is asynchronous: there is no guarantee that the version in the status will reflect the version in the response from the PUT. That's because status updates are themselves persistent writes to the controller log, separate from the write that updates the configuration. We could make the API a bit friendlier by waiting for status updates inside the PUT handler, but that would not be 100% reliable either, because it's possible for the controller to lose leadership between writing the config update and writing the status update.

For testing on existing code, the solution is to have a retry-wait for the status, rather than expecting it to be updated synchronously.

Making this a bit friendlier in the API is https://github.com/redpanda-data/redpanda/issues/5833

jcsp avatar Aug 04 '22 09:08 jcsp

I remember we made some changes in the v22.1 branch to direct calls to the leader, since it was supposed to apply the configuration before returning. The current operator code requires some consistency between the two calls, save-config and get-config.

nicolaferraro avatar Aug 04 '22 09:08 nicolaferraro

Getting the config on the leader after setting it on the leader is synchronous; it's just the status specifically that's asynchronous.

jcsp avatar Aug 04 '22 10:08 jcsp

https://github.com/redpanda-data/redpanda/pull/5835

jcsp avatar Aug 04 '22 10:08 jcsp

Needs rebase

joejulian avatar Mar 01 '23 17:03 joejulian