
Incremental version checks during migration

Open · Zash opened this pull request 10 months ago • 8 comments

[!warning] This is a public repository, ensure not to disclose:

  • [ ] personal data beyond what is necessary for interacting with this pull request, nor
  • [ ] business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • [x] kind/feature
  • [ ] kind/improvement
  • [ ] kind/deprecation
  • [ ] kind/documentation
  • [ ] kind/clean-up
  • [ ] kind/bug
  • [ ] kind/other

Optional: Mark one or more of the following that are applicable:

[!important] Breaking changes should be marked kind/admin-change or kind/dev-change depending on type. Critical security fixes should be marked with kind/security.

  • [ ] kind/admin-change
  • [ ] kind/dev-change
  • [ ] kind/security
  • [ ] [kind/adr](set-me)

What does this PR do / why do we need this PR?

  • [x] Ensure that ck8s upgrade prepare is run before ck8s upgrade apply
  • [ ] Ensure that ck8s upgrade apply is run before ck8s apply (TODO)

digraph "ck8s upgrade" {
	init [label="ck8s init"]
	edit [label="edit config"]
	apply [label="ck8s apply"]
	uprep [label="ck8s upgrade prepare"]
	upgrade [label="ck8s upgrade apply"]

	init -> edit -> apply -> edit
	edit -> uprep -> upgrade -> apply

# init -> uprep [label="NO!"]
# edit -> init [label="NO!"]
}

This uses a pair of ConfigMaps to keep track of progress during apps upgrades ~~in order to enforce order (prepare first, then apply, and for the same version)~~ and atomicity ~~(only allow prepare once)~~ per step; a rough sketch of the bookkeeping follows the list below.

  • apps-meta stores the current version of apps
    • .data.version the version number itself
  • apps-upgrade stores details about an in-progress upgrade, and is deleted to signify a completed migration
    • ~~.data.prepare records that ck8s upgrade ... prepare has been started~~
    • ~~.data.prepared records that ck8s upgrade ... prepare has completed~~
    • .data.prepared records that ck8s upgrade ... apply has started
    • .data.last_step records the last snippet run by ck8s upgrade ...
  • Fixes #2315
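
For illustration only, the bookkeeping described above could be driven with plain kubectl calls roughly like this (the namespace and the new_version/step_name variables are assumptions made for the sketch, not the PR's actual implementation):

# Sketch only: assumes the ConfigMaps live in the kube-system namespace and
# that new_version and step_name are set by the calling script.
NAMESPACE="kube-system"

# Read the currently recorded apps version (empty if apps-meta does not exist yet).
current_version="$(kubectl -n "${NAMESPACE}" get configmap apps-meta \
  -o jsonpath='{.data.version}' 2>/dev/null || true)"

# Record that "ck8s upgrade ... apply" has started towards ${new_version}.
kubectl -n "${NAMESPACE}" create configmap apps-upgrade \
  --from-literal=prepared="${new_version}"

# After each snippet, remember how far we got so a rerun can resume.
kubectl -n "${NAMESPACE}" patch configmap apps-upgrade \
  --type merge -p "{\"data\":{\"last_step\":\"${step_name}\"}}"

# On success: bump the recorded version and clear the in-progress marker.
kubectl -n "${NAMESPACE}" patch configmap apps-meta \
  --type merge -p "{\"data\":{\"version\":\"${new_version}\"}}"
kubectl -n "${NAMESPACE}" delete configmap apps-upgrade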

Information to reviewers

This is intended to behave as a state machine, going from start ~~to prepare~~ to apply (for each step) to done, at which point the upgraded-to version is recorded.
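
As a rough illustration (not the actual code), the decision at the start of a command then reduces to checking which state the cluster is in:

# Sketch only: ConfigMap names and keys as in the list above.
if kubectl get configmap apps-upgrade >/dev/null 2>&1; then
  # An upgrade is in progress: resume after the last recorded step and
  # block a plain "ck8s apply" until the upgrade finishes.
  last_step="$(kubectl get configmap apps-upgrade -o jsonpath='{.data.last_step}')"
  echo "upgrade in progress, resuming after step: ${last_step:-<none>}"
else
  # No upgrade in progress: the cluster is at whatever apps-meta says.
  echo "apps at version: $(kubectl get configmap apps-meta -o jsonpath='{.data.version}')"
fi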

How to test: I applied the changes on top of the release-0.42 branch and upgraded from 0.41.x to 0.42.1, then again to 0.43 by creating a local tag (not to be published).
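
The branch and local tag handling was roughly the following (the branch and tag names below are hypothetical examples, and the tag was never pushed):

# Start from the release branch and apply these changes on top
# (the feature branch name here is hypothetical).
git checkout release-0.42
git merge --no-ff zash/upgrade-version-checks

# Fake the next release with a local-only tag so the upgrade code sees a
# newer version; delete it with "git tag -d" afterwards, never push it.
git tag v0.43.0-local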

WIP unit tests:

# upgrade prepare steps
make -C tests run-unit/general/bin-upgrade.bats

# version command
make -C tests run-unit/general/bin-version.bats
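
For reference, a minimal sketch of the kind of case such a file could contain (the invocation and assertions are hypothetical, not the real test code):

# hypothetical excerpt in the style of bin-upgrade.bats
@test "upgrade apply refuses to run before upgrade prepare" {
  run ck8s upgrade apply            # hypothetical invocation; real arguments elided
  [ "$status" -ne 0 ]               # the command should fail
  [[ "$output" == *"prepare"* ]]    # and mention that prepare must run first
}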

Checklist

  • [x] Proper commit message prefix on all commits
  • Change checks:
    • [x] The change is transparent
    • [ ] The change is disruptive
    • [x] The change requires no migration steps
    • [ ] The change requires migration steps
    • [ ] The change updates CRDs
    • [ ] The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • [ ] The metrics are still exposed and present in Grafana after the change
    • [ ] The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • [ ] The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • [ ] The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • [ ] Any changed Pod is covered by Kubernetes Pod Security Standards
    • [ ] Any changed Pod is covered by Gatekeeper Pod Security Policies
    • [ ] The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • [ ] Any changed Pod is covered by Network Policies
    • [ ] The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • [ ] The change does not cause any unnecessary Kubernetes audit events
    • [ ] The change requires changes to Kubernetes audit policy
  • Falco checks:
    • [ ] The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • [ ] The bug fix is covered by regression tests

Zash avatar Jan 17 '25 13:01 Zash

What is the motivation for keeping the phase of prepare in-cluster?

Isn't it just critical to track the apply phase?

aarnq avatar Jan 22 '25 12:01 aarnq

> What is the motivation for keeping the phase of prepare in-cluster?

Seemed a natural starting point for these checks.

> Isn't it just critical to track the apply phase?

Yes.

Zash avatar Jan 22 '25 13:01 Zash

https://github.com/elastisys/compliantkubernetes-apps/issues/2315#issuecomment-2416989660 is probably where I got that from, to track the whole upgrade process from the start.

Zash avatar Jan 22 '25 13:01 Zash

Regarding the tracking of the prepare phase, I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

But if it is in cluster it doesn't say anything, as the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is that config you mean. Or we have to move the entire config in cluster. Else if someone has done prepare, then someone else without that config can run apply.
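
For example (purely illustrative, and not part of this PR), the prepare step could record a digest of the config and apply could refuse to run against a different one; the config_checksum key and the file list are assumptions:

# Sketch only: CK8S_CONFIG_PATH is the existing config directory variable,
# the key name and which files to hash are assumptions.
checksum="$(cat "${CK8S_CONFIG_PATH}"/*.yaml | sha256sum | cut -d' ' -f1)"

# prepare: store the digest next to the rest of the upgrade state
kubectl patch configmap apps-upgrade --type merge \
  -p "{\"data\":{\"config_checksum\":\"${checksum}\"}}"

# apply: bail out if the local config is not the one that was prepared
recorded="$(kubectl get configmap apps-upgrade -o jsonpath='{.data.config_checksum}')"
if [ "${recorded}" != "${checksum}" ]; then
  echo "error: local config does not match the config used for prepare" >&2
  exit 1
fi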

aarnq avatar Jan 24 '25 09:01 aarnq

> Regarding the tracking of the prepare phase, I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

> But if it is in cluster it doesn't say anything, as the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is that config you mean. Or we have to move the entire config in cluster. Else if someone has done prepare, then someone else without that config can run apply.

Yes, you're right, the inherent issue of working with distributed configuration files still applies. At least now we can make sure that the cluster has been migrated to a new version, which is a big win.

I think my main concern with the ck8sVersion config is that it leads to a false sense of security and a bad habit of thinking that just running ck8s init means you're good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.

simonklb avatar Jan 24 '25 15:01 simonklb

> Regarding the tracking of the prepare phase, I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

> But if it is in cluster it doesn't say anything, as the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is that config you mean. Or we have to move the entire config in cluster. Else if someone has done prepare, then someone else without that config can run apply.

> Yes, you're right, the inherent issue of working with distributed configuration files still applies. At least now we can make sure that the cluster has been migrated to a new version, which is a big win.

> I think my main concern with the ck8sVersion config is that it leads to a false sense of security and a bad habit of thinking that just running ck8s init means you're good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.

If that is the concern then we should change how ck8sVersion is handled, because I do agree that this is an issue: init is not supposed to be the prepare step.

aarnq avatar Feb 12 '25 09:02 aarnq

> I think my main concern with the ck8sVersion config is that it leads to a false sense of security and a bad habit of thinking that just running ck8s init means you're good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.

> If that is the concern then we should change how ck8sVersion is handled, because I do agree that this is an issue: init is not supposed to be the prepare step.

Would it be enough to add something here that checks in the apply steps that config-version == cluster version, or do we need to rethink migrations from ck8s init onwards?
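
Something along these lines, perhaps (assuming the version lives at global.ck8sVersion in common-config.yaml; the exact names are illustrative):

# Sketch: compare the ck8sVersion in the local config with the version
# recorded in the cluster before running any apply step.
config_version="$(yq '.global.ck8sVersion' "${CK8S_CONFIG_PATH}/common-config.yaml")"
cluster_version="$(kubectl get configmap apps-meta -o jsonpath='{.data.version}')"

if [ "${config_version}" != "${cluster_version}" ]; then
  echo "error: config is at ${config_version} but the cluster is at ${cluster_version}" >&2
  exit 1
fi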

Zash avatar Feb 19 '25 13:02 Zash

> I think my main concern with the ck8sVersion config is that it leads to a false sense of security and a bad habit of thinking that just running ck8s init means you're good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.

> If that is the concern then we should change how ck8sVersion is handled, because I do agree that this is an issue: init is not supposed to be the prepare step.

> Would it be enough to add something here that checks in the apply steps that config-version == cluster version, or do we need to rethink migrations from ck8s init onwards?

Or just move the update of ck8sVersion to be done as a final step when upgrade prepare completes. Init should still be able to set it in a new config though.

aarnq avatar Feb 24 '25 07:02 aarnq

Could you schedule a time and we can look through it in detail?

aarnq avatar Apr 11 '25 11:04 aarnq

> Commented on a few changes that seemed unrelated to the original issue. Please squash into separate commits or create separate PRs to make it easier to review.

Moved some DNS related changes into https://github.com/elastisys/compliantkubernetes-apps/pull/2539

Zash avatar May 21 '25 12:05 Zash

Phew, finally. Thanks for reviewing :)

Zash avatar Jun 04 '25 12:06 Zash

[image]

simonklb avatar Jun 04 '25 12:06 simonklb