compliantkubernetes-apps
# Incremental version checks during migration
> [!warning]
> This is a public repository, ensure not to disclose:
>
> - [ ] personal data beyond what is necessary for interacting with this pull request, nor
> - [ ] business confidential information, such as customer names.
## What kind of PR is this?
Required: Mark one of the following that is applicable:
- [x] kind/feature
- [ ] kind/improvement
- [ ] kind/deprecation
- [ ] kind/documentation
- [ ] kind/clean-up
- [ ] kind/bug
- [ ] kind/other
Optional: Mark one or more of the following that are applicable:
> [!important]
> Breaking changes should be marked `kind/admin-change` or `kind/dev-change` depending on type.
> Critical security fixes should be marked with `kind/security`.
- [ ] kind/admin-change
- [ ] kind/dev-change
- [ ] kind/security
- [ ] [kind/adr](set-me)
## What does this PR do / why do we need this PR?
- [x] Ensure that `ck8s upgrade prepare` is run before `ck8s upgrade apply`
- [ ] Ensure that `ck8s upgrade apply` is run before `ck8s apply` (TODO)
digraph "ck8s upgrade" {
init [label="ck8s init"]
edit [label="edit config"]
apply [label="ck8s apply"]
uprep [label="ck8s upgrade prepare"]
upgrade [label="ck8s upgrade apply"]
init -> edit -> apply -> edit
edit -> uprep -> upgrade -> apply
# init -> uprep [label="NO!"]
# edit -> init [label="NO!"]
}
This uses a pair of ConfigMaps to keep track of progress during apps upgrades ~~in order to enforce order (prepare first, then apply, and for the same version)~~ and atomicity ~~(only allow prepare once)~~ per step; a sketch of how these ConfigMaps can be inspected follows the list below.
- `apps-meta` stores the current version of apps
  - `.data.version`: the version number itself
- `apps-upgrade` stores details about an in-progress upgrade, and is deleted to signify a completed migration
  - ~~`.data.prepare` records that `ck8s upgrade ... prepare` has been started~~
  - ~~`.data.prepared` records that `ck8s upgrade ... prepare` has completed~~
  - `.data.prepared` records that `ck8s upgrade ... apply` has started
  - `.data.last_step` records the last snippet run by `ck8s upgrade ...`
- Fixes #2315
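For illustration, this is roughly how that tracking data could be inspected; the ConfigMap names and keys follow the list above, while the namespace is an assumption and not necessarily what the scripts use:

```bash
# Hypothetical inspection of the tracking ConfigMaps (namespace is an assumption)
kubectl --namespace kube-system get configmap apps-meta -o jsonpath='{.data.version}'
kubectl --namespace kube-system get configmap apps-upgrade -o jsonpath='{.data.last_step}'
```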
## Information to reviewers
This is intended to behave as a state machine, going from start ~~to prepare~~ to apply (for each step) to done, at which point the upgraded-to version is recorded.
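As a rough illustration of one guard this enables, a plain `ck8s apply` could be refused while `apps-upgrade` still exists; this is only a sketch under the assumptions above, not the actual implementation:

```bash
# Sketch only: refuse a plain apply while an apps upgrade is still in progress.
# ConfigMap name as described above; the namespace is an assumption.
if kubectl --namespace kube-system get configmap apps-upgrade >/dev/null 2>&1; then
  echo "an apps upgrade is in progress, finish it with 'ck8s upgrade ... apply' first" >&2
  exit 1
fi
```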
How to test: I applied the changes on top of the release-0.42 branch and upgraded from 0.41.x to 0.42.1, then again to 0.43 by creating a local tag (not to be published).
WIP unit tests:
```bash
# upgrade prepare steps
make -C tests run-unit/general/bin-upgrade.bats

# version command
make -C tests run-unit/general/bin-version.bats
```
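For reference, a test for the ordering check could take roughly this shape in bats; the test name and the `ck8s` arguments are assumptions, not the actual contents of `bin-upgrade.bats`:

```bash
#!/usr/bin/env bats

# Hypothetical test shape, not the real bin-upgrade.bats.
@test "upgrade apply refuses to run before upgrade prepare" {
  # The ck8s arguments are assumed for illustration.
  run ck8s upgrade both v0.43 apply
  [ "$status" -ne 0 ]
  [[ "$output" == *"prepare"* ]]
}
```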
## Checklist
- [x] Proper commit message prefix on all commits
- Change checks:
- [x] The change is transparent
- [ ] The change is disruptive
- [x] The change requires no migration steps
- [ ] The change requires migration steps
- [ ] The change updates CRDs
- [ ] The change updates the config and the schema
- Documentation checks:
- [ ] The public documentation required no updates
- [ ] The public documentation required an update - [link to change](set-me)
- Metrics checks:
- [ ] The metrics are still exposed and present in Grafana after the change
- [ ] The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
- [ ] The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
- Logs checks:
- [ ] The logs do not show any errors after the change
- PodSecurityPolicy checks:
- [ ] Any changed Pod is covered by Kubernetes Pod Security Standards
- [ ] Any changed Pod is covered by Gatekeeper Pod Security Policies
- [ ] The change does not cause any Pods to be blocked by Pod Security Standards or Policies
- NetworkPolicy checks:
- [ ] Any changed Pod is covered by Network Policies
- [ ] The change does not cause any dropped packets in the NetworkPolicy Dashboard
- Audit checks:
- [ ] The change does not cause any unnecessary Kubernetes audit events
- [ ] The change requires changes to Kubernetes audit policy
- Falco checks:
- [ ] The change does not cause any alerts to be generated by Falco
- Bug checks:
- [ ] The bug fix is covered by regression tests
What is the motivation for keeping the phase of prepare in-cluster?
Isn't it just critical to track the apply phase?
> What is the motivation for keeping the phase of prepare in-cluster?

Seemed a natural starting point for these checks.

> Isn't it just critical to track the apply phase?

Yes.
https://github.com/elastisys/compliantkubernetes-apps/issues/2315#issuecomment-2416989660 is probably where I got that from, to track the whole upgrade process from the start.
> Regarding the tracking of the prepare phase, I like that we are able to make sure that the config has been properly updated and get rid of the `ck8sVersion` from the config. On the other hand, if there is no Kubernetes cluster to verify against then we still have to rely on something external.

But if it is in cluster it doesn't say anything, as the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is that config you mean. Or we have to move the entire config in cluster. Else if someone has done prepare, then someone else without that config can run apply.
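For what it's worth, the checksumming idea could look roughly like this; the file names under `CK8S_CONFIG_PATH` are assumptions about the config layout:

```bash
# Purely illustrative: derive a single digest for the config, which could then be
# stored next to the upgrade state in-cluster. File names are assumptions.
(
  cd "${CK8S_CONFIG_PATH}" &&
    sha256sum common-config.yaml sc-config.yaml wc-config.yaml
) | sha256sum | awk '{print $1}'
```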
> Regarding the tracking of the prepare phase, I like that we are able to make sure that the config has been properly updated and get rid of the `ck8sVersion` from the config. On the other hand, if there is no Kubernetes cluster to verify against then we still have to rely on something external.
>
> But if it is in cluster it doesn't say anything, as the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is that config you mean. Or we have to move the entire config in cluster. Else if someone has done prepare, then someone else without that config can run apply.

Yes you're right, the inherent issue of working with distributed configuration files still applies. At least now we can make sure that the cluster has been migrated to a new version, which is a big win.

I think my main concern with the `ck8sVersion` config is that it leads to a false sense of security and a bad habit of thinking that just running `ck8s init` means you're good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.
> Regarding the tracking of the prepare phase, I like that we are able to make sure that the config has been properly updated and get rid of the `ck8sVersion` from the config. On the other hand, if there is no Kubernetes cluster to verify against then we still have to rely on something external.
>
> But if it is in cluster it doesn't say anything, as the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is that config you mean. Or we have to move the entire config in cluster. Else if someone has done prepare, then someone else without that config can run apply.
>
> Yes you're right, the inherent issue of working with distributed configuration files still applies. At least now we can make sure that the cluster has been migrated to a new version, which is a big win.
>
> I think my main concern with the `ck8sVersion` config is that it leads to a false sense of security and a bad habit of thinking that just running `ck8s init` means you're good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.

If that is the concern then we should change how `ck8sVersion` is handled, because I do agree that this is an issue: init is not supposed to be the prepare step.
> I think my main concern with the `ck8sVersion` config is that it leads to a false sense of security and a bad habit of thinking that just running `ck8s init` means you're good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.
>
> If that is the concern then we should change how `ck8sVersion` is handled, because I do agree that this is an issue: init is not supposed to be the prepare step.

Would it be enough to add something here that checks in the apply steps that config version == cluster version, or do we need to rethink migrations from `ck8s init` onwards?
> I think my main concern with the `ck8sVersion` config is that it leads to a false sense of security and a bad habit of thinking that just running `ck8s init` means you're good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.
>
> If that is the concern then we should change how `ck8sVersion` is handled, because I do agree that this is an issue: init is not supposed to be the prepare step.
>
> Would it be enough to add something here that checks in the apply steps that config version == cluster version, or do we need to rethink migrations from `ck8s init` onwards?

Or just move the update of `ck8sVersion` to be done as a final step when upgrade prepare completes. Init should still be able to set it in a new config though.
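A minimal sketch of the check being discussed, comparing the version recorded in-cluster with the one in the config; the config key `global.ck8sVersion`, the file name, and the use of `yq` are assumptions about the current layout:

```bash
# Sketch only: fail the apply step if the config and the cluster disagree on version.
# The config key and file name are assumptions; apps-meta is described in the PR body.
config_version="$(yq '.global.ck8sVersion' "${CK8S_CONFIG_PATH}/common-config.yaml")"
cluster_version="$(kubectl --namespace kube-system get configmap apps-meta -o jsonpath='{.data.version}')"
if [ "${config_version}" != "${cluster_version}" ]; then
  echo "config is at ${config_version} but the cluster is at ${cluster_version}, run the migration first" >&2
  exit 1
fi
```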
Could you schedule a time and we can look through it in detail?
Commented on a few changes that seemed unrelated to the original issue. Please squash into separate commits or create separate PRs to make it easier to review.
Moved some DNS related changes into https://github.com/elastisys/compliantkubernetes-apps/pull/2539
Phew, finally. Thanks for reviewing :)