Upgrade of Flux on AKS cluster doesn't work as per the documentation

Open FofM opened this issue 2 years ago • 6 comments

Describe the bug

I tried to upgrade Flux twice from v0.27.2 to v0.30.2 on an AKS cluster running k8s version 1.23.3 (upgraded from 1.21.4) by following the documentation.

On both occasions, the upgrade step wiped the Flux components from the AKS cluster. I tried to redeploy them with kubectl apply -f components.yaml, which brought the controllers back on the cluster; however, the source controller starts failing after 5-10 minutes.

Could it be that the documentation is missing an important bit which I'm failing to apply in the Flux upgrade?

Steps to reproduce

Git repo used: Azure Repo

  1. Have a running AKS cluster on version 1.21.9
  2. Install Flux v0.27.2
  3. Bootstrap the cluster as described in Flux docs (Azure)
  4. Upgrade the AKS cluster to 1.22.9 and then 1.23.3
  5. Upgrade Flux CLI to v0.30.2 (latest at time of writing)
  6. Run flux install --export > .\clusters\test\flux-system\components.yaml
  7. Git add + commit + push

At this point, Flux on the cluster tries to pull the update from the Azure Git repo, ends up removing all components from the flux-system namespace, and never reapplies them.

I also tried running kubectl.exe apply -f .\clusters\test\flux-system\components.yaml afterwards to see whether it would reinstall things. It looks good at first and the controller pods come back, but after 5-10 minutes the source controller starts failing.

Event: Readiness probe failed: Get "http://10.244.2.9:9090/": dial tcp 10.244.2.9:9090: connect: connection refused

Logs:

W0512 13:15:11.209804       1 reflector.go:324] k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1beta2.HelmRepository: helmrepositories.source.toolkit.fluxcd.io is forbidden: User "system:serviceaccount:flux-system:source-controller" cannot list resource "helmrepositories" in API group "source.toolkit.fluxcd.io" at the cluster scope
E0512 13:15:11.209861       1 reflector.go:138] k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1beta2.HelmRepository: failed to list *v1beta2.HelmRepository: helmrepositories.source.toolkit.fluxcd.io is forbidden: User "system:serviceaccount:flux-system:source-controller" cannot list resource "helmrepositories" in API group "source.toolkit.fluxcd.io" at the cluster scope
W0512 13:15:23.312749       1 reflector.go:324] k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1beta2.HelmChart: helmcharts.source.toolkit.fluxcd.io is forbidden: User "system:serviceaccount:flux-system:source-controller" cannot list resource "helmcharts" in API group "source.toolkit.fluxcd.io" at the cluster scope
E0512 13:15:23.312806       1 reflector.go:138] k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1beta2.HelmChart: failed to list *v1beta2.HelmChart: helmcharts.source.toolkit.fluxcd.io is forbidden: User "system:serviceaccount:flux-system:source-controller" cannot list resource "helmcharts" in API group "source.toolkit.fluxcd.io" at the cluster scope
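The "forbidden ... at the cluster scope" messages above say that the source-controller service account is not allowed to list the v1beta2 source kinds. A minimal sketch of how that could be confirmed from a POSIX shell, assuming kubectl points at the affected cluster and the current user is allowed to impersonate service accounts (the function name is hypothetical):

```shell
# Hypothetical helper: ask the API server whether the source-controller
# service account may list the two kinds named in the log lines above.
check_source_controller_rbac() {
  sa="system:serviceaccount:flux-system:source-controller"
  for resource in helmrepositories helmcharts; do
    printf '%s: ' "$resource"
    # "kubectl auth can-i" prints yes or no for the impersonated account.
    kubectl auth can-i list "${resource}.source.toolkit.fluxcd.io" \
      --as="$sa" --all-namespaces
  done
}
```

If either line reports "no", the ClusterRole and ClusterRoleBinding applied from the components file would be the next thing to inspect.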

Expected behavior

Upgrade of Flux and all sub-components to the latest version

Screenshots and recordings

No response

OS / Distro

Windows 11

Flux version

v0.27.2 and v0.30.2

Flux check

PS C:\projects\k8s-platform-config> flux check
► checking prerequisites
✔ Kubernetes 1.23.3 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.21.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.25.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.23.5
✗ source-controller: deployment not ready
► ghcr.io/fluxcd/source-controller:v0.24.4

Git provider

Azure Repo

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

FofM avatar May 12 '22 13:05 FofM

That's very strange!

In the standard distribution or bootstrap result, the components file is called flux-system/gotk-components.yaml

Are you certain that you have overwritten it in place? In your report, the filename used is components.yaml instead.

kingdonb avatar May 18 '22 12:05 kingdonb

The bootstrapping is based on the Flux doc for Azure DevOps, where the filename was changed in the following code:

flux install \
  --export > ./clusters/my-cluster/flux-system/gotk-components.yaml

to ./clusters/my-cluster/flux-system/components.yaml

Does Flux expect the exact name "gotk-components.yaml" to be there? Everything else works quite normally for us; it's just the upgrade of the Flux version that is failing. I can try to rename that file and see what happens.

FofM avatar May 18 '22 12:05 FofM

As long as it matches what you find in your kustomization.yaml as generated by this line:

cd ./clusters/my-cluster/flux-system && kustomize create --autodetect

(from the section just above https://fluxcd.io/docs/use-cases/azure/#flux-upgrade)

It does not matter what the file is named, so long as it is mentioned in kustomization.yaml. A word about that: the name of this one file IS important. No other file can be named kustomization.yaml, as this filename is reserved for the Kustomize overlay. (It is a common issue that someone names their Flux Kustomization kustomization.yaml and gets surprising results. But that may have nothing to do with your issue, just checking...)
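A quick way to check that condition, as a rough sketch rather than anything Flux ships: the hypothetical helper below looks for a given file name among the list entries of kustomization.yaml. It matches any YAML list item in the file, not only those under resources, which is a simplifying assumption.

```shell
# Hypothetical helper: succeed if NAME appears as a list entry
# (e.g. "- components.yaml" or "- ./components.yaml") in DIR/kustomization.yaml.
is_listed_in_kustomization() {
  awk -v want="$2" '
    {
      line = $0
      sub(/^[ \t]*-[ \t]*/, "", line)   # drop the leading "- "
      sub(/^\.\//, "", line)            # drop an optional "./" prefix
      sub(/[ \t\r]*$/, "", line)        # drop trailing whitespace
    }
    $0 ~ /^[ \t]*-/ && line == want { found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$1/kustomization.yaml"
}
```

For the layout in this thread it might be called as is_listed_in_kustomization ./clusters/test/flux-system components.yaml.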

kingdonb avatar May 20 '22 15:05 kingdonb

Thanks for your input @kingdonb, I had another look at the file naming and we do seem to be in order there. I've tried several different upgrade scenarios in the meantime and found one that works on a test cluster. I should know by the end of the week if it will work on the existing clusters. In short:

  • The AKS version is not important; I have tried using only 1.23.3 and I can reproduce the behavior.
  • We were missing the metadata name and namespace properties (value "flux-system") in the sync.yaml file.
  • Most importantly, simply pushing the new components.yaml version to the Git repo is not enough (at least for Azure); Flux will not update its components automatically. I had to run kubectl.exe apply -f .\clusters\test\flux-system\components.yaml for the component update to happen.
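Put together, the sequence that worked on the test cluster could be sketched as below. This is only an illustration of the steps described in this thread, written for a POSIX shell rather than PowerShell; the function name and the commit message are made up, and the directory argument is whatever flux-system path the repo uses.

```shell
# Hypothetical wrapper around the commands mentioned in this thread.
upgrade_flux_components() {
  dir=$1                          # e.g. ./clusters/test/flux-system
  file="${dir}/components.yaml"

  # Regenerate the manifests with the newly installed Flux CLI.
  flux install --export > "$file" || return 1

  # Commit and push so the Git repo matches what gets applied.
  git add "$file" &&
    git commit -m "Update Flux components" &&
    git push || return 1

  # Apply the manifests by hand, as described in the last bullet above.
  kubectl apply -f "$file" || return 1

  # Report controller status afterwards.
  flux check
}
```

It would be invoked as upgrade_flux_components ./clusters/test/flux-system from the root of the repo.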

FofM avatar May 24 '22 11:05 FofM

@FofM this filename is wrong: ./clusters/my-cluster/flux-system/components.yaml. It should be ./clusters/my-cluster/flux-system/gotk-components.yaml; that's why Flux downgrades itself.

stefanprodan avatar May 24 '22 11:05 stefanprodan

I've done a lot more testing and I don't think the filename plays any role in the upgrade scenario. It is mainly an issue when upgrading from version 0.27.x, since there are breaking changes in 0.28.x. There were a couple of times when the upgrade succeeded, but other times, using the exact same steps, Flux deletes its pods, and reapplying components.yaml is only partially successful because the source controller stops working after a few minutes (as described in the original post). I am not sure if I will have the time to dig deeper into this one, but if I find anything useful and a working upgrade path I will post it here.

FofM avatar May 31 '22 12:05 FofM