source-controller
Issues with Air-gapped Network Installation
Over the past couple of weeks I've been tasked with installing Flux v2 on EKS on an air-gapped network. It is not completely air-gapped, in the sense that the internal image registry we have set up (Artifactory) mirrors DockerHub and GHCR, so we are able to pull the Flux controller images. However, it is a private EKS cluster.
I've run into a few points of friction during this task. I have started threads in the Flux Slack channel and have been talking with Kingdon and a couple of others about the possibility of filling some gaps in the documentation, as well as possibly raising a bug/issue for the areas where we couldn't get things to work.
Apologies if this issue has a lot of content. I am more than happy to split it up if needed; I just wanted to get everything written down so people can start commenting on ways forward.
Summary of points / areas:
- Pulling images from a private image registry (Artifactory) which is a self-signed host
- Automatic Image Updates not Working due to Certificate errors in the image-automation-controller when trying to perform any git activity (this may actually be a problem with the https functionality)
- Artifactory needed some special config combination to work when treated as the HelmRepository
Pulling images from a private image registry (Artifactory) which is a self-signed host
Originally, we were getting x509 cert errors when the pods tried to pull their images from Artifactory. The reason was that the EKS node AMI didn't have the platform root certs baked into it, so every call to Artifactory failed with an x509 unknown self-signed cert error. Once we used a new AMI with the required certs loaded into the node's trust store, we could pull the Flux controller images from Artifactory without the x509 cert errors.
However, the battle was only half over. Although we had Flux up and running with all of the pods healthy, as soon as we specified a HelmRelease the source-controller failed. When it tried to talk to Artifactory to get the HelmCharts and images required for the HelmRelease, it hit x509 unknown self-signed cert errors because the certs weren't in the source-controller pod's trust store. Basically, the pods didn't inherit their trust from the underlying node.
To get around this, similar to the AMI solution, we put the certs into the pod's trust store. We did this by patching the source-controller with an initContainer that downloads the root cert from our internal platform CDP into a shared volume, which is then mounted into the main source-controller container under the /etc/ssl/certs directory. After a restart, the source-controller had no issues making calls to an internally signed host. The same fix was needed for the other controllers too.
The code for this was as follows:
gotk-patches.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system
spec:
  template:
    spec:
      imagePullSecrets:
        - name: flux-artifactory-pull-secret
      containers:
        - name: manager
          volumeMounts:
            - name: cert-volume
              mountPath: /etc/ssl/certs/root_ca.crt
              subPath: root_ca.crt
              readOnly: false
      initContainers:
        - name: init-source-controller
          image: artifactory.url.blah.local/platform-image-repo/busybox:1.28
          command: [ 'sh', '-c', "wget -O /var/tmp/root_ca.crt http://cdp.platform.blah.local/pki_offline_root.cer"]
          volumeMounts:
            - name: cert-volume
              mountPath: /var/tmp
      volumes:
        - name: cert-volume
          emptyDir: {}
kustomization.yaml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-sync.yaml
  - gotk-components.yaml
patchesStrategicMerge:
  - gotk-patches.yaml
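Since the other controllers needed the same fix, the patch could probably be applied to all of them in one go rather than once per Deployment. The following is only a sketch: it assumes the Flux controller Deployments all carry the app.kubernetes.io/part-of=flux label and name their main container manager, and the initContainer name init-root-ca is made up for illustration.

```yaml
# kustomization.yaml (sketch): apply the CA patch to every Flux controller
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-sync.yaml
  - gotk-components.yaml
patches:
  - target:
      kind: Deployment
      # Assumption: all Flux controller Deployments carry this label
      labelSelector: app.kubernetes.io/part-of=flux
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: all  # placeholder; the target selector decides what is patched
      spec:
        template:
          spec:
            imagePullSecrets:
              - name: flux-artifactory-pull-secret
            containers:
              - name: manager
                volumeMounts:
                  # Mount only the downloaded cert file into the trust store
                  - name: cert-volume
                    mountPath: /etc/ssl/certs/root_ca.crt
                    subPath: root_ca.crt
            initContainers:
              # Hypothetical name; downloads the root cert before the controller starts
              - name: init-root-ca
                image: artifactory.url.blah.local/platform-image-repo/busybox:1.28
                command: ['sh', '-c', 'wget -O /var/tmp/root_ca.crt http://cdp.platform.blah.local/pki_offline_root.cer']
                volumeMounts:
                  - name: cert-volume
                    mountPath: /var/tmp
            volumes:
              - name: cert-volume
                emptyDir: {}
```

Because the patch is a strategic merge, the cert-volume entries should be added alongside whatever volumes each controller already defines rather than replacing them.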
I'm not sure if this is something we can add to the Flux documentation. I can imagine that on domains where things are self-signed, others may hit the same problem of the Flux component pods not having the necessary certs in their trust store to call those self-signed hosts. Alternatively, it could be built into Flux itself, so that it adds the certs to the trust store rather than the user having to create patches manually as I have above.
Automatic Image Updates not Working due to Certificate errors in the image-automation-controller when trying to perform any git activity
Once we had Flux up and running and could specify HelmReleases for our deployments, we found that when we tagged a new image in the ImageRepository, the image-reflector-controller would detect the new image, but the image-automation-controller would error with a simple unable to clone: Certificate error. I'm assuming it was failing during the clone it performs before writing the image update back to git. I have used Flux with ssh before and never had image update automation problems. However, because we are in a restricted environment, we are mandated to use https. I have a feeling there may be something wrong with image update automation over https, but that is only a feeling. The related thread on Slack for this can be found here. It's also worth mentioning that the source-controller doesn't have any issues with cloning, and it too is using https. The only difference between the SC and the IAC is that the IAC also writes to git.
Unfortunately, we couldn't get past this issue and have resorted to manual image updates, which isn't the end of the world for now. But I thought I'd flag it in case the https git implementation isn't as stable as the ssh one.
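For anyone hitting the same error, the source-controller documentation describes a caFile entry in the credentials Secret of an https GitRepository. The following is a minimal sketch of that setup and is not confirmed to resolve the image-automation-controller error described here. The Secret name, repository URL and branch are all hypothetical placeholders.

```yaml
# Hypothetical names and placeholder values throughout
apiVersion: v1
kind: Secret
metadata:
  name: git-https-credentials
  namespace: flux-system
type: Opaque
stringData:
  username: <git-username>
  password: <git-token>
  # PEM-encoded internal root CA that signed the git host's certificate
  caFile: |
    -----BEGIN CERTIFICATE-----
    <PEM-encoded internal root CA>
    -----END CERTIFICATE-----
---
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.example.local/org/repo.git
  ref:
    branch: main
  secretRef:
    name: git-https-credentials
```

The ImageUpdateAutomation object refers to this GitRepository through its sourceRef, so in principle it would pick up the same credentials and CA when cloning and pushing.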
Artifactory needed some special config combination to work when pulling HelmCharts
This one took a few hours to resolve: Flux and Artifactory really weren't playing nicely when Artifactory was being treated as the HelmRepository. As discussed in the Slack thread linked above, Kingdon had found on a previous issue that it was a URL problem with Artifactory, which meant the port had to be specified explicitly. I additionally found that I had to set the passCredentials: true flag.
This ended up being the full yaml for HelmRepository that worked:
---
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: blah
  namespace: blah
spec:
  interval: 10m
  url: https://artifactory.platform.url.local:443/artifactory/helm-charts/
  passCredentials: true
  secretRef:
    name: flux-artifactory-creds
Not sure if this is worth adding to the documentation? I can imagine it will help others who are using Artifactory to store their Helm charts.
The code above is a desensitised version with key names etc. removed, but it is otherwise the same.
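For completeness, here is a sketch of the Secret that the HelmRepository's secretRef points at. It assumes basic-auth credentials stored under the username and password keys, which is what the HelmRepository documentation describes. The values are placeholders.

```yaml
# Sketch of the credentials Secret referenced by the HelmRepository above
apiVersion: v1
kind: Secret
metadata:
  name: flux-artifactory-creds
  namespace: blah  # must be the same namespace as the HelmRepository
type: Opaque
stringData:
  username: <artifactory-username>  # placeholder
  password: <artifactory-token>     # placeholder
```

As far as I understand the flag, passCredentials: true lets these credentials also be sent to a host that does not match the one in the repository URL, which may be why it was needed alongside the explicit port.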
Again, apologies for the long issue. I just wanted to document my experiences in the hope that it results in improved docs, new features or bug fixes.
Thank you for this - I'm battling with #510 myself. I'll give your patchesStrategicMerge a go.
You might want to check this out rather than building your own AMI: tomconte/containerd-certificate-ds.yaml. It adds the CA to your hosts by using a DaemonSet.