Issues with Air-gapped Network Installation

Open ChrisJBurns opened this issue 4 years ago • 1 comments

Over the past couple of weeks I've been tasked with installing Flux v2 on EKS on an air-gapped network. It is not completely air-gapped, in the sense that the internal image registry we have set up (Artifactory) mirrors DockerHub and GHCR, so we are able to pull the Flux controller images. However, it is a private EKS cluster.

I ran into a few points of friction during this task. I have started threads in the Flux Slack channel and have been talking with Kingdon and a couple of others about filling some gaps in the documentation, and possibly raising a bug/issue for the areas where we couldn't get things to work.

Apologies if this issue has a lot of content. I am more than happy to split it up if needed; I just wanted to get everything out so people can start commenting on ways forward.

Summary of points / areas:

  • Pulling images from a private image registry (Artifactory) which is a self-signed host
  • Automatic Image Updates not Working due to Certificate errors in the image-automation-controller when trying to perform any git activity (this may actually be a problem with the https functionality)
  • Artifactory needed some special config combination to work when treated as the HelmRepository

Pulling images from a private image registry (Artifactory) which is a self-signed host

Originally, we were getting x509 cert errors when the pods tried to pull images from Artifactory. The reason was that the EKS node AMI didn't have the platform root certs baked into it, so calls to Artifactory failed with x509 unknown self-signed cert errors. Once we used a new AMI with the required certs loaded into the node's trust store, we could pull the Flux controller images from Artifactory without the x509 errors.

However, the battle was only half over. Although Flux was up and running with all of the pods healthy, the source-controller failed as soon as we actually specified a HelmRelease. When it tried to talk to Artifactory to fetch the HelmCharts and images required for the HelmRelease, it hit x509 unknown self-signed cert errors because the certs weren't in the source-controller pod's trust store. Basically, the pods didn't inherit their trust from the underlying node.

To get around this, similar to the AMI solution, we put the certs into the pod's trust store. We did this by patching the source-controller Deployment with an initContainer that downloads the root cert from our internal platform CDP and makes it available to the main source-controller container under the /etc/ssl/certs directory. After a restart, the source-controller had no issues making calls to an internally signed host. The same approach worked for the other controllers.

The code for this was as follows.

gotk-patches.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system
spec:
  template:
    spec:
      imagePullSecrets:
        - name: flux-artifactory-pull-secret
      containers:
        - name: manager
          volumeMounts:
            # Expose the downloaded root cert inside the controller's trust store directory
            - name: cert-volume
              mountPath: /etc/ssl/certs/root_ca.crt
              subPath: root_ca.crt
              readOnly: false
      initContainers:
        # Downloads the platform root cert into the shared volume before the controller starts
        - name: init-source-controller
          image: artifactory.url.blah.local/platform-image-repo/busybox:1.28
          command: ['sh', '-c', "wget -O /var/tmp/root_ca.crt http://cdp.platform.blah.local/pki_offline_root.cer"]
          volumeMounts:
            - name: cert-volume
              mountPath: /var/tmp
      volumes:
        # Shared scratch volume that carries the cert from the initContainer to the manager
        - name: cert-volume
          emptyDir: {}

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- gotk-sync.yaml
- gotk-components.yaml
patchesStrategicMerge:
- gotk-patches.yaml

Now, I'm not sure whether this is something we can add to the Flux documentation. I can imagine that on domains where things are self-signed, people may hit problems because the Flux component pods don't have the necessary certs in their trust store to call those self-signed hosts. Alternatively, it could be built into Flux itself, so that it adds the certs to the trust store rather than users having to create patches manually as I have above.
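
As a possible variation on the patch above, here is a minimal sketch that mounts the root cert from a ConfigMap instead of downloading it in an initContainer. We have not tested this in our environment. The ConfigMap name platform-root-ca and its key root_ca.crt are hypothetical, and it assumes the cert is PEM encoded and can be stored in the cluster ahead of time.

```yaml
# Hypothetical ConfigMap holding the platform root cert (assumed PEM encoded).
# This would be listed under `resources` in kustomization.yaml.
apiVersion: v1
kind: ConfigMap
metadata:
  name: platform-root-ca
  namespace: flux-system
data:
  root_ca.crt: |
    -----BEGIN CERTIFICATE-----
    <root cert contents go here>
    -----END CERTIFICATE-----
---
# Strategic merge patch, kept in its own file under `patchesStrategicMerge`:
# mounts the cert as a single file inside the controller's trust store directory.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system
spec:
  template:
    spec:
      containers:
        - name: manager
          volumeMounts:
            - name: root-ca
              mountPath: /etc/ssl/certs/root_ca.crt
              subPath: root_ca.crt
              readOnly: true
      volumes:
        - name: root-ca
          configMap:
            name: platform-root-ca
```

The two documents are shown together for readability only. In practice the ConfigMap would be a resource and the Deployment fragment a patch, so they would need to live in separate files.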

Automatic Image Updates not Working due to Certificate errors in the image-automation-controller when trying to perform any git activity

Once we got Flux up and running and were able to specify HelmReleases for our deployments, we found that when we tagged a new image in the ImageRepository, the image-reflector-controller detected the new image, but the image-automation-controller errored with a simple "unable to clone: Certificate error". I'm assuming it was failing on the clone it performs before writing the image update back to git.

I have used Flux before with SSH and never had image update automation problems. However, because we are in a specifically restricted environment, we are mandated to use HTTPS. I have a feeling there may be something wrong with the HTTPS image update automation, but that is only a feeling. The related thread on Slack for this can be found here. It's also worth mentioning that the source-controller doesn't have any issues with cloning, and it too is using HTTPS. The only difference between the source-controller (SC) and the image-automation-controller (IAC) is that the IAC also writes to git.

Unfortunately, we couldn't get past this issue and have resorted to manual image updates, which isn't the end of the world for now. But I thought I'd flag it in case the HTTPS git implementation isn't as stable as the SSH one.
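
One thing that might be worth trying here, which we have not tested: the Flux GitRepository documentation describes an optional caFile entry in the HTTPS credentials secret for trusting a custom Certificate Authority. Below is a minimal sketch, assuming the image-automation-controller honours the same secret when it clones. The secret name and all values are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Secret
metadata:
  # Hypothetical name; it would be referenced from the GitRepository's spec.secretRef
  name: flux-git-https-creds
  namespace: flux-system
type: Opaque
stringData:
  username: <git username>
  password: <git password or token>
  # Assumption: the platform root cert is PEM encoded
  caFile: |
    -----BEGIN CERTIFICATE-----
    <platform root cert contents go here>
    -----END CERTIFICATE-----
```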

Artifactory needed some special config combination to work when pulling HelmCharts

This one took a few hours to resolve: Flux and Artifactory really weren't working nicely together when Artifactory was being treated as the HelmRepository. As discussed on the Slack thread linked above, Kingdon had found on a previous issue that it was a URL issue with Artifactory, which meant that the port needed to be specified. I additionally found that I had to specify the passCredentials: true flag.

This ended up being the full YAML for the HelmRepository that worked:

---
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: blah
  namespace: blah
spec:
  interval: 10m
  url: https://artifactory.platform.url.local:443/artifactory/helm-charts/
  passCredentials: true
  secretRef:
    name: flux-artifactory-creds
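
For reference, here is a sketch of the Secret that secretRef points to. As far as I understand the Flux docs, HelmRepository basic auth reads the username and password keys from it; the values here are placeholders. My understanding is also that passCredentials matters when the chart download URLs in the repository index point at a different host than the one in spec.url, since the credentials are otherwise only sent to the spec.url host.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: flux-artifactory-creds
  namespace: blah   # must be the same namespace as the HelmRepository
type: Opaque
stringData:
  username: <artifactory username>
  password: <artifactory password or API token>
```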

Not sure if this is worth adding to the documentation? I can imagine it will help others who are using Artifactory to store their HelmCharts.

The above code is a desensitised version of our code (key names etc. removed), but it is otherwise the same.

Again, apologies for the long issue. I just wanted to document my experiences in the hope that it results in improved docs, new features or bug fixes.

ChrisJBurns, Dec 02 '21 20:12

Thank you for this - I'm battling with #510 myself. I'll give your patchesStrategicMerge a go.

You might want to check this out rather than building your own AMI: tomconte/containerd-certificate-ds.yaml. It adds the CA to your hosts by using a DaemonSet.

brovoca, Dec 03 '21 10:12