
Adding a proxy to the source-controller causes the controller not to start

romogo17 opened this issue 1 year ago • 11 comments

Describe the bug

I have an EKS cluster that can only access the internet through an HTTP proxy. When I add a HelmRepository, the controller cannot fetch it by default:

~ kubectl get helmrepository -n flux-system
NAME      URL                                AGE     READY   STATUS
aws-eks   https://aws.github.io/eks-charts   6h42m   False   failed to fetch Helm repository index: failed to cache index to temporary file: Get "https://aws.github.io/eks-charts/index.yaml": dial tcp 185.199.108.153:443: i/o timeout

I tried to follow the Bootstrap cheatsheet, which has instructions to patch the controllers to add a proxy:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: all
      spec:
        template:
          spec:
            containers:
              - name: manager
                env:
                  - name: "HTTPS_PROXY"
                    value: "http://my-proxy-host:9595"
                  - name: "NO_PROXY"
                    value: "localhost,127.0.0.1,10.0.0.0/8,.internal,.cluster.local.,.cluster.local,.svc"
    target:
      kind: Deployment
      labelSelector: app.kubernetes.io/part-of=flux
      name: "source-controller"

This seems to be getting applied as expected when I bootstrap it:

kubectl describe deploy -n flux-system source-controller
Name:               source-controller
Namespace:          flux-system
...
    Environment:
      HTTPS_PROXY:        http://my-proxy-host:9595
      NO_PROXY:          localhost,127.0.0.1,10.0.0.0/8,.internal,.cluster.local.,.cluster.local,.svc
      RUNTIME_NAMESPACE:   (v1:metadata.namespace)
      TUF_ROOT:           /tmp/.sigstore
...

However, when I try to bootstrap Flux with this patch, the source-controller pods no longer start; the liveness and readiness probes fail with connection refused errors:

☁  ~  k describe pod -n flux-system source-controller-5c9f7f6d6f-24fs9
Name:             source-controller-5c9f7f6d6f-24fs9
Namespace:        flux-system
Priority:         0
Service Account:  source-controller
Node:             ip-10-5-97-151.ec2.internal/10.5.97.151
Start Time:       Sat, 08 Oct 2022 06:38:25 -0600
Labels:           app=source-controller
                  pod-template-hash=5c9f7f6d6f
Annotations:      container.seccomp.security.alpha.kubernetes.io/manager: runtime/default
                  kubernetes.io/psp: eks.privileged
                  prometheus.io/port: 8080
                  prometheus.io/scrape: true
Status:           Running
IP:               10.5.97.146
IPs:
  IP:           10.5.97.146
Controlled By:  ReplicaSet/source-controller-5c9f7f6d6f
Containers:
  manager:
    Container ID:  containerd://336f9b12c5c05ec9e6e9f787a9f86fb85b0109b5c68c647ad9b551cdcc1ab786
    Image:         ghcr.io/fluxcd/source-controller:v0.30.0
    Image ID:      ghcr.io/fluxcd/source-controller@sha256:afd1ceb08de3e9072a3d260604b04a985ff0798031b016519912d6ede28d2533
    Ports:         9090/TCP, 8080/TCP, 9440/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      --events-addr=http://notification-controller.flux-system.svc.cluster.local./
      --watch-all-namespaces=true
      --log-level=info
      --log-encoding=json
      --enable-leader-election
      --storage-path=/data
      --storage-adv-addr=source-controller.$(RUNTIME_NAMESPACE).svc.cluster.local.
    State:          Running
      Started:      Sat, 08 Oct 2022 06:39:25 -0600
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Sat, 08 Oct 2022 06:38:55 -0600
      Finished:     Sat, 08 Oct 2022 06:39:25 -0600
    Ready:          False
    Restart Count:  2
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      50m
      memory:   64Mi
    Liveness:   http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      HTTPS_PROXY:        http://my-proxy-host:9595
      NO_PROXY:            localhost,127.0.0.1,10.0.0.0/8,.internal,.cluster.local.,.cluster.local,.svc
      RUNTIME_NAMESPACE:  flux-system (v1:metadata.namespace)
      TUF_ROOT:           /tmp/.sigstore
    Mounts:
      /data from data (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2jtx7 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-2jtx7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  64s                default-scheduler  Successfully assigned flux-system/source-controller-5c9f7f6d6f-24fs9 to ip-10-5-97-151.ec2.internal
  Normal   Started    34s (x2 over 63s)  kubelet            Started container manager
  Normal   Pulled     4s (x3 over 64s)   kubelet            Container image "ghcr.io/fluxcd/source-controller:v0.30.0" already present on machine
  Normal   Created    4s (x3 over 64s)   kubelet            Created container manager
  Warning  Unhealthy  4s (x9 over 62s)   kubelet            Readiness probe failed: Get "http://10.5.97.146:9090/": dial tcp 10.5.97.146:9090: connect: connection refused
  Warning  Unhealthy  4s (x6 over 54s)   kubelet            Liveness probe failed: Get "http://10.5.97.146:9440/healthz": dial tcp 10.5.97.146:9440: connect: connection refused
  Normal   Killing    4s (x2 over 34s)   kubelet            Container manager failed liveness probe, will be restarted

The pods also don't output any logs.

Steps to reproduce

  1. Bootstrap Flux on an EKS cluster that can only access the internet through an HTTP proxy
  2. Patch the source-controller per the instructions to patch the controllers to add a proxy
  3. Observe that the pods no longer start

Expected behavior

Controller pods should start and be able to reach the internet through the proxy

Screenshots and recordings

No response

OS / Distro

client=macOS 12.5.1, EKS node groups=bottlerocket

Flux version

flux: v0.35.0

Flux check

► checking prerequisites
✔ Kubernetes 1.21.14-eks-6d3986b >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.25.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.29.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.27.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.30.0
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta1
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1beta2
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta2
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta1
✔ receivers.notification.toolkit.fluxcd.io/v1beta1
✔ all checks passed

Git provider

Bitbucket, although I'm using regular Git for bootstrapping

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

romogo17 avatar Oct 08 '22 13:10 romogo17

Configuring the proxy environment variables in lowercase seems to have solved the issue:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: all
      spec:
        template:
          spec:
            containers:
              - name: manager
                env:
                  - name: "http_proxy"
                    value: "http://my-proxy-host:9595"
                  - name: "https_proxY"
                    value: "http://my-proxy-host:9595"
                  - name: "no_proxy"
                    value: "localhost,127.0.0.1,10.0.0.0/8,172.20.0.0/16,.cluster.local.,.cluster.local,.svc,.flux-system"
    target:
      kind: Deployment
      labelSelector: app.kubernetes.io/part-of=flux
      name: "source-controller"

romogo17 avatar Oct 09 '22 02:10 romogo17

@romogo17 thanks for reporting this issue.

IIRC most (if not all) of our proxy implementation relies on upstream Go's httpproxy package, which accounts for both uppercase and lowercase variants:

	return &Config{
		HTTPProxy:  getEnvAny("HTTP_PROXY", "http_proxy"),
		HTTPSProxy: getEnvAny("HTTPS_PROXY", "https_proxy"),
		NoProxy:    getEnvAny("NO_PROXY", "no_proxy"),
		CGI:        os.Getenv("REQUEST_METHOD") != "",
	}

https://github.com/golang/net/blob/8021a29435afef042814c3ad3b702ff04b240bc7/http/httpproxy/proxy.go#L91-L93
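
For illustration, here is a minimal standalone sketch using the same golang.org/x/net/http/httpproxy package to show that resolution; the proxy host and NO_PROXY values are borrowed from the patches in this thread, not from a real environment:

package main

import (
	"fmt"
	"net/url"
	"os"

	"golang.org/x/net/http/httpproxy"
)

func main() {
	// Lowercase variants are picked up just like uppercase ones;
	// if both casings are set, getEnvAny returns the uppercase value first.
	os.Setenv("https_proxy", "http://my-proxy-host:9595")
	os.Setenv("no_proxy", "localhost,127.0.0.1,10.0.0.0/8,.cluster.local,.svc")

	proxy := httpproxy.FromEnvironment().ProxyFunc()

	for _, rawURL := range []string{
		"https://aws.github.io/eks-charts/index.yaml", // external: routed via the proxy
		"https://source-controller.flux-system.svc",   // matches .svc: direct connection
	} {
		u, _ := url.Parse(rawURL)
		p, err := proxy(u)
		fmt.Printf("%s -> proxy=%v err=%v\n", rawURL, p, err)
	}
}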

I noticed that your first patch did not include HTTP_PROXY, only your second one did. Can you please confirm that setting the three env vars (HTTP_PROXY, HTTPS_PROXY and NO_PROXY) in either all upper case or all lower case still yields different results?

pjbgf avatar Oct 10 '22 08:10 pjbgf

Hi @pjbgf, yes, I think that was it: my first patch didn't include HTTP_PROXY. I just tried it with only the upper case variants (but including both HTTP_PROXY and HTTPS_PROXY), and that worked.

So:

  • With the 3 env vars in upper case (HTTP_PROXY, HTTPS_PROXY and NO_PROXY): no issues, all good.
  • With the 3 env vars in lower case (http_proxy, https_proxy and no_proxy): the flux controllers start, but it seems like Helm still has issues pulling the repos (see https://github.com/helm/helm/issues/10065, but I'm guessing that's outside the scope of flux).
  • With only 2 env vars (HTTPS_PROXY and NO_PROXY): the controllers have issues starting; I'm not sure why.

I also added the EKS service subnet (172.20.0.0/16 in the snippets above) to the NO_PROXY, but that should've already been covered by .cluster.local.
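
One caveat on that assumption (a sketch with illustrative values, not taken from an actual cluster): in Go's httpproxy matching, a NO_PROXY domain suffix such as .cluster.local only exempts requests addressed to a matching hostname; a request aimed straight at a service IP stays proxied unless the IP or its CIDR is listed explicitly:

package main

import (
	"fmt"
	"net/url"

	"golang.org/x/net/http/httpproxy"
)

// check prints whether a request to rawURL would go through the proxy
// under the given config.
func check(cfg *httpproxy.Config, rawURL string) {
	u, _ := url.Parse(rawURL)
	p, _ := cfg.ProxyFunc()(u)
	fmt.Printf("no_proxy=%q  %s -> proxy=%v\n", cfg.NoProxy, rawURL, p)
}

func main() {
	cfg := &httpproxy.Config{
		HTTPSProxy: "http://my-proxy-host:9595",
		NoProxy:    ".cluster.local",
	}
	// A hostname under the suffix is exempted (proxy=<nil>)...
	check(cfg, "https://kubernetes.default.svc.cluster.local")
	// ...but a raw service IP (172.20.0.1 here is illustrative) is not.
	check(cfg, "https://172.20.0.1")

	// Listing the CIDR (or the specific IP) is what actually exempts it.
	cfg.NoProxy = ".cluster.local,172.20.0.0/16"
	check(cfg, "https://172.20.0.1")
}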


In general, I think adding the HTTP_PROXY to the Bootstrap cheatsheet would be useful. Happy to submit the PR. Would I need to update anything other than the MD file here?

romogo17 avatar Oct 10 '22 17:10 romogo17

Hello, I had the same issue, and what did the trick for me was adding the service (and pod) subnets

sylvainOL avatar Nov 26 '22 10:11 sylvainOL

Quick update: adding the DNS IP address (10.43.0.1 on rke2) to NO_PROXY is sufficient

sylvainOL avatar Nov 26 '22 15:11 sylvainOL

I also struggle with this. I set all proxy variables (http_proxy, HTTP_PROXY, https_proxy, HTTPS_PROXY, no_proxy, NO_PROXY) for the flux deployments and tried the solutions above, but nothing seems to solve the problems I observe. I added cluster.local and .cluster.local to my no-proxy vars because my controllers cannot reach each other via FQDN, but it does not help. I suspect that the wget implementation used under the hood is causing this: it does not read the no-proxy variables but rather needs the flag -Y off to skip the proxy; testing this manually worked. Are there any other ideas on how to solve this issue?

marrip avatar Feb 13 '23 12:02 marrip

I can confirm that I face the same problem. I tried both upper case and lower case env vars; the health checks still fail for every flux pod.

Paulius0112 avatar Feb 13 '23 16:02 Paulius0112

We have the same problem, with the env vars set in both upper and lower case.

derTobsch avatar Nov 07 '23 16:11 derTobsch

Same problem here (env vars set in both upper and lower case). Does anybody have a fix for this?

LennertMertens avatar Nov 12 '23 19:11 LennertMertens

As I said in my previous comment, I added the DNS IP to the NO_PROXY env variable (10.43.0.1 on rke2) and it works now

sylvainOL avatar Nov 13 '23 07:11 sylvainOL

@sylvainOL, you're right. For anyone interested, here is my configuration that is currently working:

  • kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: all
      spec:
        template:
          spec:
            containers:
              - name: manager
                securityContext:
                  runAsUser: 65534
                  seccompProfile:
                    $patch: delete
                env:
                  - name: "HTTPS_PROXY"
                    value: "http://proxy.example.com:3128"
                  - name: "HTTP_PROXY"
                    value: "http://proxy.example.com:3128"
                  - name: "NO_PROXY"
                    # 172.30.0.1 is my DNS IP
                    value: ".cluster.local.,.cluster.local,.svc,10.24.62.0/24,172.30.0.1,172.30.0.0/24"
    target:
      kind: Deployment
      labelSelector: app.kubernetes.io/part-of=flux
  - patch: |-
      - op: remove
        path: /metadata/labels/pod-security.kubernetes.io~1warn
      - op: remove
        path: /metadata/labels/pod-security.kubernetes.io~1warn-version
    target:
      kind: Namespace
      labelSelector: app.kubernetes.io/part-of=flux

LennertMertens avatar Nov 13 '23 08:11 LennertMertens