
Docker container in dind containerMode cannot connect to Github

Open duchuyvp opened this issue 1 year ago • 13 comments

Checks

  • [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I am using charts that are officially provided

Controller Version

0.9.3

Deployment Method

ArgoCD

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy the `gha-runner-scale-set-controller` chart with default values, then deploy the `gha-runner-scale-set` chart with release name `arc-runner-set`
   1.1 At this point, GitHub Actions works for a simple workflow file.
2. Exec into the `runner` container in the `arc-runner-set-****-runner-****` pod
3. Run `sudo apt update && sudo apt install git -y && git clone https://github.com/actions/actions-runner-controller.git` to make sure the pod has access to the public internet
4. Run `docker run --rm -it alpine sh -c "apk add git && git clone https://github.com/actions/actions-runner-controller.git"`

Describe the bug

Output from step 4:

fetch https://dl-cdn.alpinelinux.org/alpine/v3.20/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.20/community/x86_64/APKINDEX.tar.gz
(1/13) Installing ca-certificates (20240705-r0)
(2/13) Installing brotli-libs (1.1.0-r2)
(3/13) Installing c-ares (1.28.1-r0)
(4/13) Installing libunistring (1.2-r0)
(5/13) Installing libidn2 (2.3.7-r0)
(6/13) Installing nghttp2-libs (1.62.1-r0)
(7/13) Installing libpsl (0.21.5-r1)
(8/13) Installing zstd-libs (1.5.6-r0)
(9/13) Installing libcurl (8.9.0-r0)
(10/13) Installing libexpat (2.6.2-r0)
(11/13) Installing pcre2 (10.43-r0)
(12/13) Installing git (2.45.2-r0)
(13/13) Installing git-init-template (2.45.2-r0)
Executing busybox-1.36.1-r29.trigger
Executing ca-certificates-20240705-r0.trigger
OK: 20 MiB in 27 packages
Cloning into 'actions-runner-controller'...
fatal: unable to access 'https://github.com/actions/actions-runner-controller.git/': SSL connection timeout


Describe the expected behavior

The `docker run` command above completes without the SSL connection timeout error.

Additional Context

The YAML manifests I am using to deploy `gha-runner-scale-set-controller` and `gha-runner-scale-set`:


apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: arc
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: ghcr.io/actions/actions-runner-controller-charts
    targetRevision: 0.9.3
    chart: gha-runner-scale-set-controller
    helm:
      releaseName: arc
  destination:
    name: in-cluster
    namespace: arc-systems
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=false
      - ServerSideApply=true
      - ApplyOutOfSyncOnly=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  revisionHistoryLimit: 3
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: arc-runner-set
  namespace: argocd
spec:
  project: default
  destination:
    name: in-cluster
    namespace: arc-runners
  syncPolicy:
    automated:
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - ServerSideApply=true
      - ApplyOutOfSyncOnly=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  revisionHistoryLimit: 3

  source:
    repoURL: ghcr.io/actions/actions-runner-controller-charts
    targetRevision: 0.9.3
    chart: gha-runner-scale-set
    helm:
      releaseName: arc-runner-set
      parameters:
        - name: controllerServiceAccount.namespace
          value: arc-systems
        - name: controllerServiceAccount.name
          value: arc-gha-rs-controller
        - name: githubConfigUrl
          value: https://github.com/<organization>
        - name: minRunners
          value: "5"
        - name: containerMode.type
          value: dind
        - name: githubConfigSecret
          value: github-app-secret

Controller Logs

https://gist.github.com/duchuyvp/9b626aec67926976f09c52d303becd1a

Runner Pod Logs

These are the logs from pushing this workflow file:


name: Reproduce

on:
  push:
    branches: ['*']

jobs:
  push-reproduce:
    runs-on: arc-runner-set

    steps:
      - run: sudo apt update && sudo apt install git -y
      - run: git clone https://github.com/actions/actions-runner-controller.git
      - run: docker run --rm alpine sh -c "apk add git && git clone https://github.com/actions/actions-runner-controller.git"

https://gist.github.com/duchuyvp/6a5db187bfb3657a5361bcf62b0bd4ef

duchuyvp avatar Aug 01 '24 05:08 duchuyvp

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot] avatar Aug 01 '24 05:08 github-actions[bot]

@duchuyvp , do you happen to run the deployment on GKE?

norman-zon avatar Aug 01 '24 13:08 norman-zon

@norman-zon I haven't tested on GKE; I deployed on-premises.

duchuyvp avatar Aug 01 '24 22:08 duchuyvp

Try setting MTU for the docker daemon like:

- name: dind
  image: docker:dind
  args:
    - dockerd
    - --host=unix:///var/run/docker.sock
    - --group=$(DOCKER_GROUP_GID)
    - --mtu=1460

The default docker daemon MTU is 1500, but my host network has 1460. So aligning the docker daemon MTU fixed it for me.
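
To check whether the same mismatch applies to a given cluster, the MTU seen by the runner pod can be compared with the MTU that containers started by the Docker daemon receive. The workflow below is only a sketch in the style of the reproduction workflow above. The workflow and job names are hypothetical, and the interface name `eth0` is an assumption that may differ depending on the CNI.

```yaml
name: Check MTU

on:
  workflow_dispatch:

jobs:
  check-mtu:
    runs-on: arc-runner-set

    steps:
      # MTU of the runner pod's interface (eth0 is an assumption; it depends on the CNI)
      - run: cat /sys/class/net/eth0/mtu
      # MTU option recorded on Docker's default bridge network (prints an empty line if the option is absent)
      - run: |
          docker network inspect bridge --format '{{ index .Options "com.docker.network.driver.mtu" }}'
      # MTU actually seen from inside a container on the default bridge
      - run: docker run --rm alpine cat /sys/class/net/eth0/mtu
```

If the last value is larger than the first, containers can send packets that do not fit through the pod network, which would be consistent with the TLS timeouts reported above.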

norman-zon avatar Aug 02 '24 12:08 norman-zon

@norman-zon Thank you so much, your idea works for me too. I patched one runner pod to add --mtu=1450 to the dind container. But I don't know how to add this arg when deploying with Helm, since the dind container seems to be fixed in the gha-runner-scale-set chart: https://github.com/actions/actions-runner-controller/blob/a152741a1a6afa992f8d836a029d551984149c8f/charts/gha-runner-scale-set/templates/_helpers.tpl#L98-L116

Could you please show me how?

duchuyvp avatar Aug 08 '24 12:08 duchuyvp

I ended up using the solution with a configMap as described in the discussion here.

You have to set

containerMode:
    type: none

and then completely specify the template for the container, as described in the values file.
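
For readers who have not seen that part of the values file, the following is a rough sketch of what a fully specified template with the extra argument could look like once `containerMode` is no longer set to `dind`. It follows the dind container definition and the container names shown elsewhere in this thread. The image tags, the init container command and the MTU value are assumptions, so compare it with the commented example in the values file of your chart version before using it.

```yaml
# values.yaml for the gha-runner-scale-set chart (sketch, not the chart's authoritative template)
template:
  spec:
    initContainers:
    - name: init-dind-externals
      image: ghcr.io/actions/actions-runner:latest
      # Copies the runner externals into a volume shared with the dind container
      command: ["cp", "-r", "/home/runner/externals/.", "/home/runner/tmpDir/"]
      volumeMounts:
        - name: dind-externals
          mountPath: /home/runner/tmpDir
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      env:
        - name: DOCKER_HOST
          value: unix:///var/run/docker.sock
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /var/run
    - name: dind
      image: docker:dind
      args:
        - dockerd
        - --host=unix:///var/run/docker.sock
        - --group=$(DOCKER_GROUP_GID)
        - --mtu=1460   # assumption: set this to the MTU of your pod network
      env:
        - name: DOCKER_GROUP_GID
          value: "123"
      securityContext:
        privileged: true
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /var/run
        - name: dind-externals
          mountPath: /home/runner/externals
    volumes:
    - name: work
      emptyDir: {}
    - name: dind-sock
      emptyDir: {}
    - name: dind-externals
      emptyDir: {}
```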

This could be easier to add to the dind container if my PR were merged...

norman-zon avatar Aug 08 '24 13:08 norman-zon

Unfortunately this didn't solve our issue, which is ostensibly the same.

We have self-hosted runners in an on-premises OpenStack K8s cluster. For container actions which specify our own helper image with some useful utilities installed we can not connect to Github to clone the relevant repository. We have tried with both checkout actions, the GitHub cli and standard git with auth setup in the job.

After seeing this post we modified the DinD container as suggested passing the mtu argument and verified that this was indeed being set. And as a test followed the GP's example, trying to clone from the Runner container after installing git, which succeeded, then from the spawned helper container we tried to clone via the already installed git, which failed. All the different tests we have conducted resulted in variations of the same theme - ssl/tls timeout errors:

kubectl exec -it github-runner-scale-set-hello-world-cbr74-runner-jdr2z -- sh
Defaulted container "runner" out of: runner, dind, init-dind-externals (init)
$ sudo apt install git -y
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
<snipped>
Setting up git (1:2.46.0-0ppa1~ubuntu22.04.1) ...
Processing triggers for libc-bin (2.35-0ubuntu3.8) ...
$ git clone https://github.com/actions/actions-runner-controller.git <-- we can clone in runner container after installing git
Cloning into 'actions-runner-controller'...
remote: Enumerating objects: 12348, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 12348 (delta 11), reused 8 (delta 1), pack-reused 12321 (from 1)
Receiving objects: 100% (12348/12348), 5.44 MiB | 33.33 MiB/s, done.
Resolving deltas: 100% (8430/8430), done.
$ ls -ltr actions-runner-controller
drwxr-xr-x 23 runner runner  4096 Aug 14 06:42 actions-runner-controller
$ docker ps
CONTAINER ID   IMAGE                                                            COMMAND               CREATED              STATUS              PORTS     NAMES
cd3c11559488   ghcr.io/***/pipeline-helper:0.0.4   "tail -f /dev/null"   About a minute ago   Up About a minute             e588e3cf54e848bd99acc500aeec932e_ghcrio***pipelinehelper004_3c7f01
$ docker exec -it cd3c11559488 sh
/ # git --version <-- git already installed in container job
git version 2.45.2
/ # git clone https://github.com/actions/actions-runner-controller.git
Cloning into 'actions-runner-controller'...
fatal: unable to access 'https://github.com/actions/actions-runner-controller.git/': SSL connection timeout
Error: Process completed with exit code 128.

The specific error when using the GitHub Cli was error validating token: Get "https://api.github.com/": net/http: TLS handshake timeout

stuio avatar Aug 14 '24 07:08 stuio

@nikola-jokic Hi. I am not sure why the original Helm chart offers no way to change the DinD config, given how it is defined in the chart's _helpers.tpl:

{{- define "gha-runner-scale-set.dind-container" -}}
image: docker:dind
args:
  - dockerd
  - --host=unix:///var/run/docker.sock
  - --group=$(DOCKER_GROUP_GID)
env:
  - name: DOCKER_GROUP_GID
    value: "123"
securityContext:
  privileged: true
volumeMounts:
  - name: work
    mountPath: /home/runner/_work
  - name: dind-sock
    mountPath: /var/run
  - name: dind-externals
    mountPath: /home/runner/externals
{{- end }}

noamgreen avatar Aug 17 '24 08:08 noamgreen

In my values file I specified (along with the init and runner container).

template:
  spec:
    containers:
    - name: dind
      image: docker:dind
      args:
        - dockerd
        - --host=unix:///var/run/docker.sock
        - --group=$(DOCKER_GROUP_GID)
        - --mtu=1400

which works for the default network, but dependabot creates its own networks with no MTU setting, so they default to 1500 and dependabot breaks.

So that would fix the auto-created networks, but it won't help if you create docker networks as part of your actions.
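
For networks that a workflow creates itself, one option is to pass the MTU explicitly when the network is created. The steps below are a minimal sketch. The network name `ci-net` is hypothetical, and the value 1400 simply mirrors the `--mtu` setting above.

```yaml
steps:
  # Create a user-defined bridge network with an explicit MTU (hypothetical name: ci-net)
  - run: docker network create --opt com.docker.network.driver.mtu=1400 ci-net
  # Containers attached to that network should then report the lowered MTU on their interface
  - run: docker run --rm --network ci-net alpine cat /sys/class/net/eth0/mtu
  # Clean up the network afterwards
  - run: docker network rm ci-net
```

This only covers networks created by your own steps. Networks created by third-party tooling through the Docker API are not affected by it.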

na4ma4 avatar Aug 19 '24 04:08 na4ma4

I ended up using the solution discussed here, writing a daemon.json ConfigMap and mounting it inside the container at /etc/docker/daemon.json.

This allows setting

"bridge": {
  "com.docker.network.driver.mtu": "1460"
}
which is also used for all networks created by actions.
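
A sketch of how such a ConfigMap and its mount could look is shown below. The ConfigMap name `dind-daemon-config` and the volume name `daemon-config` are hypothetical. Placing the `bridge` fragment under a `default-network-opts` key, and adding a top-level `mtu` for the default bridge, are my assumptions about the complete file and are not taken from the linked discussion.

```yaml
# Kubernetes manifest: the daemon.json content (hypothetical ConfigMap name)
apiVersion: v1
kind: ConfigMap
metadata:
  name: dind-daemon-config
  namespace: arc-runners
data:
  daemon.json: |
    {
      "mtu": 1460,
      "default-network-opts": {
        "bridge": {
          "com.docker.network.driver.mtu": "1460"
        }
      }
    }
---
# Fragment of the runner scale set values: mount the file into the dind container.
# Only the additions are shown; the rest of the dind container spec stays as before.
template:
  spec:
    containers:
    - name: dind
      image: docker:dind
      volumeMounts:
        - name: daemon-config
          mountPath: /etc/docker/daemon.json
          subPath: daemon.json
          readOnly: true
    volumes:
    - name: daemon-config
      configMap:
        name: dind-daemon-config
```

If you go this route, drop `--mtu` from the container args, because dockerd refuses to start when the same option is set both as a flag and in daemon.json.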

norman-zon avatar Aug 20 '24 06:08 norman-zon

I was going to update today, I saw that moby/moby#43197 has been merged (earlier this year/late last year) and that solves my issue by adding this argument --default-network-opt=bridge=com.docker.network.driver.mtu=1400.

Now when dependabot calls the Docker API (not using a shell, so the shims don't help) to create a network for the updater container, that network has its MTU set to 1400.

template:
  spec:
    containers:
    - name: dind
      image: docker:dind
      args:
        - dockerd
        - --host=unix:///var/run/docker.sock
        - --group=$(DOCKER_GROUP_GID)
        - --mtu=1400
        - --default-network-opt=bridge=com.docker.network.driver.mtu=1400

From the dind container in the dependabot runner pod.

$ docker network inspect dependabot-job-11050-external-network

Output (cut for size):

[
    {
        "Name": "dependabot-job-11050-external-network",
        "Id": "dff4d1a3f843634c060258f5e808050ac9861ba487a0a0c677278506321374ea",
        "Created": "2024-08-20T07:10:54.585512615Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": { ... },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": { ... },
        "Options": {
            "com.docker.network.driver.mtu": "1400"
        },
        "Labels": {}
    }
]

na4ma4 avatar Aug 20 '24 07:08 na4ma4

Maybe these two options (container args and ConfigMap) should be added to the docs, considering how many reactions this issue got?

norman-zon avatar Aug 20 '24 07:08 norman-zon

The same issue occurs on an older version (0.9.0).

curl -v https://github.com fails at the TLS Client hello (1) step, but curl -v --resolve github.com:443:140.82.121.3 https://github.com/ works, and it works through a proxy as well.

working workaround:

  1. setup configmap for daemon.json with lower mtu (1400).
  2. comment out "containerMode.type=dind"
  3. use a custom template. (https://github.com/actions/actions-runner-controller/discussions/2993#discussioncomment-8071798)

After this patch it works with 0.9.3.

Any idea why only GitHub has this connectivity issue? What bug should be raised?

KostaGorod avatar Oct 20 '24 15:10 KostaGorod

For those who came here and don't know exactly what to do (as I didn't), here is how I "fixed" it.

My setup is:

  • baremetal machine
  • docker setup with containerd
  • vanilla k8s installation
  • flannel cni - no modifications

My k8s CNI is set up to use an MTU of 1450, so I changed the Docker MTU to 1450 and applied these manifests:

Docker Daemon JSON:

{
  "mtu": 1450,
  "dns": [ "<your-ipv4-gateway>",  "8.8.8.8", "8.8.4.4"],
  "hosts": ["unix:///var/run/docker.sock", "tcp://127.0.0.1:2375"]
}

Helm Command:

helm upgrade --install --namespace actions-runner-system --create-namespace   --set=authSecret.create=true  -f values.yaml  --set=authSecret.github_token="<your-token>"   --wait  actions-runner-controller actions-runner-controller/actions-runner-controller

For the Actions Runner Controller Helm Chart values file:

runner:
  containerMode:
    type: "dind"

For the Runner Deployment configuration:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: app
spec:
  replicas: 1
  template:
    spec:
      image: summerwind/actions-runner-dind
      dockerdWithinRunnerContainer: true
      repository: owner/repo
      dockerMTU: 1450
      env:
        - name: ARC_DOCKER_MTU_PROPAGATION
          value: "true"

Now, for me, checkout is working perfectly.

ghost avatar Apr 12 '25 21:04 ghost

Thanks to @na4ma4 for the answer; this works for me.

I'm using gha-runner-scale-set in dind mode. One workflow needs to pull and run an image in the github-runner pod. When the workflow starts, the runner creates a bridge-type network interface with an MTU of 1500, while the Kubernetes pod network is configured with an MTU of 1450. The MTU mismatch can cause dropped packets, which results in the following error:

unable to access 'https://github.com/xxx/xxx/': gnutls_handshake() failed: Error in the pull function.

Fix: using --default-network-opt=bridge=com.docker.network.driver.mtu=1450

image: docker:dind
args:
  - dockerd
  - --host=unix:///run/docker/docker.sock
  - --group=$(DOCKER_GROUP_GID)
  - --default-network-opt=bridge=com.docker.network.driver.mtu=1450

smallc2009 avatar Apr 25 '25 11:04 smallc2009

I think this is safe to close. Thank you, everyone, for providing answers!

nikola-jokic avatar Apr 25 '25 11:04 nikola-jokic