actions-runner-controller
occasionally `docker run` will hang for 30s-3min
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.23.5
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
- Deploy a self-hosted runner to Kubernetes. I am using `dockerdWithinRunnerContainer: true`, but I believe the same thing will happen without it:
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
spec:
  replicas: 1
  template:
    spec:
      dockerdWithinRunnerContainer: true
      repository: ...
      serviceAccountName: ...
      securityContext:
        # For Ubuntu 20.04 runner
        fsGroup: 1000
      resources:
        limits:
          memory: 20Gi
        requests:
          cpu: "6"
          memory: 20Gi
Describe the bug
Occasionally I see `docker run` commands take 2-3 minutes to start executing. This is true even if I pull the image first. For example, if a workflow runs:
docker pull hello-world
echo "--- DONE PULLING ---"
time docker run hello-world
Then occasionally, maybe 5% of the time, I see the pull happen, followed by a long delay before the run finally executes. The delay can last 2-3 minutes.
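As a rough illustration, timestamping each phase makes it easier to see in the job log which step the delay lands on (a sketch; the image and exact commands are only for demonstration):

date +"%T pull starting"
docker pull hello-world
date +"%T pull done, run starting"
time docker run hello-world
date +"%T run done"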
I can manually recreate this by exec-ing into the k8s pod using 2 shells and running this in each:
while true; do
time docker run hello-world
done
Initially the runs are quite fast, but at some point you'll see the runtime occasionally jump to upwards of 2-3 minutes. This looks like contention of some kind, but I cannot identify where the contention is.
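To narrow down where the time goes, it can also help to watch the daemon and the host from a third shell while the loop runs; a sketch, assuming vmstat (part of procps) is available in the image:

# In a third shell inside the pod:
docker events &   # prints container create/start events as the daemon actually emits them
vmstat 5          # blocked tasks and I/O wait, sampled every 5 seconds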
Describe the expected behavior
`docker run` should start quickly and consistently.
Additional Context
Nothing of note here
Controller Logs
Nothing of note here
Runner Pod Logs
Nothing of note here
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
Could you double-check that this is not pending writes to disk? I found that network-attached disks/SSDs underperform a lot.
I would concur that once the docker pull has finished, it should have written everything to disk and extracted the image, ready to go.
I take it that this is without dind?
We run an in-memory configuration for reasons like this. Our package manager installs and docker pulls are so disk-intensive that even network-attached SSDs become a bottleneck. Local NVMe might be viable, but I have yet to find a good method to clean up volumes with an emptyDir configuration.
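For reference, a quick way to see whether the daemon's storage is memory-backed or sitting on a network-attached volume (a sketch that assumes the default Docker data root of /var/lib/docker):

docker info --format '{{ .DockerRootDir }}'   # where the daemon keeps images and layers (usually /var/lib/docker)
df -hT /var/lib/docker                        # filesystem type: tmpfs for in-memory, ext4/xfs on a PV, etc.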
Oddly, the issue seems to have gone away in the past few months. We isolated our GHA runners to a dedicated node group for other reasons, and I believe that whatever contention we had is no longer there. It may very well be I/O-related, as we do use network-attached disks.
Could you double check that this is not pending writes to disk?
Can you elaborate on how I can check this? I can ssh into the pod and also the underlying k8s node. It might be the case that the write operation has finished but the data is still in a buffer somewhere and not actually written to disk.
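One rough way to check whether data is still sitting in the page cache rather than on disk, assuming standard tooling is available in the pod or on the node (iostat comes from the sysstat package and may not be in the runner image):

grep -E 'Dirty|Writeback' /proc/meminfo   # page-cache data (in kB) still waiting to be written back
time sync                                  # a slow sync here suggests a large flush backlog
iostat -x 1 5                              # per-device utilization and await times, if sysstat is installed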
I take it that this is without dind?
With and without.
Local NVMe might be viable, but I have yet to find a good method to clean up volumes with an emptyDir configuration
We've also experimented a bit with this, but we use ephemeral runners, so the disk cleanup comes "for free" for us.