actions-runner-controller
occasionally `docker run` will hang for 30s-3min
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.23.5
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
- Deploy a self-hosted runner to Kubernetes. I am using `dockerdWithinRunnerContainer: true`, but I believe the same thing will happen without it:
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
spec:
  replicas: 1
  template:
    spec:
      dockerdWithinRunnerContainer: true
      repository: ...
      serviceAccountName: ...
      securityContext:
        # For Ubuntu 20.04 runner
        fsGroup: 1000
      resources:
        limits:
          memory: 20Gi
        requests:
          cpu: "6"
          memory: 20Gi
Describe the bug
Occasionally I see `docker run` commands take 2-3 minutes to start executing. This is true even if I pull the image first. For example, if a workflow runs:
docker pull hello-world
echo "--- DONE PULLING ---"
time docker run hello-world
Then occasionally, maybe 5% of the time, I see the pull happen, followed by a long delay before the run finally executes. The delay can last 2-3 minutes.
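As a rough illustration, timestamping each phase makes it easier to see in the job log which step the delay lands on (a sketch; the image and exact commands are only for demonstration):

date +"%T pull starting"
docker pull hello-world
date +"%T pull done, run starting"
time docker run hello-world
date +"%T run done"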
I can manually recreate this by exec-ing into the k8s pod using 2 shells and running this in each:
while true; do
time docker run hello-world
done
Initially the runs are quite fast, but at some point you'll see the runtime occasionally jump to upwards of 2-3 minutes. This looks like contention of some kind, but I cannot identify where the contention is.
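To narrow down where the time goes, it can also help to watch the daemon and the host from a third shell while the loop runs; a sketch, assuming vmstat (part of procps) is available in the image:

# In a third shell inside the pod:
docker events &   # prints container create/start events as the daemon actually emits them
vmstat 5          # blocked tasks and I/O wait, sampled every 5 seconds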
Describe the expected behavior
`docker run` should start quickly and consistently.
Additional Context
Nothing of note here
Controller Logs
Nothing of note here
Runner Pod Logs
Nothing of note here
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
Could you double-check that this is not pending writes to disk? I found that network-attached disks/SSDs underperform a lot.
I would concur that once the docker pull has finished, it should have written everything to disk and extracted the image, ready to go.
I take it that this is without dind?
We run an in-memory configuration for reasons like this. Our package manager installs and docker pulls are so disk-intensive that even network-attached SSDs become a bottleneck. Local NVMe might be viable, but I have yet to find a good method to clean up volumes with an emptyDir configuration.
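For reference, a quick way to see whether the daemon's storage is memory-backed or sitting on a network-attached volume (a sketch that assumes the default Docker data root of /var/lib/docker):

docker info --format '{{ .DockerRootDir }}'   # where the daemon keeps images and layers (usually /var/lib/docker)
df -hT /var/lib/docker                        # filesystem type: tmpfs for in-memory, ext4/xfs on a PV, etc.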
Oddly, the issue seems to have gone away in the past few months. We isolated our GHA runners to a dedicated node group for other reasons, and I believe that whatever contention we had is no longer there. It may very well be I/O-related, as we do use network-attached disks.
Could you double check that this is not pending writes to disk?
Can you elaborate on how I can check this? I can ssh into the pod and also the underlying k8s node. It might be the case that the write operation has finished but the data is still in a buffer somewhere and not actually written to disk.
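One rough way to check whether data is still sitting in the page cache rather than on disk, assuming standard tooling is available in the pod or on the node (iostat comes from the sysstat package and may not be in the runner image):

grep -E 'Dirty|Writeback' /proc/meminfo   # page-cache data (in kB) still waiting to be written back
time sync                                  # a slow sync here suggests a large flush backlog
iostat -x 1 5                              # per-device utilization and await times, if sysstat is installed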
I take it that this is without dind?
With and without.
Local NVMe might be viable, but I have yet to find a good method to clean up volumes with an emptyDir configuration
We've also experimented a bit with this, but we use ephemeral runners, so the disk cleanup comes "for free" for us.