Random DNS issue when using GitHub Actions
I'm trying to use the Tailscale GitHub Action on the latest version of Tailscale (I also tested other versions) and I'm getting these DNS issues.
It also happens when I install Tailscale manually on the machine while the job is running.
I've tried injecting the nameserver and search domain into /etc/resolv.conf, but that doesn't help in this case.
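Roughly, something like this (a sketch; 100.100.100.100 is the MagicDNS resolver, and example.ts.net stands in for the real tailnet search domain):

# Sketch only: overwrite the stub resolver config with the MagicDNS resolver.
# example.ts.net is a placeholder for the actual tailnet domain.
sudo tee /etc/resolv.conf >/dev/null <<'EOF'
nameserver 100.100.100.100
search example.ts.net
EOF

Note that on Ubuntu runners /etc/resolv.conf normally points at the systemd-resolved stub (127.0.0.53), and edits there can be overwritten, which may be why this didn't stick.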
In the admin console, I've defined the machine as ephemeral and pre-approved.
It happens only on GitHub Actions machines.
The issue is intermittent: sometimes it happens, sometimes it doesn't.
Thanks.
Are you using a Dockerfile runner or the Tailscale-supplied action.yml?
What's your GitHub runner type/version?
I tried both the GitHub Action and the manual installation, running on ubuntu-latest and 20.04 (which seems more stable), with Tailscale 1.58.0 and 1.56.0.
I was running into a lot of transient DNS resolution failures, followed this recommendation and it seems to be working a lot better: https://github.com/tailscale/github-action/issues/51#issuecomment-1497228382
I too encounter a lot of transient DNS errors; my deployment pipelines randomly fail like this:
> Run helm package ./deploy/chart \
Successfully packaged chart and saved it to: /home/runner/work/.../..../......tgz
Error: Kubernetes cluster unreachable: Get "https://xxxxx.gr7.eu-central-1.eks.amazonaws.com/version": dial tcp: lookup xxxxx.gr7.eu-central-1.eks.amazonaws.com on 127.0.0.53:53: read udp 127.0.0.1:40699->127.0.0.53:53: i/o timeout
It was working fine a few weeks ago, now I have to restart my deployment pipelines a lot.
I saw this a while back, but it seemed to go away for a while, then it became a problem again about a week ago. We are using the standard hosted runner and the following action. When it started causing us problems last week, we added the Tailscale version based on the same issue @matthewjthomas referenced, #51. It has not made a difference.
This is our action.
name: 'connect_tailscale'
description: 'Connects to Tailscale'
inputs:
  ts_oauth_client_id:
    description: 'TS_OAUTH_CLIENT_ID'
    required: true
  ts_oauth_secret:
    description: 'TS_OAUTH_SECRET'
    required: true
runs:
  using: 'composite'
  steps:
    - name: Tailscale
      uses: tailscale/github-action@v2
      with:
        version: 1.64.0
        oauth-client-id: ${{ inputs.TS_OAUTH_CLIENT_ID }}
        oauth-secret: ${{ inputs.TS_OAUTH_SECRET }}
        tags: tag:github
        args: --accept-routes --accept-dns
We are also experiencing DNS timeouts with Tailscale in our CI. Our setup:
- name: Tailscale
  uses: tailscale/github-action@v2
  with:
    oauth-client-id: ${{ env.TS_OAUTH_CLIENT_ID }}
    oauth-secret: ${{ env.TS_OAUTH_SECRET }}
    tags: tag:ci
    version: 1.64.0
We found that the Tailscale action is "reporting ready" too quickly.
It waits for tailscale status to return OK, but it takes another ~10s until DNS becomes available, so sleeping for 10s after the connect step usually solves the issue.
I'd like to have a more consistent way of waiting for DNS to become ready, though.
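A minimal sketch of that workaround as a shell step right after the Tailscale action (the 10s figure is the one mentioned above; the probe hostname is a placeholder):

# Give MagicDNS a moment after `tailscale status` reports ready, then do a
# best-effort check. Replace my-host.example.ts.net with a real tailnet name.
sleep 10
getent hosts my-host.example.ts.net || echo "MagicDNS still not resolving"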
> We found that the Tailscale action is "reporting ready" too quickly. It waits for tailscale status to return OK, but it takes another ~10s until DNS becomes available, so sleeping for 10s after the connect step usually solves the issue. I'd like to have a more consistent way of waiting for DNS to become ready, though.
I've just hit this problem (again) and it took several minutes for the Tailscale network to be in a working state (I put in a sleep 600 and tried to SSH into the GitHub runner; it took at least 3 minutes before my SSH went through). I'm wondering if the problem could be caused by an overloaded GitHub network.
In any case, I agree with @arnecls: it would be nice to have a tailscale command that could wait until MagicDNS is in working order.
Also seeing this at the moment, trying a sleep 10 as we speak but yes ideally there would be a way to explicitly wait for the propagation to happen.
@lukeramsden I'm glad someone is having the issue at the same time as me.
My hypotheses:
- flaky GitHub network
- nearest GitHub Actions DERP servers overloaded
@sylr same for us, especially today
sleep 10 seemed to work as a one-off for me just now but sounds like there can be quite a lot of variance
It's a bit hackish but less dumb than a sleep: https://github.com/sylr/tailscale-github-action/commit/338b779780551d54da6998c6e46948983434dae2
I've forked the GitHub action and added this at the end:
timeout --verbose --kill-after=1s ${TIMEOUT} sudo -E bash -c 'while tailscale dns query google.com. a | grep "failed to query DNS"; do sleep 1; done'
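If you'd rather not fork the action, the same wait should also work as a plain shell step right after tailscale/github-action (this just mirrors the loop above; google.com. is an arbitrary probe name, and `tailscale dns query` needs a client version that has it):

# Sketch: poll until MagicDNS answers, giving up after 2 minutes.
timeout --verbose --kill-after=1s 2m sudo -E bash -c \
  'while tailscale dns query google.com. a | grep -q "failed to query DNS"; do sleep 1; done'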
Currently seeing this:
Run if [ X64 = "ARM64" ]; then
Downloading https://pkgs.tailscale.com/stable/tailscale_1.76.6_amd64.tgz
(curl progress output trimmed; the tailscale tarball and its checksum file both downloaded successfully)
Expected sha256: 08f2377b78f7b9e411caa28f231a9c4cd0887209c142b49b815bcc7042ff61f7
Actual sha256: 08f2377b78f7b9e411caa28f231a9c4cd0887209c142b49b815bcc7042ff61f7 tailscale.tgz
tailscale.tgz: OK
Run if [ "$STATEDIR" == "" ]; then
Run if [ -z "${HOSTNAME}" ]; then
if [ -z "${HOSTNAME}" ]; then
HOSTNAME="github-$(cat /etc/hostname)"
fi
if [ -n "***" ]; then
TAILSCALE_AUTHKEY="***?preauthorized=true&ephemeral=true"
TAGS_ARG="--advertise-tags=tag:github-actions"
fi
timeout --verbose --kill-after=1s ${TIMEOUT} sudo -E tailscale up ${TAGS_ARG} --authkey=${TAILSCALE_AUTHKEY} --hostname=${HOSTNAME} --accept-routes ${ADDITIONAL_ARGS}
timeout --verbose --kill-after=1s ${TIMEOUT} sudo -E bash -c 'while tailscale dns query google.com. a | grep "failed to query DNS"; do sleep 1; done'
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
env:
AWS_DEFAULT_REGION: eu-central-1
AWS_REGION: eu-central-1
AWS_ACCESS_KEY_ID: ***
AWS_SECRET_ACCESS_KEY: ***
AWS_SESSION_TOKEN: ***
ADDITIONAL_ARGS:
HOSTNAME:
TAILSCALE_AUTHKEY:
TIMEOUT: 2m
TS_EXPERIMENT_OAUTH_AUTHKEY: true
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
(same error repeated until the 2m timeout)
timeout: sending signal TERM to command ‘sudo’
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
> - flaky GitHub network
> - nearest GitHub Actions DERP servers overloaded

It makes sense; maybe us-east is running workflows just before lunch :D
we're seeing this with GitHub Actions:
- hosted runners (ubuntu 22.04 and 24.04)
- tailscale 1.78.1

read udp 127.0.0.1:48903->127.0.0.53:53: i/o timeout
this is failing 100% of the time now when split DNS is used on GitHub Actions hosted runners:
timeout --verbose --kill-after=1s 2m sudo -E bash -c 'while tailscale dns query MY-CLUSTER-BEHIND-TAILSCALE-SPLIT-DNS.eks.amazonaws.com. a | grep "failed to query DNS"; do sleep 1; done'
failed to query DNS: 500 Internal Server Error: waiting for response or error from [172.31.0.2]: context deadline exceeded
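For anyone else debugging this, a rough set of checks to run on the runner when a split-DNS name times out (the cluster hostname above is a placeholder; `tailscale dns status` only exists in newer clients):

# What DNS routes/resolvers the Tailscale client believes it has
tailscale dns status
# What tailscaled actually pushed into systemd-resolved on the runner
resolvectl status tailscale0
# Resolve through the same stub resolver (127.0.0.53) the job itself uses
resolvectl query MY-CLUSTER-BEHIND-TAILSCALE-SPLIT-DNS.eks.amazonaws.com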
Yep, all my deploys are failing atm. It seems to be correlated with what other people are seeing. I'm also hosting my runners on https://blacksmith.sh/ rather than on GitHub Actions hosted runners, so maybe it's on Tailscale's end?
The Tailscale status page is now showing degraded performance for the coordination server: https://status.tailscale.com/
I've made a feature request for this: https://github.com/tailscale/github-action/issues/146
On this note, even before this outage we'd sometimes still get timeouts because of what I assume is coordination latency - is this something other folks have experienced?
it's working now, it seems like it was the coordinator server after all
We are still encountering lots of timeouts; is there another incident ongoing? The status page is reporting all green.
I'm seeing it take 15 minutes for DNS to propagate
Has anyone else tried to reach Tailscale support about this? I've sent two emails with no response.
It's not great: they acknowledged it on their status page today, but it's been like this since Saturday, and it's pretty bad for split DNS.
> Has anyone else tried to reach Tailscale support about this? I've sent two emails with no response.
not heard anything either
We faced the same issue and got no response from support either :/
We're seeing this issue again; it's timing out not just in GitHub Actions but also on our on-premises Ubuntu machines with split DNS.
yes, same for us
> we're seeing this with GitHub Actions:
> - hosted runners (ubuntu 22.04 and 24.04)
> - tailscale 1.78.1
> read udp 127.0.0.1:48903->127.0.0.53:53: i/o timeout
Experiencing exactly this behavior too. It's become worse over the last quarter.