
Random DNS issue when using GitHub Actions

Open saarw-opti opened this issue 1 year ago • 31 comments

[Screenshot 2024-01-24 at 10 47 32]

I'm trying to use the Tailscale GitHub Action on the latest Tailscale version (I also tested other versions) and I'm getting these DNS issues. It also happens when I install Tailscale manually on the machine while the job is running. I've tried injecting the nameserver and the search domain into /etc/resolv.conf, but it doesn't help. In the admin console I've defined the machine as ephemeral and pre-approved. This happens only on GitHub Actions machines, and only intermittently.
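
For reference, the /etc/resolv.conf edit I tried looked roughly like this (a sketch; 100.100.100.100 is Tailscale's MagicDNS resolver, and example.ts.net stands in for our actual tailnet domain):

  # Manually point the resolver at MagicDNS and add the tailnet search domain.
  # (100.100.100.100 = Tailscale's MagicDNS address; example.ts.net is a placeholder.)
  printf 'nameserver 100.100.100.100\nsearch example.ts.net\n' | sudo tee /etc/resolv.conf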

Thanks.

saarw-opti avatar Jan 24 '24 08:01 saarw-opti

Are you using a Dockerfile runner or the Tailscale-supplied action.yml?

What's your GitHub runner type/version?

bradfitz avatar Jan 29 '24 22:01 bradfitz

I tried both the GitHub Action and a manual installation, running on ubuntu-latest and on 20.04 (which seems more stable), with Tailscale 1.58.0 and 1.56.0.

saarw-opti avatar Jan 30 '24 10:01 saarw-opti

I was running into a lot of transient DNS resolution failures; I followed this recommendation and it seems to be working a lot better: https://github.com/tailscale/github-action/issues/51#issuecomment-1497228382

matthewjthomas avatar Apr 04 '24 20:04 matthewjthomas

I too encounter a lot of transient DNS errors; my deployment pipelines randomly fail like this:

> Run helm package ./deploy/chart \
Successfully packaged chart and saved it to: /home/runner/work/.../..../......tgz
Error: Kubernetes cluster unreachable: Get "https://xxxxx.gr7.eu-central-1.eks.amazonaws.com/version": dial tcp: lookup xxxxx.gr7.eu-central-1.eks.amazonaws.com on 127.0.0.53:53: read udp 127.0.0.1:40699->127.0.0.53:53: i/o timeout

It was working fine a few weeks ago; now I have to restart my deployment pipelines a lot.
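
For context, 127.0.0.53 in that error is the local systemd-resolved stub, so when a run fails like this it's worth checking whether the Tailscale resolver actually got installed on the runner. Something along these lines (a sketch; tailscale dns status only exists on recent Tailscale releases):

  # The tailscale0 link should list 100.100.100.100 once MagicDNS is configured.
  resolvectl status tailscale0

  # On recent Tailscale releases, ask tailscaled what DNS config it pushed.
  tailscale dns status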

sylr avatar Apr 05 '24 15:04 sylr

I saw this a while back, but it seemed to go away for a while, then became a problem again about a week ago. We are using the standard hosted runner and the following action. When it started causing us problems last week, we pinned the Tailscale version based on the same issue @matthewjthomas referenced, #51. It has not made a difference.

This is our action:

name: 'connect_tailscale'
description: 'Connects to Tailscale'
inputs:
    ts_oauth_client_id:
        description: 'TS_OAUTH_CLIENT_ID'
        required: true
    ts_oauth_secret:
        description: 'TS_OAUTH_SECRET'
        required: true
runs:
    using: 'composite'
    steps:
        - name: Tailscale
          uses: tailscale/github-action@v2
          with:
              version: 1.64.0
              oauth-client-id: ${{ inputs.TS_OAUTH_CLIENT_ID }}
              oauth-secret: ${{ inputs.TS_OAUTH_SECRET }}
              tags: tag:github
              args: --accept-routes --accept-dns

dgivens avatar May 02 '24 20:05 dgivens

We are also experiencing DNS timeouts with Tailscale in our CI. Our setup:

      - name: Tailscale
        uses: tailscale/github-action@v2
        with:
          oauth-client-id: ${{ env.TS_OAUTH_CLIENT_ID }}
          oauth-secret: ${{ env.TS_OAUTH_SECRET }}
          tags: tag:ci
          version: 1.64.0

KlausVii avatar May 14 '24 15:05 KlausVii

We found that the Tailscale action "reports ready" too quickly. It waits for tailscale status to return OK, but it takes another ~10s until DNS becomes available. So sleeping for 10s after the connect step usually solves the issue.

I'd like to have a more consistent way of waiting for DNS to become ready, though.
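
As a rough sketch of what that looks like for us after the Tailscale step (internal.example.com stands in for a name that only resolves over our tailnet):

  # Give MagicDNS a moment, then poll the normal NSS/systemd-resolved path,
  # i.e. the same 127.0.0.53 stub the failing jobs time out against.
  sleep 10
  for attempt in $(seq 1 30); do
    if getent hosts internal.example.com >/dev/null; then
      echo "DNS is ready after ${attempt} attempt(s)"
      break
    fi
    echo "DNS not ready yet (attempt ${attempt}), retrying in 1s..."
    sleep 1
  done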

arnecls avatar Jun 24 '24 15:06 arnecls

> We found that the Tailscale action "reports ready" too quickly. It waits for tailscale status to return OK, but it takes another ~10s until DNS becomes available. So sleeping for 10s after the connect step usually solves the issue.
>
> I'd like to have a more consistent way of waiting for DNS to become ready, though.

I've just hit this problem (again), and it took several minutes for the Tailscale network to be in a working state (I put a sleep 600 in and tried to SSH into the GitHub runner; it took at least 3 minutes before my SSH went through). I'm wondering if the problem could be caused by an overloaded GitHub network.

In any case, I agree with @arnecls: it would be nice to have a tailscale command that could wait until MagicDNS is in working order.

sylr avatar Dec 04 '24 18:12 sylr

Also seeing this at the moment. Trying a sleep 10 as we speak, but yes, ideally there would be a way to explicitly wait for the propagation to happen.

lukeramsden avatar Dec 04 '24 18:12 lukeramsden

@lukeramsden I'm glad someone is having the issue at the same time as me.

My hypotheses:

  • flaky GitHub network
  • nearest GitHub Actions DERP servers overloaded

sylr avatar Dec 04 '24 18:12 sylr

@sylr same for us, especially today

lukasmrtvy avatar Dec 04 '24 18:12 lukasmrtvy

sleep 10 seemed to work as a one-off for me just now, but it sounds like there can be quite a lot of variance.

lukeramsden avatar Dec 04 '24 18:12 lukeramsden

It's a bit hackish, but less dumb than a sleep: https://github.com/sylr/tailscale-github-action/commit/338b779780551d54da6998c6e46948983434dae2

sylr avatar Dec 04 '24 18:12 sylr

I've forked the GitHub action and added this at the end:

  timeout --verbose --kill-after=1s ${TIMEOUT} sudo -E bash -c 'while tailscale dns query google.com. a | grep "failed to query DNS"; do sleep 1; done'

Currently seeing this:


Run if [ X64 = "ARM64" ]; then
Downloading https://pkgs.tailscale.com/stable/tailscale_1.76.6_amd64.tgz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    81  100    81    0     0    362      0 --:--:-- --:--:-- --:--:--   363

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 27.6M  100 27.6M    0     0  42.5M      0 --:--:-- --:--:-- --:--:-- 80.4M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    64  100    64    0     0    346      0 --:--:-- --:--:-- --:--:--   347
Expected sha256: 08f2377b78f7b9e411caa28f231a9c4cd0887209c142b49b815bcc7042ff61f7
Actual sha256: 08f2377b78f7b9e411caa28f231a9c4cd0887209c142b49b815bcc7042ff61f7  tailscale.tgz
tailscale.tgz: OK
Run if [ "$STATEDIR" == "" ]; then
Run if [ -z "${HOSTNAME}" ]; then
  if [ -z "${HOSTNAME}" ]; then
    HOSTNAME="github-$(cat /etc/hostname)"
  fi
  if [ -n "***" ]; then
    TAILSCALE_AUTHKEY="***?preauthorized=true&ephemeral=true"
    TAGS_ARG="--advertise-tags=tag:github-actions"
  fi
  timeout --verbose --kill-after=1s ${TIMEOUT} sudo -E tailscale up ${TAGS_ARG} --authkey=${TAILSCALE_AUTHKEY} --hostname=${HOSTNAME} --accept-routes ${ADDITIONAL_ARGS}
  timeout --verbose --kill-after=1s ${TIMEOUT} sudo -E bash -c 'while tailscale dns query google.com. a | grep "failed to query DNS"; do sleep 1; done'
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    AWS_DEFAULT_REGION: eu-central-1
    AWS_REGION: eu-central-1
    AWS_ACCESS_KEY_ID: ***
    AWS_SECRET_ACCESS_KEY: ***
    AWS_SESSION_TOKEN: ***
    ADDITIONAL_ARGS: 
    HOSTNAME: 
    TAILSCALE_AUTHKEY: 
    TIMEOUT: 2m
    TS_EXPERIMENT_OAUTH_AUTHKEY: true
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
timeout: sending signal TERM to command ‘sudo’
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
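
Same loop spelled out, and pointed at a name that actually needs the tailnet rather than google.com (which only proves the public upstream works); internal.example.ts.net is a placeholder here:

  # Wait (up to TIMEOUT, e.g. 2m) until MagicDNS can resolve a tailnet name.
  timeout --verbose --kill-after=1s "${TIMEOUT:-2m}" sudo -E bash -c '
    while tailscale dns query internal.example.ts.net. a 2>&1 | grep -q "failed to query DNS"; do
      echo "MagicDNS not ready yet, retrying..."
      sleep 1
    done
  '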

sylr avatar Dec 09 '24 15:12 sylr

>   • flaky GitHub network
>   • nearest GitHub Actions DERP servers overloaded

It makes sense; maybe us-east is running workflows just before lunch :D

lukasmrtvy avatar Dec 09 '24 16:12 lukasmrtvy

We're seeing this with GitHub Actions:

  • hosted runners (Ubuntu 22.04 and 24.04)
  • Tailscale 1.78.1

read udp 127.0.0.1:48903->127.0.0.53:53: i/o timeout

bithavoc avatar Dec 09 '24 16:12 bithavoc

This is failing 100% of the time now when split DNS is used on GitHub Actions hosted runners:

timeout --verbose --kill-after=1s 2m sudo -E bash -c 'while tailscale dns query MY-CLUSTER-BEHIND-TAILSCALE-SPLIT-DNS.eks.amazonaws.com. a | grep "failed to query DNS"; do sleep 1; done'
  
failed to query DNS: 500 Internal Server Error: waiting for response or error from [172.31.0.2]: context deadline exceeded
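
Something like the following can help separate a broken MagicDNS forwarder from an unreachable upstream (a sketch; dig ships on the ubuntu-latest runners, 100.100.100.100 is Tailscale's MagicDNS address, and 172.31.0.2 is the split-DNS upstream from the error above):

  # 1) Ask MagicDNS itself (the path the failing jobs use).
  dig +time=5 +tries=1 @100.100.100.100 MY-CLUSTER-BEHIND-TAILSCALE-SPLIT-DNS.eks.amazonaws.com

  # 2) Ask the split-DNS upstream directly over the tailnet route.
  dig +time=5 +tries=1 @172.31.0.2 MY-CLUSTER-BEHIND-TAILSCALE-SPLIT-DNS.eks.amazonaws.com

  # If (2) answers but (1) times out, the problem is in tailscaled's forwarding
  # (or its path to the upstream), not in the upstream resolver itself.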

bithavoc avatar Dec 09 '24 17:12 bithavoc

Yep, all my deploys are failing atm. Seems to be correlated with what others are seeing. I'm also hosting my runners on https://blacksmith.sh/ rather than GitHub Actions hosted runners, so maybe it's on Tailscale's end?

lukeramsden avatar Dec 09 '24 17:12 lukeramsden

The Tailscale status page is now showing degraded performance for the coordination server: https://status.tailscale.com/

bithavoc avatar Dec 09 '24 17:12 bithavoc

I've made a feature request for this: https://github.com/tailscale/github-action/issues/146

On this note, even before this outage we'd sometimes still get timeouts because of what I assume is coordination latency - is this something other folks have experienced?

aaomidi avatar Dec 09 '24 17:12 aaomidi

It's working now; it seems like it was the coordination server after all.

bithavoc avatar Dec 09 '24 17:12 bithavoc

We are still encountering lots of timeouts. Is there another incident ongoing? The status page is reporting all green.

KlausVii avatar Dec 16 '24 14:12 KlausVii

I'm seeing it take 15 minutes for DNS to propagate

lukeramsden avatar Dec 16 '24 14:12 lukeramsden

Has anyone else tried ringing Tailscale support about this? I've sent two emails without a response.

sylr avatar Dec 16 '24 16:12 sylr

It's not great. They acknowledged it on their status page today, but it's been like this since Saturday, and it's pretty bad for split DNS.

bithavoc avatar Dec 16 '24 17:12 bithavoc

> Has anyone else tried ringing Tailscale support about this? I've sent two emails without a response.

Not heard anything either.

lukeramsden avatar Dec 16 '24 20:12 lukeramsden

We've faced the same issue, also without a response from support :/

nicolasbriere1 avatar Dec 17 '24 09:12 nicolasbriere1

We're seeing this issue again; it's timing out not just in GitHub Actions but also on our on-premise Ubuntu machines with split DNS.

bithavoc avatar Jan 06 '25 18:01 bithavoc

yes, same for us

lukasmrtvy avatar Jan 06 '25 18:01 lukasmrtvy

> We're seeing this with GitHub Actions:
>
>   • hosted runners (Ubuntu 22.04 and 24.04)
>   • Tailscale 1.78.1
>
> read udp 127.0.0.1:48903->127.0.0.53:53: i/o timeout

Experiencing exactly this behavior too. It's become worse over the last quarter.

zenire avatar Jan 13 '25 13:01 zenire