Setup step is failing in first run across multiple repos

Open andre-qumulo opened this issue 7 months ago • 13 comments

We have multiple repos using the v2 and v3 versions of the action. Both are pinned to Tailscale 1.82.5 and are consistently failing with:

Run echo "TAILSCALE_LATEST=$(curl -s "https://pkgs.tailscale.com/stable/?mode=json" | jq -r .Version)" \
  echo "TAILSCALE_LATEST=$(curl -s "https://pkgs.tailscale.com/stable/?mode=json" | jq -r .Version)" \
  >> "$GITHUB_OUTPUT"
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
Run tailscale/github-action@v2
  with:
    oauth-client-id: ***
    oauth-secret: ***
    tags: <tags>
    version: 1.82.5
Run if [ X64 = "ARM64" ]; then
  if [ X64 = "ARM64" ]; then
    TS_ARCH="arm64"
  elif [ X64 = "ARM" ]; then
    TS_ARCH="arm"
  elif [ X64 = "X86" ]; then
    TS_ARCH="386"
  elif [ X64 = "X64" ]; then
    TS_ARCH="amd64"
  else
    TS_ARCH="amd64"
  fi
  MINOR=$(echo "$VERSION" | awk -F '.' {'print $2'})
  if [ $((MINOR % 2)) -eq 0 ]; then
    URL="https://pkgs.tailscale.com/stable/tailscale_${VERSION}_${TS_ARCH}.tgz"
  else
    URL="https://pkgs.tailscale.com/unstable/tailscale_${VERSION}_${TS_ARCH}.tgz"
  fi
  if ! [[ "$SHA256SUM" ]] ; then
    SHA256SUM="$(curl -H user-agent:tailscale-github-action -L "${URL}.sha256")"
  fi
  curl -H user-agent:tailscale-github-action -L "$URL" -o tailscale.tgz --max-time 300
  echo "$SHA256SUM  tailscale.tgz" | sha256sum -c
  tar -C /tmp -xzf tailscale.tgz
  rm tailscale.tgz
  TSPATH=/tmp/tailscale_${VERSION}_${TS_ARCH}
  sudo mv "${TSPATH}/tailscale" "${TSPATH}/tailscaled" /usr/bin
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    VERSION: 1.82.5
    SHA256SUM: 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    64  100    64    0     0    438      0 --:--:-- --:--:-- --:--:--   441
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    81  100    81    0     0    615      0 --:--:-- --:--:-- --:--:--   618

  0 30.0M    0  8499    0     0  27476      0  0:19:06 --:--:--  0:19:06 27476
100 30.0M  100 30.0M    0     0  37.2M      0 --:--:-- --:--:-- --:--:-- 60.4M
tailscale.tgz: OK
Run sudo -E tailscaled --state=mem: ${ADDITIONAL_DAEMON_ARGS} 2>~/tailscaled.log &
  sudo -E tailscaled --state=mem: ${ADDITIONAL_DAEMON_ARGS} 2>~/tailscaled.log &
  # And check that tailscaled came up. The CLI will block for a bit waiting
  # for it. And --json will make it exit with status 0 even if we're logged
  # out (as we will be). Without --json it returns an error if we're not up.
  sudo -E tailscale status --json >/dev/null
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    ADDITIONAL_DAEMON_ARGS: 
Run if [ -z "${HOSTNAME}" ]; then
  if [ -z "${HOSTNAME}" ]; then
    HOSTNAME="github-$(cat /etc/hostname)"
  fi
  if [ -n "***" ]; then
    TAILSCALE_AUTHKEY="***?preauthorized=true&ephemeral=true"
    TAGS_ARG="--advertise-tags=<tags>"
  fi
  timeout 5m sudo -E tailscale up ${TAGS_ARG} --authkey=${TAILSCALE_AUTHKEY} --hostname=${HOSTNAME} --accept-routes ${ADDITIONAL_ARGS}
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    TAILSCALE_AUTHKEY: 
    ADDITIONAL_ARGS: 
    HOSTNAME: 
    TS_EXPERIMENT_OAUTH_AUTHKEY: true
context canceled
Error: Process completed with exit code 124.
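
For what it's worth, exit code 124 is what GNU timeout returns when the command it supervises hits its time limit, so the failure above is the timeout 5m ... tailscale up step running out of time rather than tailscale exiting on its own. A trivial repro of that exit code (not tied to this action):

```shell
#!/usr/bin/env bash
# GNU timeout returns 124 when the supervised command exceeds its limit,
# matching the "exit code 124" at the end of the log above.
rc=0
timeout 1s sleep 5 || rc=$?
echo "exit status: $rc"   # prints "exit status: 124"
```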

Our workflow files have this step:

    - name: Activate TailScale
      uses: tailscale/github-action@v3
      with:
        oauth-client-id: ${{ inputs.TS_OAUTH_ID }}
        oauth-secret: ${{ inputs.TS_OAUTH_SECRET }}
        tags: <tags>
        version: latest
        use-cache: true

Any thoughts on what could be going on?
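
For anyone else digging into the setup step, the URL-selection logic in the log boils down to the following (a standalone sketch that mirrors what the action's script does; the function names are mine, not the action's):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Map the GitHub runner architecture (the RUNNER_ARCH value seen in the
# log, e.g. X64) to Tailscale's package naming.
runner_arch_to_ts() {
  case "$1" in
    ARM64) echo "arm64" ;;
    ARM)   echo "arm" ;;
    X86)   echo "386" ;;
    *)     echo "amd64" ;;  # X64 and anything unrecognized fall back to amd64
  esac
}

# Even minor versions (1.82.x) come from the stable track; odd minors
# come from unstable.
ts_tarball_url() {
  local version="$1" arch="$2" minor track
  minor=$(echo "$version" | awk -F '.' '{print $2}')
  if [ $((minor % 2)) -eq 0 ]; then track=stable; else track=unstable; fi
  echo "https://pkgs.tailscale.com/${track}/tailscale_${version}_${arch}.tgz"
}

ts_tarball_url 1.82.5 "$(runner_arch_to_ts X64)"
# -> https://pkgs.tailscale.com/stable/tailscale_1.82.5_amd64.tgz
```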

andre-qumulo avatar May 13 '25 18:05 andre-qumulo

I have the same thing happening; it started today on my v3 actions. Some of my runners will connect while others will eventually time out.

Running version 1.82.5

timeout: sending signal TERM to command ‘sudo’
timeout: sending signal KILL to command ‘sudo’
/home/runner/work/_temp/4782142e-bc26-4e1c-a2c8-b88c8d232520.sh: line 15:   399 Killed                  timeout --verbose --kill-after=1s ${TIMEOUT} ${MAYBE_SUDO} tailscale up ${TAGS_ARG} --authkey=${TAILSCALE_AUTHKEY} --hostname=${HOSTNAME} --accept-routes ${ADDITIONAL_ARGS}

andrea-armstrong avatar May 13 '25 21:05 andrea-armstrong

Same here..


matanbaruch avatar May 14 '25 04:05 matanbaruch

We've also been seeing this sporadically for a few weeks but especially bad in the last few days. The tailscale up command in the action just times out. Also using 1.82.5.

...
  timeout --verbose --kill-after=1s ${TIMEOUT} ${MAYBE_SUDO} tailscale up ${TAGS_ARG} --authkey=${TAILSCALE_AUTHKEY} --hostname=${HOSTNAME} --accept-routes ${ADDITIONAL_ARGS}
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    RESOLVED_VERSION: 1.82.5
    TS_ARCH: amd64
    SHA256SUM: 41a8931fa52055bd7ea4b51df9acff2ba2d4e9000c2380b667539b5b99991464
    ADDITIONAL_ARGS: 
    HOSTNAME: 
    TAILSCALE_AUTHKEY: 
    TIMEOUT: 2m
timeout: sending signal TERM to command ‘sudo’
context canceled

mikepilat avatar May 14 '25 15:05 mikepilat

Same

timeout: sending signal TERM to command ‘sudo’
context canceled
Error: Process completed with exit code 124.

mateuszkocik avatar May 15 '25 10:05 mateuszkocik

Any workaround for this problem?

igorpecovnik avatar May 15 '25 12:05 igorpecovnik

I'm not sure what the root cause is, but something is clearly failing or getting stuck on the tailscale up command, so we modified the action and implemented retries. It seems to be holding up:

        for i in {1..5}; do
          echo "Attempt $i to bring up Tailscale..."
          timeout --verbose --kill-after=1s ${TIMEOUT} ${MAYBE_SUDO} tailscale up ${TAGS_ARG} --authkey=${TAILSCALE_AUTHKEY} --hostname=${HOSTNAME} --accept-routes ${ADDITIONAL_ARGS} && break
          echo "Tailscale up failed. Retrying in $((i * 5)) seconds..."
          sleep $((i * 5))
        done
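
If you'd rather not fork the action, the same idea can live in your own workflow as a generic helper (a sketch only; the attempt count and delays are arbitrary, and the commented-out tailscale invocation is a hypothetical placeholder for your real one):

```shell
#!/usr/bin/env bash
set -uo pipefail

# Run a command up to $1 times, sleeping a linearly growing delay between
# attempts (5s, 10s, ...). Returns the last exit status on total failure.
retry() {
  local attempts="$1"; shift
  local i rc=0
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    rc=$?
    echo "Attempt $i failed (exit $rc)." >&2
    if [ "$i" -lt "$attempts" ]; then
      echo "Retrying in $((i * 5)) seconds..." >&2
      sleep $((i * 5))
    fi
  done
  return "$rc"
}

# Hypothetical usage -- substitute your real 'tailscale up' arguments:
# retry 5 timeout --verbose --kill-after=1s 2m sudo tailscale up --authkey=...
```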

andrea-armstrong avatar May 15 '25 13:05 andrea-armstrong

We are impacted by this as well. Sporadic failing across our many repos.

ngetchell-pi avatar May 20 '25 19:05 ngetchell-pi

The same problem here, getting timeout: sending signal TERM to command ‘sudo’ on 30% of runs.

Any updates on this?

rmarku avatar May 22 '25 20:05 rmarku

Been happening on 100% of our runs starting today. Between the binary download issue and now this, the action in general has been the number one source of frustration for us.

pushchris avatar May 22 '25 21:05 pushchris

I reached out to Tailscale Support last week, and there is now an open status-page incident for this (the May 15th one), but I'm still seeing it as well, so there is no resolution that I'm aware of.

I think the retry approach described above is probably the best workaround available right now.

mikepilat avatar May 23 '25 18:05 mikepilat

Hey, thanks for bringing this issue to our attention - and apologies for the recent instability.

@mikepilat is correct, the problem is caused by sporadic latency spikes when new nodes attempt to join a tailnet, which can lead the action to time out.

We deployed a series of fixes to the platform over the last few days that have hopefully made the action more stable. Looking ahead, our platform team is actively working on broader changes to better mitigate these issues, but those improvements are more involved and will take more time to roll out.

In the meantime, we'll incorporate the retry logic suggested by @andrea-armstrong to make the action more resilient against platform hiccups. I'm fully aware "sleep 10" is not ideal, but we hope it helps reduce friction as we work toward a long-term fix.

mcoulombe avatar May 23 '25 20:05 mcoulombe

Hi @mcoulombe ! I see the commit that fixes and adds the retry, but are y'all planning on cutting a new v3.x release of the tailscale action with the addition of the retry? We rely heavily on tailscale and this would definitely help a ton! Thank you in advance for any help 😄

zestrells avatar Jun 06 '25 15:06 zestrells

@zestrells v3.2.2 of the action includes the retry

rblaine95 avatar Jun 06 '25 15:06 rblaine95