Setup step is failing on first run across multiple repos
We have multiple repos using the v2 and v3 versions of the action. Both are pinned to 1.82.5 and are consistently failing with:
Run echo "TAILSCALE_LATEST=$(curl -s "https://pkgs.tailscale.com/stable/?mode=json" | jq -r .Version)" \
echo "TAILSCALE_LATEST=$(curl -s "https://pkgs.tailscale.com/stable/?mode=json" | jq -r .Version)" \
>> "$GITHUB_OUTPUT"
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
Run tailscale/github-action@v2
with:
oauth-client-id: ***
oauth-secret: ***
tags: <tags>
version: 1.82.5
Run if [ X64 = "ARM64" ]; then
if [ X64 = "ARM64" ]; then
  TS_ARCH="arm64"
elif [ X64 = "ARM" ]; then
  TS_ARCH="arm"
elif [ X64 = "X86" ]; then
  TS_ARCH="386"
elif [ X64 = "X64" ]; then
  TS_ARCH="amd64"
else
  TS_ARCH="amd64"
fi
MINOR=$(echo "$VERSION" | awk -F '.' {'print $2'})
if [ $((MINOR % 2)) -eq 0 ]; then
  URL="https://pkgs.tailscale.com/stable/tailscale_${VERSION}_${TS_ARCH}.tgz"
else
  URL="https://pkgs.tailscale.com/unstable/tailscale_${VERSION}_${TS_ARCH}.tgz"
fi
if ! [[ "$SHA256SUM" ]] ; then
  SHA256SUM="$(curl -H user-agent:tailscale-github-action -L "${URL}.sha256")"
fi
curl -H user-agent:tailscale-github-action -L "$URL" -o tailscale.tgz --max-time 300
echo "$SHA256SUM tailscale.tgz" | sha256sum -c
tar -C /tmp -xzf tailscale.tgz
rm tailscale.tgz
TSPATH=/tmp/tailscale_${VERSION}_${TS_ARCH}
sudo mv "${TSPATH}/tailscale" "${TSPATH}/tailscaled" /usr/bin
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
env:
VERSION: 1.82.5
SHA256SUM:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 64 100 64 0 0 438 0 --:--:-- --:--:-- --:--:-- 441
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 81 100 81 0 0 615 0 --:--:-- --:--:-- --:--:-- 618
0 30.0M 0 8499 0 0 27476 0 0:19:06 --:--:-- 0:19:06 27476
100 30.0M 100 30.0M 0 0 37.2M 0 --:--:-- --:--:-- --:--:-- 60.4M
tailscale.tgz: OK
Run sudo -E tailscaled --state=mem: ${ADDITIONAL_DAEMON_ARGS} 2>~/tailscaled.log &
sudo -E tailscaled --state=mem: ${ADDITIONAL_DAEMON_ARGS} 2>~/tailscaled.log &
# And check that tailscaled came up. The CLI will block for a bit waiting
# for it. And --json will make it exit with status 0 even if we're logged
# out (as we will be). Without --json it returns an error if we're not up.
sudo -E tailscale status --json >/dev/null
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
env:
ADDITIONAL_DAEMON_ARGS:
Run if [ -z "${HOSTNAME}" ]; then
if [ -z "${HOSTNAME}" ]; then
  HOSTNAME="github-$(cat /etc/hostname)"
fi
if [ -n "***" ]; then
  TAILSCALE_AUTHKEY="***?preauthorized=true&ephemeral=true"
  TAGS_ARG="--advertise-tags=<tags>"
fi
timeout 5m sudo -E tailscale up ${TAGS_ARG} --authkey=${TAILSCALE_AUTHKEY} --hostname=${HOSTNAME} --accept-routes ${ADDITIONAL_ARGS}
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
env:
TAILSCALE_AUTHKEY:
ADDITIONAL_ARGS:
HOSTNAME:
TS_EXPERIMENT_OAUTH_AUTHKEY: true
context canceled
Error: Process completed with exit code 124.
Our workflow files have this step:
- name: Activate TailScale
  uses: tailscale/github-action@v3
  with:
    oauth-client-id: ${{ inputs.TS_OAUTH_ID }}
    oauth-secret: ${{ inputs.TS_OAUTH_SECRET }}
    tags: <tags>
    version: latest
    use-cache: true
Any thoughts on what could be going on?
I have the same thing happening, starting today, on my v3 actions. Some of my runners will connect while others eventually time out.
Running version 1.82.5
timeout: sending signal TERM to command ‘sudo’
timeout: sending signal KILL to command ‘sudo’
/home/runner/work/_temp/4782142e-bc26-4e1c-a2c8-b88c8d232520.sh: line 15: 399 Killed timeout --verbose --kill-after=1s ${TIMEOUT} ${MAYBE_SUDO} tailscale up ${TAGS_ARG} --authkey=${TAILSCALE_AUTHKEY} --hostname=${HOSTNAME} --accept-routes ${ADDITIONAL_ARGS}
Same here..
We've also been seeing this sporadically for a few weeks, but it's been especially bad in the last few days. The tailscale up command in the action just times out. Also using 1.82.5.
...
timeout --verbose --kill-after=1s ${TIMEOUT} ${MAYBE_SUDO} tailscale up ${TAGS_ARG} --authkey=${TAILSCALE_AUTHKEY} --hostname=${HOSTNAME} --accept-routes ${ADDITIONAL_ARGS}
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
env:
RESOLVED_VERSION: 1.82.5
TS_ARCH: amd64
SHA256SUM: 41a8931fa52055bd7ea4b51df9acff2ba2d4e9000c2380b667539b5b99991464
ADDITIONAL_ARGS:
HOSTNAME:
TAILSCALE_AUTHKEY:
TIMEOUT: 2m
timeout: sending signal TERM to command ‘sudo’
context canceled
Same
timeout: sending signal TERM to command ‘sudo’
context canceled
Error: Process completed with exit code 124.
Any workaround for this problem?
I'm not sure what the root cause is, but something is clearly failing or getting stuck on the tailscale up command, so we just modified the action and implemented retries. It seems to be holding up so far:
for i in {1..5}; do
  echo "Attempt $i to bring up Tailscale..."
  timeout --verbose --kill-after=1s ${TIMEOUT} ${MAYBE_SUDO} tailscale up ${TAGS_ARG} --authkey=${TAILSCALE_AUTHKEY} --hostname=${HOSTNAME} --accept-routes ${ADDITIONAL_ARGS} && break
  echo "Tailscale up failed. Retrying in $((i * 5)) seconds..."
  sleep $((i * 5))
done
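For context, each attempt still runs under the action's ${TIMEOUT}, and the sleep backs off linearly (5, 10, 15, 20, 25 seconds), so the step only fails if all five attempts time out.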
We are impacted by this as well. Sporadic failures across many of our repos.
The same problem here; we're getting timeout: sending signal TERM to command ‘sudo’ on 30% of runs.
Any updates on this?
This has been happening on 100% of our runs starting today. Between the binary download issue and now this, this action has been our number one source of frustration.
I reached out to Tailscale Support last week, and there is now an open status page incident for this (the May 15th one), but I'm still seeing the failures as well, so there's no resolution that I'm aware of.
I think the retry approach described above is probably the best workaround available right now.
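If forking the action isn't an option, a rough workflow-level equivalent is to let the first attempt fail softly and only run a second one when it does. This is an untested sketch: the ts_first_try step id is just a placeholder, and the inputs simply mirror the snippet earlier in this issue.

# First attempt: continue-on-error keeps the job green even if this step fails
- name: Activate TailScale
  id: ts_first_try
  continue-on-error: true
  uses: tailscale/github-action@v3
  with:
    oauth-client-id: ${{ inputs.TS_OAUTH_ID }}
    oauth-secret: ${{ inputs.TS_OAUTH_SECRET }}
    tags: <tags>
    version: latest
    use-cache: true

# Retry: only runs if the first attempt actually failed
- name: Activate TailScale (retry)
  if: steps.ts_first_try.outcome == 'failure'
  uses: tailscale/github-action@v3
  with:
    oauth-client-id: ${{ inputs.TS_OAUTH_ID }}
    oauth-secret: ${{ inputs.TS_OAUTH_SECRET }}
    tags: <tags>
    version: latest
    use-cache: true

continue-on-error means the job keeps going after a timeout on the first step, while steps.ts_first_try.outcome still reports 'failure', so the retry step only runs when it's needed. It obviously won't help if the outage outlasts both attempts.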
Hey, thanks for bringing this issue to our attention - and apologies for the recent instability.
@mikepilat is correct: the problem is caused by sporadic latency spikes when new nodes attempt to join a tailnet, which can cause the action to time out.
We deployed a series of fixes to the platform over the last few days that hopefully made the action more stable. Looking ahead, our platform team is actively working on broader changes to better mitigate these issues, but those improvements are more involved and will take more time to roll out.
In the meantime, we'll incorporate the retry logic suggested by @andrea-armstrong to make the action more resilient against platform hiccups. I'm fully aware "sleep 10" is not ideal, but we hope it helps reduce friction as we work toward a long-term fix.
Hi @mcoulombe ! I see the commit that fixes this and adds the retry, but are y'all planning on cutting a new v3.x release of the tailscale action that includes it? We rely heavily on tailscale and this would definitely help a ton! Thank you in advance for any help 😄
@zestrells v3.2.2 of the action includes the retry