"failed to connect to local tailscaled" on self-hosted runners
Running my workflow on GitHub's hosted runners works fine, but when I switch to self-hosted runners (using a DinD approach, ultimately running in a `nimmis/ubuntu:latest` container) the action fails (GitHub workflow output):
```
sudo -E tailscaled --state=mem: ${ADDITIONAL_DAEMON_ARGS} 2>~/tailscaled.log &
# And check that tailscaled came up. The CLI will block for a bit waiting
# for it. And --json will make it exit with status 0 even if we're logged
# out (as we will be). Without --json it returns an error if we're not up.
sudo -E tailscale status --json >/dev/null
shell: bash --noprofile --norc -e -o pipefail {0}
env:
  ADDITIONAL_DAEMON_ARGS:
failed to connect to local tailscaled; it doesn't appear to be running (sudo systemctl start tailscaled ?)
Error: Process completed with exit code 1.
```
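Since the daemon's stderr is redirected to `~/tailscaled.log`, the first thing worth checking is that file. A small debug helper like this (hypothetical, not part of the action; the default path matches the `2>~/tailscaled.log` redirect above) could be dropped into a follow-up step:

```shell
# Hypothetical debug helper: print the daemon log if it has content,
# otherwise say so instead of failing silently.
show_log() {
  local f="${1:-$HOME/tailscaled.log}"
  if [ -s "$f" ]; then
    cat "$f"
  else
    echo "log file missing or empty: $f"
  fi
}

show_log
```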
I think the issue may have something to do with the CLI not blocking as expected: as the timestamps in the screenshot below show, the entire step completes in seconds.
The only change I made to the `nimmis/ubuntu:latest` image was to run:

```shell
apt-get update
apt-get install -y sudo --fix-missing
```

because `sudo` is not installed by default (the only user is root).
~~Maybe some other error is being thrown by `sudo -E tailscale status --json` but it's being swallowed by the `>/dev/null`?~~
EDIT: No, that only redirects stdout, not stderr... 🤔
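A quick demonstration of that point: `>` alone redirects only stdout (fd 1), so anything written to stderr (fd 2) still reaches the console. `msg` here is a stand-in for the CLI:

```shell
# '>' redirects only fd 1 (stdout); fd 2 (stderr) is untouched.
msg() { echo "on stdout"; echo "on stderr" >&2; }

# Discard stdout, then capture what's left on stderr via 2>&1.
only_stderr=$( { msg >/dev/null; } 2>&1 )
echo "$only_stderr"   # prints: on stderr
```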
I tried copying the action into my repo so I could experiment with it. Even after removing the `sudo`s (it's running as root anyway) and adding a retry loop around the `tailscale status` call, it still fails to find the tailscaled process (GitHub workflow output):
```
set -xv
if [ "$STATEDIR" == "" ]; then
  STATE_ARGS="--state=mem:"
else
  STATE_ARGS="--statedir=${STATEDIR}"
  mkdir -p "$STATEDIR"
fi
tailscaled ${STATE_ARGS} ${ADDITIONAL_DAEMON_ARGS} 2>~/tailscaled.log &
# And check that tailscaled came up. The CLI will block for a bit waiting
# for it. And --json will make it exit with status 0 even if we're logged
# out (as we will be). Without --json it returns an error if we're not up.
# Retry mechanism for tailscale status
for i in {1..10}; do
  tailscale status --json >/dev/null && break || sleep 5
done
shell: bash --noprofile --norc -e -o pipefail {0}
env:
  ADDITIONAL_DAEMON_ARGS:
  STATEDIR:
```
```
if [ "$STATEDIR" == "" ]; then
  STATE_ARGS="--state=mem:"
else
  STATE_ARGS="--statedir=${STATEDIR}"
  mkdir -p "$STATEDIR"
fi
+ '[' '' == '' ']'
+ STATE_ARGS=--state=mem:
tailscaled ${STATE_ARGS} ${ADDITIONAL_DAEMON_ARGS} 2>~/tailscaled.log &
# And check that tailscaled came up. The CLI will block for a bit waiting
# for it. And --json will make it exit with status 0 even if we're logged
# out (as we will be). Without --json it returns an error if we're not up.
# Retry mechanism for tailscale status
for i in {1..10}; do
  tailscale status --json >/dev/null && break || sleep 5
done
+ for i in {1..10}
+ tailscale status --json
+ tailscaled --state=mem:
failed to connect to local tailscaled; it doesn't appear to be running (sudo systemctl start tailscaled ?)
+ sleep 5
```
(I removed the repeated retries for brevity)
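As an aside, the retry loop above has a subtle flaw: `cmd && break || sleep 5` exits 0 even when every attempt fails, because the final `sleep` succeeds. A sketch of a generic helper (not from the action) that fails loudly instead:

```shell
# wait_for N CMD...: retry CMD up to N times (1s apart); return 0 as soon
# as CMD succeeds, or non-zero if it never does.
wait_for() {
  local tries=$1; shift
  local i
  for i in $(seq 1 "$tries"); do
    "$@" >/dev/null 2>&1 && return 0
    sleep 1
  done
  return 1
}

# In the action this would look something like:
#   wait_for 10 tailscale status --json || { cat ~/tailscaled.log >&2; exit 1; }
```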
Ah, here's something useful! I stopped redirecting stderr to a file by removing `2>~/tailscaled.log` and got this additional output:
```
2024/10/21 20:56:59 logtail started
2024/10/21 20:56:59 Program starting: v1.72.1-tc02a15244-g5c00d019b, Go 1.22.5: []string{"tailscaled", "--state=mem:"}
2024/10/21 20:56:59 LogID: 6c0a127189e4fc2f4a4bc59fbb8ed0b5597478fcce428b747951758e5da99be1
2024/10/21 20:56:59 logpolicy: using system state directory "/var/lib/tailscale"
logpolicy.ConfigFromFile /var/lib/tailscale/tailscaled.log.conf: open /var/lib/tailscale/tailscaled.log.conf: no such file or directory
logpolicy.Config.Validate for /var/lib/tailscale/tailscaled.log.conf: config is nil
2024/10/21 20:56:59 dns: [rc=unknown ret=direct]
2024/10/21 20:56:59 dns: using "direct" mode
2024/10/21 20:56:59 dns: using *dns.directManager
2024/10/21 20:56:59 linuxfw: clear iptables: exec: "iptables": executable file not found in $PATH
2024/10/21 20:56:59 linuxfw: clear ip6tables: exec: "ip6tables": executable file not found in $PATH
2024/10/21 20:56:59 cleanup: list tables: netlink receive: operation not permitted
2024/10/21 20:56:59 wgengine.NewUserspaceEngine(tun "tailscale0") ...
2024/10/21 20:56:59 Linux kernel version: 6.1.109
2024/10/21 20:56:59 is CONFIG_TUN enabled in your kernel? `modprobe tun` failed with:
2024/10/21 20:56:59 tun module not loaded nor found on disk
2024/10/21 20:56:59 wgengine.NewUserspaceEngine(tun "tailscale0") error: tstun.New("tailscale0"): CreateTUN("tailscale0") failed; /dev/net/tun does not exist
2024/10/21 20:56:59 flushing log.
2024/10/21 20:56:59 logger closing down
2024/10/21 20:56:59 getLocalBackend error: createEngine: tstun.New("tailscale0"): CreateTUN("tailscale0") failed; /dev/net/tun does not exist
```
So the problem appears to be that tailscaled expects a TUN device (`/dev/net/tun`) to exist, and the error explaining that was being swallowed by the log redirect.
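Given that, a pre-flight check before launching tailscaled would make this failure obvious. `require_tun` here is a hypothetical helper, not part of the action:

```shell
# Hypothetical pre-flight check: fail fast with a clear message when the
# TUN character device is absent, instead of letting tailscaled die quietly.
require_tun() {
  local dev="${1:-/dev/net/tun}"
  if [ ! -c "$dev" ]; then
    echo "error: $dev is missing; run the container with" >&2
    echo "  --cap-add=NET_ADMIN --device=/dev/net/tun" >&2
    return 1
  fi
}

# e.g.  require_tun || exit 1
```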
For posterity: I was able to work around this by adding the TUN device to the container in my workflow, like so:
```yaml
container:
  image: <image>
  options: --cap-add=NET_ADMIN --device=/dev/net/tun
```
If there's an action item to come out of this, it's that the `tailscaled` command should share stderr with the console, either via something like `tee` (which writes to a file and the console at the same time) or by printing the contents of `tailscaled.log` when it's non-empty.
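The `tee` idea can be sketched with process substitution. A stand-in function replaces `tailscaled` here so the behavior is easy to verify; the same `2> >(tee ...)` redirection would apply to the real daemon invocation:

```shell
# stderr is duplicated: one copy lands in the log file, one is sent back
# to fd 2 by tee's own redirect, so it still shows up on the console.
log=$(mktemp)
fake_daemon() { echo "daemon error" >&2; }

fake_daemon 2> >(tee "$log" >&2)
sleep 0.5            # give the tee process a moment to flush the file
cat "$log"           # prints: daemon error
```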
I'm having the same issue, is there any other workaround?
@bryan-rhm

> I'm having the same issue, is there any other workaround?

See my previous comment. In my case, I just had to add those options to the workflow file.
In my case I fixed it by modifying the runner image to mount the host's /dev/net/tun into the runner container, since I'm using gha-runner-scale-set in Kubernetes.
This is the container template for the scale set: https://github.com/jamezrin/personal-actions-runner-setup/blob/main/gha-runner-scale-set-dind-fix.yaml
@jasonbecker-os made a very good point about the stdout and stderr streams. I opened this PR, which was enough to surface the output needed to debug the issue: https://github.com/tailscale/github-action/pull/154
This may or may not work for self-hosted runners, but since the error messages are the same when running on nektos/act, it sounds like it might work without having to modify the container: https://github.com/tailscale/github-action/issues/120#issuecomment-2571469160