Unable to connect to headscale

Open anton-livewyer opened this issue 1 year ago • 7 comments

Bug description

Hi,

First of all, I want to say thank you for the cool product you have built!

We have run into an issue with our headscale server several times. When a tailscale client tries to connect to headscale, the browser shows a "could not register machine" error (we use OIDC with Google as the provider). On the server, the following happens at that time:

  1. The error ERR Failed to persist/update machine in the database error="database is locked (5) (SQLITE_BUSY)" handler=PollNetMap machine=<NODE_NAME> appears in the log.
  2. Then the error ERR Failed to persist/update machine in the database error="SQL logic error: cannot start a transaction within a transaction (1)" handler=PollNetMap machine=<NODE_NAME> floods the server log, with a different machine name in the <NODE_NAME> field each time.

I find it hard to say what exactly causes this, but twice it happened right after two separate users updated their local macOS tailscale clients to the latest version and were then unable to connect to the server. Once the issue appears, every user who reconnects to the server gets the same error. My understanding is that headscale tries to write some data to the database when a user connects but cannot, because the database is locked.
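
For readers who want to see the two SQLite errors from the log in isolation, here is a minimal sketch using Python's standard sqlite3 module. It is not headscale code and does not claim to show how headscale reaches this state. The table name machines and the function reproduce_errors are made up for illustration. It only shows that a second writer on a locked file gets "database is locked", and that issuing BEGIN on a connection that already has an open transaction gets the "within a transaction" error.

```python
import os
import sqlite3
import tempfile


def reproduce_errors(db_path):
    """Return the SQLite error messages for lock contention and a nested BEGIN.

    Hypothetical helper for illustration only; not part of headscale.
    """
    # isolation_level=None means we issue BEGIN/COMMIT ourselves.
    # timeout=0 makes a blocked writer fail at once instead of waiting.
    writer_a = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    writer_b = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    messages = []
    try:
        writer_a.execute("CREATE TABLE IF NOT EXISTS machines (name TEXT)")

        # Writer A takes the write lock and keeps its transaction open.
        writer_a.execute("BEGIN IMMEDIATE")
        writer_a.execute("INSERT INTO machines VALUES ('node-a')")

        try:
            # Writer B cannot obtain the write lock while A holds it.
            writer_b.execute("BEGIN IMMEDIATE")
        except sqlite3.OperationalError as exc:
            messages.append(str(exc))

        try:
            # A second BEGIN on a connection with an open transaction is rejected.
            writer_a.execute("BEGIN")
        except sqlite3.OperationalError as exc:
            messages.append(str(exc))

        writer_a.execute("ROLLBACK")
    finally:
        writer_a.close()
        writer_b.close()
    return messages


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
    for message in reproduce_errors(path):
        print(message)
```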

According to the log message header the issue is linked to this function

Environment

  • OS: Ubuntu 20.04
  • Headscale version: 0.20.0
  • Tailscale version: definitely occurred on macOS 1.38.1
  • Database: .sqlite file stored on the server
  • [ ] Headscale is behind a (reverse) proxy
  • [ ] Headscale runs in a container

To Reproduce

We tried to reproduce this by installing an older version of the tailscale client (on both Windows and Linux; we cannot test on Mac), connecting to the server, and then updating the client to the latest version offered by tailscale, since that is the only trigger we are aware of. We were not able to reproduce it, so my understanding is that this is a bug.

anton-livewyer avatar Jan 11 '24 14:01 anton-livewyer

@anton-livewyer the latest stable version is 0.22.3. Is this reproducible in that version?

TotoTheDragon avatar Feb 11 '24 19:02 TotoTheDragon

We are seeing this on 0.22.3. Not sure if it's a coincidence, but a lot of our users upgraded their tailscale clients from 1.56.x to 1.58.x today.

sthomson-wyn avatar Feb 13 '24 18:02 sthomson-wyn

I'll also mention that this seems to occur after we restart our headscale deployment in Kubernetes. I imagine that any brief overlap between pod uptimes may be the cause of the DB locking.
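
As a rough illustration of the overlap theory, the sketch below uses two SQLite connections to stand in for an old and a new instance sharing one database file. It is plain Python with the standard sqlite3 module, not headscale code. The names write_with_overlap, old_instance and machines are hypothetical. It shows only how SQLite's busy timeout changes the outcome for the second writer. It does not suggest that running two instances against the same file is safe.

```python
import os
import sqlite3
import tempfile
import threading
import time


def write_with_overlap(db_path, busy_timeout_s):
    """Attempt a write while another connection briefly holds the write lock.

    Returns True if the second writer succeeded, False if SQLite reported
    an error such as "database is locked". Hypothetical helper for illustration.
    """
    setup = sqlite3.connect(db_path)
    setup.execute("CREATE TABLE IF NOT EXISTS machines (name TEXT)")
    setup.commit()
    setup.close()

    lock_held = threading.Event()

    def old_instance():
        # Stands in for the instance that is still shutting down.
        conn = sqlite3.connect(db_path, isolation_level=None)
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("INSERT INTO machines VALUES ('old')")
        lock_held.set()
        time.sleep(0.3)  # hold the write lock for a short while
        conn.execute("COMMIT")
        conn.close()

    worker = threading.Thread(target=old_instance)
    worker.start()
    lock_held.wait()

    # Stands in for the instance that has just started.
    new_conn = sqlite3.connect(
        db_path, timeout=busy_timeout_s, isolation_level=None
    )
    try:
        new_conn.execute("BEGIN IMMEDIATE")
        new_conn.execute("INSERT INTO machines VALUES ('new')")
        new_conn.execute("COMMIT")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        new_conn.close()
        worker.join()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "overlap.sqlite")
    print("no busy timeout:", write_with_overlap(path, 0))
    print("5 s busy timeout:", write_with_overlap(path, 5.0))
```

With no timeout the second writer fails immediately. With a timeout longer than the overlap it waits for the lock and then proceeds.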

sthomson-wyn avatar Feb 13 '24 18:02 sthomson-wyn

Yes, that makes a lot of sense. I do not expect this to be fixed in v0.22, but I will open a ticket to make sure the database is properly closed on kill in v0.23.

For your current use case, switching to Postgres might be a viable solution to the locking problem.

TotoTheDragon avatar Feb 13 '24 18:02 TotoTheDragon

We're currently switching to a StatefulSet instead of a Deployment (we should have done that in the first place) to address the overlap.

Postgres is a good idea, we'll do that later too. Thanks @TotoTheDragon

sthomson-wyn avatar Feb 13 '24 18:02 sthomson-wyn

Alright, when you have tested the new environment, please let us know if anything has changed.

TotoTheDragon avatar Feb 13 '24 18:02 TotoTheDragon

This issue is stale because it has been open for 90 days with no activity.

github-actions[bot] avatar May 14 '24 01:05 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar May 21 '24 01:05 github-actions[bot]