
[Feature] In-Memory State Management

Open CodingTil opened this issue 8 months ago • 2 comments

Use case

Under certain loads, using a database for state management in headscale represents a performance bottleneck. To boost performance while ensuring database persistence, we propose to implement an in-memory state management layer.

Description

Profiling the server under certain loads revealed that most CPU time is spent on database operations, causing unresponsiveness and even node disconnections. This was observed on commit 0d3134720ba96e9719bab886525e175c5cfe0147.


Our previous attempts to accelerate DB read operations include:

  • Increasing the SQLite connection pool (#2571)
  • Replacing GORM's default JSONSerializer with a faster one using sonicJSON (references #2513)
  • Caching peers in the Mapper (a hack, only eventually consistent with the DB state)

Although these changes had a positive impact, the overall outcome was still unsatisfactory. Inspired by your in-code comments and proposals on other issues (e.g., https://github.com/juanfont/headscale/blob/d7a503a34effa188e9bb27cb6b0fad2002112fb0/hscontrol/app.go#L529 and https://github.com/juanfont/headscale/issues/2571#issuecomment-2858966975), we considered a reevaluation of the database access setup.

Contribution

  • [x] I can write the design doc for this feature
  • [x] I can contribute this feature

How can it be implemented?

Our proposal concerns a new in-memory state component, sitting between the server and the database. This component should have the following properties:

  • Hold in memory all data relevant to headscale server operation that currently resides in the database
  • Persist state:
    • Initialize state from the DB
    • Write to the DB when necessary (to be determined)
  • Allow concurrent reads of the state

Example struct:

// State holds the global state of the headscale server.
type State struct {
	// state is persisted in the database
	db *db.HSDatabase

	// concurrent reads are permitted
	mutex sync.RWMutex

	// ground truth data
	preauthKeys []types.PreAuthKey
	nodes       types.Nodes
	users       types.Users
}
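A minimal sketch of how such a component could be initialized and accessed, under simplifying assumptions: `Node`, `Store`, `memStore`, `NewState`, `GetNode`, and `SetNode` are hypothetical names invented for illustration and are not headscale's actual types or API. It only shows the locking pattern, where readers share the `sync.RWMutex` and writers hold it exclusively.

```go
package main

import (
	"fmt"
	"sync"
)

// Node is a hypothetical, simplified stand-in for headscale's types.Node.
type Node struct {
	ID       uint64
	Hostname string
}

// Store is a hypothetical stand-in for the persistence layer
// (*db.HSDatabase in the struct above).
type Store interface {
	ListNodes() ([]Node, error)
}

// memStore is a toy Store that exists only to make the sketch runnable.
type memStore struct{ nodes []Node }

func (m memStore) ListNodes() ([]Node, error) { return m.nodes, nil }

// State keeps a copy of all nodes in memory, guarded by an RWMutex.
type State struct {
	mu    sync.RWMutex
	store Store
	nodes map[uint64]Node
}

// NewState initializes the in-memory state from the store.
func NewState(store Store) (*State, error) {
	nodes, err := store.ListNodes()
	if err != nil {
		return nil, fmt.Errorf("loading nodes: %w", err)
	}
	s := &State{store: store, nodes: make(map[uint64]Node, len(nodes))}
	for _, n := range nodes {
		s.nodes[n.ID] = n
	}
	return s, nil
}

// GetNode takes only the read lock, so many readers can proceed concurrently.
func (s *State) GetNode(id uint64) (Node, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	n, ok := s.nodes[id]
	return n, ok
}

// SetNode takes the write lock, excluding readers while the map is updated.
func (s *State) SetNode(n Node) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nodes[n.ID] = n
}

func main() {
	s, err := NewState(memStore{nodes: []Node{{ID: 1, Hostname: "alpha"}}})
	if err != nil {
		panic(err)
	}
	// Several goroutines read at the same time without blocking each other.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.GetNode(1)
		}()
	}
	wg.Wait()
	s.SetNode(Node{ID: 2, Hostname: "beta"})
	n, ok := s.GetNode(2)
	fmt.Println(n.Hostname, ok)
}
```

Returning `Node` by value here means callers cannot mutate the shared map through the result. Whether the real types can be copied cheaply enough for that is an open question.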

We propose one of the following DB-Update schemes:

  1. Immediate DB Updates: Update the DB immediately when the server updates the state.
  2. Scheduled DB Updates: Update the in-memory state immediately, and dump it to the DB every X seconds (e.g., every minute or two) or on SIGTERM.

While individual immediate DB updates are smaller and faster, they may result in a larger number of updates over time. Scheduled DB updates might be simpler to implement and result in more readable code, making them our preferred choice.
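To make the scheduled scheme concrete, here is a rough sketch under stated assumptions: `Persister`, `countingPersister`, `ScheduledState`, `Flush`, and `Run` are hypothetical names, and the node data is reduced to a map of IDs to hostnames. Writes touch only memory and set a dirty flag; a background loop flushes a snapshot on every tick and once more when its context is cancelled (which could be wired to SIGTERM).

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Persister is a hypothetical stand-in for the database layer.
type Persister interface {
	SaveAll(nodes map[uint64]string) error
}

// countingPersister is a toy Persister that records how often it was called.
type countingPersister struct {
	saves int
	last  map[uint64]string
}

func (p *countingPersister) SaveAll(nodes map[uint64]string) error {
	p.saves++
	p.last = nodes
	return nil
}

// ScheduledState updates memory immediately and writes to the DB later.
type ScheduledState struct {
	mu    sync.Mutex
	nodes map[uint64]string // node ID -> hostname (simplified)
	dirty bool
	db    Persister
}

func NewScheduledState(db Persister) *ScheduledState {
	return &ScheduledState{nodes: map[uint64]string{}, db: db}
}

// SetNode changes only the in-memory state and marks it dirty.
func (s *ScheduledState) SetNode(id uint64, hostname string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nodes[id] = hostname
	s.dirty = true
}

// Flush writes a snapshot to the DB if anything changed since the last flush.
func (s *ScheduledState) Flush() error {
	s.mu.Lock()
	if !s.dirty {
		s.mu.Unlock()
		return nil
	}
	snapshot := make(map[uint64]string, len(s.nodes))
	for id, h := range s.nodes {
		snapshot[id] = h
	}
	s.dirty = false
	s.mu.Unlock() // the DB write happens outside the lock

	if err := s.db.SaveAll(snapshot); err != nil {
		s.mu.Lock()
		s.dirty = true // retry on the next tick
		s.mu.Unlock()
		return err
	}
	return nil
}

// Run flushes every interval and once more when ctx is cancelled.
func (s *ScheduledState) Run(ctx context.Context, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := s.Flush(); err != nil {
				fmt.Println("flush failed:", err)
			}
		case <-ctx.Done():
			return s.Flush()
		}
	}
}

func main() {
	db := &countingPersister{}
	s := NewScheduledState(db)
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan error, 1)
	go func() { done <- s.Run(ctx, time.Hour) }()
	s.SetNode(1, "alpha")
	// In a server, cancellation could come from
	// signal.NotifyContext(ctx, syscall.SIGTERM) instead.
	cancel()
	fmt.Println(<-done, db.saves)
}
```

One trade-off this makes visible: if the process is killed without a final flush, changes made since the last tick are lost, which the immediate scheme would avoid.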

If you agree with our proposal, we (@aergus-tng @Enkelmann @CMS-TNG @JanisCasper and me, and potentially a few more colleagues) would like to implement this.

CodingTil avatar May 15 '25 07:05 CodingTil

Interesting. Seems that Web UIs (like Headscale-Admin) which poll the API continually for state changes would benefit as well.

Codelica avatar May 15 '25 20:05 Codelica

> Although these changes had a positive impact, the overall outcome was still unsatisfactory.

I appreciate your help so far and I am sorry that it is still unsatisfactory.

We are aware that the database is a large bottleneck, and we plan to address this. For now, we are on a very good trajectory toward correcting a lot of inconsistencies and faults in Headscale, and the next items on the roadmap will continue to focus on this.

I have plans to start work on this in the future, but for now, we will continue with our plan to "get things right, then make it faster", as there are a lot of moving parts involved and some of the other changes will make this easier to achieve.

I imagine that we will get to it in 3-4 releases, after tags, autogroups, tls/serve.

For now, this is an effort that we really need to be on top of ourselves: there are lots of moving parts, we have to maintain it over time, and we need to design it. So for now we will keep this on hold. I am very grateful for your efforts and will happily accept other, smaller changes as you find them.

And when the time comes for this implementation, feedback and code review would be greatly appreciated!

kradalby avatar May 16 '25 12:05 kradalby

This issue is stale because it has been open for 90 days with no activity.

github-actions[bot] avatar Aug 15 '25 02:08 github-actions[bot]

An initial version of this has been merged in #2670. I have not benchmarked it, so it may only have moved the bottleneck around, but it is a step in the right direction.

kradalby avatar Sep 09 '25 07:09 kradalby