Netbird Management HA
Is your feature request related to a problem? Please describe.
The management component lacks high availability (HA) support. It is central to other components such as signal and dashboard, and it stores data in a JSON file or, experimentally, in SQLite. Neither storage option can be shared across multiple instances, which prevents high availability for NetBird as a whole.
Describe the solution you'd like
I propose introducing a new database connector for PostgreSQL as an alternative to SQLite. With PostgreSQL support, the service can become stateless and run across multiple instances on container orchestration platforms like Kubernetes or Docker Swarm. This would enable high availability (HA) by letting the management component distribute its load and stay resilient through redundancy.
Describe alternatives you've considered
- I tried storing the JSON file and SQLite database on AWS EFS (mounted as NFS) to share storage across multiple instances. However, this approach was unsuccessful as it did not support concurrent access effectively, leading to operational failures in multi-instance setups (only one instance was able to handle requests successfully).
- Another option is to explore other distributed database solutions or storage mechanisms that support concurrent access and are compatible with the existing architecture. PostgreSQL is the preferred option due to its robustness, scalability, and wide adoption in the industry.
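The single-writer behaviour behind the first alternative can be reproduced locally. The following is a minimal sketch, not NetBird code: the `peers` table, the `store.db` file name, and the function name are hypothetical, and two connections in one process stand in for two management instances sharing the same SQLite file.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked() -> bool:
    """Return True if a second connection is refused while the first holds the write lock."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "store.db")  # hypothetical file name
        # timeout=0: fail immediately instead of waiting for the lock.
        # isolation_level=None: autocommit mode, so transactions are explicit.
        first = sqlite3.connect(db_path, timeout=0, isolation_level=None)
        second = sqlite3.connect(db_path, timeout=0, isolation_level=None)
        try:
            first.execute("CREATE TABLE peers (id TEXT PRIMARY KEY)")
            first.execute("BEGIN IMMEDIATE")  # "instance 1" takes the write lock
            first.execute("INSERT INTO peers VALUES ('peer-a')")
            try:
                second.execute("BEGIN IMMEDIATE")  # "instance 2" tries to write
            except sqlite3.OperationalError:
                return True  # typically reported as "database is locked"
            second.execute("ROLLBACK")
            return False
        finally:
            if first.in_transaction:
                first.execute("ROLLBACK")
            first.close()
            second.close()


if __name__ == "__main__":
    print("second writer blocked:", second_writer_is_blocked())
```

On a local disk the second writer can simply wait and retry. On a network filesystem such as NFS, the SQLite documentation warns that file locking may be unreliable, so sharing the file between hosts is likely to behave worse than this sketch suggests. A server database such as PostgreSQL coordinates concurrent writers itself, which is the motivation for the proposal above.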
Additional context
Achieving high availability for the management component is crucial for ensuring the reliability and scalability of services that depend on NetBirdIO. By enabling the management component to operate across multiple instances without storage bottlenecks, users can leverage container orchestration platforms to achieve better resilience and load distribution. This enhancement would greatly improve NetBirdIO's operational capabilities, particularly in production environments where uptime and scalability are crucial.
Hi @JaSei, thank you for the request and the detailed explanation. I want to let you know that we have plans for PostgreSQL support; see our public roadmap: https://github.com/netbirdio/netbird/projects/2
That's amazing. Thanks for the roadmap. In this case, this ticket is probably useless.
Hi. Until this is implemented, is there any other way besides replicating the instance along with its SQL DB? That is, creating another copy with dedicated relay servers, with any change made to one instance also applied to the replica. That way we could use a load balancer, and any peer could connect to any instance and get the same routes and policies (of course, each instance would use its own relay). I am afraid of relying on only one instance for the whole solution if we plan to replace a traditional VPN with NetBird.
What happens if the management service goes down for 1 min, 1 h, or 24 h? How long do existing connections keep working? On my test installation, established connections are still available. I used the quickstart with Keycloak. Besides the backup section, is there any description of how to add multiple signal and coturn servers on different hosts? Do additional signal and coturn servers improve availability?
I am wondering, if by default netbird is recommending Zitadel and that spins up a cockroachDB instance, wouldn't it make more sense to leverage cockroachDB rather than Postgres?
I noticed that a NetBird client configured to route a local network (e.g. 10.0.0.0/24) will lose its configuration if the management API stays down long enough. This does not happen on short outages, only after a longer period of time.
Has anyone made a distributed sqlite setup? With something like libsql/Turso, rqlite, LiteFS, Cloudflare D1?
I succeeded in deploying the dashboard/management/signal services in a k8s cluster, using Keycloak as the IdP. Keycloak and management use the same PostgreSQL instance.
Everything works perfectly.
Here is my question:
Can I scale the management or signal service replicas from 1 to 2 or more to help with HA?
There is this piece of doc now about postgres datastore.
I guess this means (even if still in beta/early access) that HA setups are a thing now?
I configured NetBird with PostgreSQL on k8s, but if I scale management or signal from 1 to 2 replicas, the client can't connect or get correct domain routing. Are there some configurations I need to set, or is this not possible in the open source version?
Thank you