Netbird Management HA

Open JaSei opened this issue 1 year ago • 10 comments

Is your feature request related to a problem? Please describe.

The management component lacks high availability (HA) support. It is central to other components such as signal and dashboard, and it stores data in a JSON file or, experimentally, in SQLite. Neither storage option can be shared across multiple instances, which prevents high availability for NetBird as a whole.

Describe the solution you'd like

I propose introducing a new database connector for PostgreSQL as an alternative to SQLite. With PostgreSQL support, the service can become stateless and run on container orchestration platforms like Kubernetes or Docker Swarm across multiple instances. This change would enable high availability (HA) by allowing the management component to distribute its load and ensure resilience through redundancy.

Describe alternatives you've considered

  • I tried storing the JSON file and SQLite database on AWS EFS (mounted as NFS) to share storage across multiple instances. However, this approach was unsuccessful as it did not support concurrent access effectively, leading to operational failures in multi-instance setups (only one instance was able to handle requests successfully).
  • Another option is to explore other distributed database solutions or storage mechanisms that support concurrent access and are compatible with the existing architecture. PostgreSQL is the preferred option due to its robustness, scalability, and wide adoption in the industry.
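To illustrate the concurrency limit behind the first alternative, here is a minimal sketch using only Python's standard `sqlite3` module on a local file. It is not NetBird code: the `peers` table and the `second_writer_blocked` helper are made up for the illustration. It shows that SQLite allows only one writer at a time, even on local disk; on NFS-style storage the file locking that enforces this is generally less dependable, which is consistent with the failures described above.

```python
import os
import sqlite3
import tempfile


def second_writer_blocked(db_path: str) -> bool:
    """Return True if a second connection cannot start a write
    transaction while the first connection holds the write lock."""
    # timeout=0: fail immediately instead of waiting for the lock.
    # isolation_level=None: we issue BEGIN/ROLLBACK ourselves.
    first = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    second = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        # Hypothetical table, only for this illustration.
        first.execute(
            "CREATE TABLE IF NOT EXISTS peers (id INTEGER PRIMARY KEY, name TEXT)"
        )
        first.execute("BEGIN IMMEDIATE")  # first "instance" takes the write lock
        first.execute("INSERT INTO peers (name) VALUES ('peer-a')")
        try:
            second.execute("BEGIN IMMEDIATE")  # second "instance" tries to write
        except sqlite3.OperationalError:
            return True  # "database is locked"
        second.execute("ROLLBACK")
        return False
    finally:
        if first.in_transaction:
            first.execute("ROLLBACK")
        first.close()
        second.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(second_writer_blocked(os.path.join(tmp, "store.db")))
```

A client/server database such as PostgreSQL coordinates concurrent writers in the server process instead of relying on file locks, which is why it suits multiple management instances better.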

Additional context

Achieving high availability for the management component is crucial for ensuring the reliability and scalability of services that depend on NetBirdIO. By enabling the management component to operate across multiple instances without storage bottlenecks, users can leverage container orchestration platforms to achieve better resilience and load distribution. This enhancement would greatly improve NetBirdIO's operational capabilities, particularly in production environments where uptime and scalability are crucial.

JaSei avatar Feb 15 '24 09:02 JaSei

Hi @JaSei, thank you for the request and the detailed explanation. I want to let you know that we have plans for PostgreSQL support; see our public roadmap: https://github.com/netbirdio/netbird/projects/2

surik avatar Feb 15 '24 16:02 surik

That's amazing. Thanks for the roadmap. In this case, this ticket is probably useless.

JaSei avatar Feb 15 '24 17:02 JaSei

Hi. Until this is implemented, is there any other way besides replicating the instance with its SQL DB? That is, creating another copy with dedicated relay servers, where any change made to one instance is also applied to the replica. That way we could use a load balancer, and any peer could connect to any instance and get the same routes and policies (of course, each instance would use its own relay). I am wary of relying on a single instance for the whole solution if we plan to replace a traditional VPN with NetBird.

ez1976 avatar Apr 09 '24 16:04 ez1976

What happens if the management service goes down for 1 min, 1 h, or 24 h? How long do existing connections keep working? On my test installation, established connections are still available. I used the quickstart with Keycloak. Besides the backup section, is there any description of how to add multiple signal and coturn servers on different hosts? Do additional signal and coturn servers improve availability?

awapf avatar May 03 '24 10:05 awapf

I am wondering: if NetBird recommends Zitadel by default, and that spins up a CockroachDB instance, wouldn't it make more sense to leverage CockroachDB rather than Postgres?

Tivin-i avatar May 04 '24 00:05 Tivin-i

I noticed that a NetBird client configured to route a local network (e.g. 10.0.0.0/24) will lose its configuration if the management API stays down long enough. This does not happen during short outages, only after a longer period of time.

ykorzikowski avatar Jun 19 '24 18:06 ykorzikowski

Has anyone made a distributed sqlite setup? With something like libsql/Turso, rqlite, LiteFS, Cloudflare D1?

ghost avatar Jul 01 '24 20:07 ghost

I succeeded in deploying the dashboard/management/signal services in a k8s cluster, using Keycloak as the IdP. Keycloak and management use the same PostgreSQL instance.

Everything works perfectly.

Here is my question:

Can I scale the management or signal service replicas from 1 to 2 or more to help with HA?

dkrhodes avatar Jul 28 '24 12:07 dkrhodes

There is now a piece of documentation about the Postgres datastore.

I guess this means (even if still in beta/early access) that HA setups are a thing now?

ednxzu avatar Aug 20 '24 20:08 ednxzu

I configured NetBird with PostgreSQL on k8s, but if I scale management or signal from 1 to 2 replicas, the client can't connect or get correct domain routing. Is there some configuration I need to do, or is this not possible in the open source version?

Thank you

klinux avatar Aug 21 '24 13:08 klinux