[Architecture] Hub-Spoke model for federation?

Open XtremeOwnageDotCom opened this issue 1 year ago • 19 comments

Question

One of the biggest issues I am currently seeing with Lemmy is federation.

Either...

  1. Federation between instances is having issues.
  2. Federation is backed up.
  3. Federation / syncing is not scaling.

Etc.

Currently, federation operates in a full-mesh topology: instances all talk directly to other instances.

Full mesh image: image

Why is this a problem?

As the number of instances scales up, the number of connections (and with it federation/replication traffic) grows quadratically, i.e. with the square of the number of instances.

In other words, it does not scale well.

It is the same reason networks are broken up into separate broadcast domains. It is why we have subnets.

To add some math for clarification, let's say we have 10,000 instances.

To calculate the number of required connections for a full mesh of n instances, the formula is w = n * (n - 1) / 2

10,000 * (10,000 - 1) / 2 = 49,995,000 required connections.
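
To make the growth concrete, here is a minimal Python sketch of the formula above. The function name `full_mesh_links` is my own, and it simply counts pairs of instances, assuming every instance federates with every other instance.

```python
def full_mesh_links(n: int) -> int:
    """Number of distinct instance pairs in a full mesh of n instances."""
    # Each of the n instances pairs with the other n - 1; divide by 2
    # so that the pair (A, B) is not counted again as (B, A).
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (10, 100, 1_000, 10_000):
        print(f"{n:,} instances -> {full_mesh_links(n):,} links")
```

For 10,000 instances this gives 49,995,000, matching the figure above. Note that multiplying the instance count by 10 multiplies the link count by roughly 100.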

What if a hub-spoke topology were adopted for federation?

Instead of instances talking to each other, instances talk to "hub" servers.

Hub Spoke Image: image

The hub servers would need to be big, beefy servers, and would only run services for content replication, syncing, and federation. They would also replicate amongst each other in a full mesh.

If implemented, instance -> fediverse replication/federation issues would be greatly reduced, since either everything works or nothing works.

Using the same numbers as above, let's still assume there are 10,000 instances.

Using hub-spoke, let's say each instance only needs to maintain concurrent connectivity to a single hub (but keeps a list of backup hubs in case the main hub goes down).
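
The "main hub plus backup list" idea could look something like the sketch below. This is purely illustrative: nothing like it exists in Lemmy today, and the hub hostnames, the `pick_hub` function, and the `is_reachable` callback are all hypothetical names I made up for the example.

```python
from typing import Callable, Iterable, Optional


def pick_hub(
    primary: str,
    backups: Iterable[str],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Return the first reachable hub, trying the primary before the backups."""
    for hub in [primary, *backups]:
        if is_reachable(hub):
            return hub
    # No hub is reachable: federation for this instance is down entirely.
    return None


if __name__ == "__main__":
    # Pretend the primary hub is offline (hypothetical hostnames).
    offline = {"hub-a.example"}
    chosen = pick_hub(
        "hub-a.example",
        ["hub-b.example", "hub-c.example"],
        is_reachable=lambda hub: hub not in offline,
    )
    print(chosen)  # hub-b.example
```

The instance only ever holds one active hub connection at a time; the backups are just a list it walks when the current hub stops responding.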

With one hub connection per instance, this only requires 10,000 connections (plus NumberOfHubs * (NumberOfHubs - 1) / 2 for the hub-to-hub mesh). This is significantly less than the roughly 50 million connections in the current full-mesh topology.

In this example, that is about a 99.98% reduction in the number of concurrent connections.
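
Here is a quick Python sketch of that comparison. The hub count of 10 is an assumption I picked for illustration (the proposal does not fix a number), and the function names are my own.

```python
def full_mesh_links(n: int) -> int:
    """Number of distinct pairs in a full mesh of n nodes."""
    return n * (n - 1) // 2


def hub_spoke_links(instances: int, hubs: int) -> int:
    """One link per instance to its hub, plus a full mesh between the hubs."""
    return instances + full_mesh_links(hubs)


def reduction_pct(instances: int, hubs: int) -> float:
    """Percent fewer links in hub-spoke compared to a full mesh of instances."""
    mesh = full_mesh_links(instances)
    return 100 * (1 - hub_spoke_links(instances, hubs) / mesh)


if __name__ == "__main__":
    instances, hubs = 10_000, 10  # hubs=10 is an assumed value
    print(f"full mesh: {full_mesh_links(instances):,}")
    print(f"hub-spoke: {hub_spoke_links(instances, hubs):,}")
    print(f"reduction: {reduction_pct(instances, hubs):.2f}%")
```

With 10 hubs, the hub-to-hub mesh adds only 45 links, for 10,045 in total, so the reduction still rounds to 99.98%. The hub mesh only starts to matter once the number of hubs itself gets large.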

Benefits

  1. The amount of federation/replication traffic would be greatly decreased, as individual instances only need to sync with a hub server.
  2. Federation/replication issues would be vastly reduced, as either everything works or nothing works.
  3. The barrier to entry would be vastly reduced for new instances as well.
  4. Potential to allow for a centralized "directory/phonebook" for instances/communities, which can be integrated into the UI, allowing users to EASILY discover/subscribe to new communities (as opposed to needing to discover the community through Google or other tools).

Potential downsides

  1. Introduction of a new failure domain.
  2. Hub servers would need to be trusted/open.
  3. Hub servers need to be scaled to properly handle load.

Potential mitigations to downsides

  1. Instances can OPTIONALLY choose to use hubs rather than full meshing. This eliminates most of the concerns.
  2. Instance admins get the capability to decide whether individual communities should be directly connected/federated, or federated through a hub.
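
As a rough illustration of those two mitigations together, an instance-level policy could be modeled like this. It is only a sketch under my own assumptions: the `FederationPolicy` class, its fields, and the community names are hypothetical and do not correspond to any existing Lemmy setting.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FederationPolicy:
    """Hypothetical per-instance policy: hubs are opt-in, with per-community overrides."""

    use_hub: bool = False  # hubs are optional; default is today's direct federation
    direct_communities: Set[str] = field(default_factory=set)

    def route(self, community: str) -> str:
        """Return "direct" or "hub" for the given community."""
        if not self.use_hub or community in self.direct_communities:
            return "direct"
        return "hub"


if __name__ == "__main__":
    policy = FederationPolicy(
        use_hub=True,
        direct_communities={"selfhosted@lemmy.example"},  # hypothetical community
    )
    print(policy.route("selfhosted@lemmy.example"))  # direct
    print(policy.route("news@other.example"))        # hub
```

An admin who never opts in keeps exactly the current behavior, and one who does opt in can still pin specific communities to direct federation.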

Alternatives?

  • Limit Lemmy to only a handful of main instances. This defeats most of the benefits of the platform, however.
  • Strictly limit federation to massively reduce load. This can work, but it also ruins many benefits of the platform and vastly degrades the user experience. Would not recommend this.
  • Optimizations of the current implementation can band-aid this issue, but only up to a point.

XtremeOwnageDotCom, Jun 21 '23 15:06