[Architecture] Scaling federation throughput via message queues and workers
Requirements
- [X] Is this a feature request? For questions or discussions use https://lemmy.ml/c/lemmy_support
- [X] Did you check to see if this issue already exists?
- [X] Is this only a feature request? Do not put multiple feature requests in one issue.
- [x] Is this a UI / front end issue? Use the lemmy-ui repo.
(Checking that checkbox even though this is not a front-end issue; I will not be using the lemmy-ui repo for this.)
Is your proposal related to a problem?
Larger instances are currently backed up with outbound federation. Lemmy.world has bumped the federation worker count up to 10,000, and outbound federation events are still reaching other servers outside the 10-second window. While I have no direct proof that this is the issue at present, the symptoms suggest outbound federation events are backed up, whether simply because there are too many messages to send out (i.e. every "write" interaction = 1 event multiplied by however many federated servers), or worse yet, because previously federated servers have shut down and federation workers must wait for a network timeout before they can proceed with sending the next federation event message.
Describe the solution you'd like.
Federation workers need to be scalable independently of the Lemmy back end, perhaps managed via some sort of message bus/queue. Instead of immediately emitting an outgoing federation event message to the federated server, the event gets put into a message queue (for example, a separate RabbitMQ container). Then a separate, scalable container representing a federation worker, which can be spun up across multiple servers/VMs or simply as multiple instances, fetches the federation event message and delivers it to the intended federated server. This way, larger instances can run a fleet of workers not bound by per-server TCP "half open" limits and scale outbound federation event messages better.
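For illustration only, here is a minimal sketch of what such a decoupled delivery worker could look like. It uses an in-process tokio channel as a stand-in for the external broker (RabbitMQ or similar), and the `QueuedActivity` type, the worker function, and the timeout value are all hypothetical rather than Lemmy's actual code:

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[derive(Clone, Debug)]
struct QueuedActivity {
    inbox_url: String, // destination server's shared inbox
    body: String,      // pre-signed ActivityPub JSON payload
}

// A single delivery worker: pulls activities off the queue and POSTs them.
// With an external broker, many of these could run in separate containers.
async fn delivery_worker(mut rx: mpsc::Receiver<QueuedActivity>) {
    // Short timeout so one dead peer cannot stall this worker for long.
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(10))
        .build()
        .expect("failed to build http client");

    while let Some(activity) = rx.recv().await {
        let res = client
            .post(&activity.inbox_url)
            .header("content-type", "application/activity+json")
            .body(activity.body.clone())
            .send()
            .await;
        if let Err(e) = res {
            // A real worker would re-queue with backoff instead of just logging.
            eprintln!("delivery to {} failed: {e}", activity.inbox_url);
        }
    }
}

#[tokio::main]
async fn main() {
    // Stand-in for the external message queue proposed above.
    let (tx, rx) = mpsc::channel::<QueuedActivity>(10_000);
    let worker = tokio::spawn(delivery_worker(rx));

    // The Lemmy back end would enqueue here instead of sending directly.
    tx.send(QueuedActivity {
        inbox_url: "https://example.com/inbox".into(),
        body: "{}".into(),
    })
    .await
    .ok();

    drop(tx); // closing the sender lets the worker drain and exit
    worker.await.ok();
}
```

With a real broker, `main` would disappear: the back end would publish to the broker, and any number of worker containers could run the consume loop independently, which is the independent scaling this proposal is after.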
Describe alternatives you've considered.
As a temporary solution, increasing the 10-second expiration window might help these larger communities scale a little more before we need more advanced solutions such as the one described above.
Additional context
No response
I see a change in activitypub-federation-rust (line 70) which may increase the signature duration to 5 minutes. This change is welcome and should help alleviate the current delay issue. I don't know whether it will suffice if the bulk of Reddit were to migrate to Lemmy, so I'll leave this issue open for now, and we can re-evaluate when we see more growth of the network.
I've been building messaging systems professionally since 1985, including extensive work with near real-time transactions using PostgreSQL as a backend.
I agree that it needs to shift away from directly spawning tasks and move to a queue system, like an e-mail MTA would use. This applies to both incoming and outgoing: while this discussion has focused primarily on outgoing so far, incoming federation is also overwhelming the database and degrading things for interactive website and API users.
As I understand it, every single new comment and post creates two independent HTTP and database INSERT transactions: one for the content, a second for the Like. The content is also rather inflated, wrapped in a lot of boilerplate ActivityPub syntax. The overhead of this is very high, and without queue management tools for server operators to really see what is happening on their systems, it stays hidden.
For the record, I did start this topic out as a Lemmy discussion posting before this Github issue was opened: https://lemmy.ml/post/1277404
Thank you for chiming in!
I agree, we need queues, and just so I'm clear... I'm not prescribing solutions such as RabbitMQ, just that we'd need a separate independently scalable component. The separation is kind of important in my mind, as it allows for independent maintenance windows and scaling adjustments.
Also, I'm so glad to see I'm not the only one thinking about these! I'll try to add more thoughts on your Lemmy post instead of cluttering up here!
@huanga check out my proposal too: https://github.com/LemmyNet/lemmy/issues/3245
I think the best way to help this issue is to reduce the exponential growth of federation traffic that comes from using a full-mesh topology. It just doesn't scale. And growth is not going to slow down anytime soon.
Another solution I thought about: say an instance has to send an update to 10k other federated instances. It tells each of a set of forward nodes, "OK, instance #1, instance #2, ..., instance #10: this is the update you have to forward to those 10 instances."
In this way the main instance doesn't need to make 10k HTTP requests but only 1k in this case, and the instances it designates as "forward nodes" can do the same thing with their own federated instances, or can send everything directly, depending on some logic like a threshold (if > 1000, split; if < 1000, send). So you get something like job scheduling between instances.
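For illustration, here is a small sketch of that threshold logic in plain Rust; the names (`Plan`, `forward_plan`) and the numbers are made up, and it only shows the planning step, not delivery or signature handling:

```rust
// Hypothetical sketch of the "forward node" split logic described above.

/// What the originating instance decides to do with its list of target inboxes.
enum Plan {
    /// Few enough targets: deliver to every inbox directly.
    SendDirectly(Vec<String>),
    /// Too many targets: hand each forward node a batch to deliver on our behalf.
    Delegate(Vec<Vec<String>>),
}

fn forward_plan(targets: Vec<String>, threshold: usize, batch_size: usize) -> Plan {
    if targets.len() <= threshold {
        Plan::SendDirectly(targets)
    } else {
        // Each chunk goes to one forward node, which then fans out further
        // (or applies the same threshold logic recursively).
        let batches = targets.chunks(batch_size).map(|c| c.to_vec()).collect();
        Plan::Delegate(batches)
    }
}

fn main() {
    // 10_000 federated targets, threshold 1_000, batches of 10:
    // the origin makes ~1_000 requests instead of 10_000.
    let targets: Vec<String> = (0..10_000)
        .map(|i| format!("https://instance{i}.example/inbox"))
        .collect();
    match forward_plan(targets, 1_000, 10) {
        Plan::SendDirectly(t) => println!("send directly to {} inboxes", t.len()),
        Plan::Delegate(batches) => println!("delegate {} batches to forward nodes", batches.len()),
    }
}
```

One caveat with any forwarding scheme is that the final recipients still need a way to verify that a forwarded activity genuinely comes from the origin instance.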
Hey! Since no one has mentioned it I wanted to say that the outgoing federation logic has been replaced with a different, hopefully better queue here: https://github.com/LemmyNet/activitypub-federation-rust/pull/42
Instead of immediately emitting an outgoing federation event message to the federated server
With debug mode off it was already a (suboptimal) queue in 0.17, and in 0.18 it should be a pretty efficient (in-memory) queue.
I'd also say that right now there's (probably) a lot of low-hanging fruit still to be picked before moving to much more complex systems that involve horizontal scaling across multiple servers. Of course that also has to happen at some point, but it's a huge amount of effort compared to the improvements that can still be made to single-instance scaling.
If you just move to an external messaging queue you're not going to actually reduce the amount of work that has to be done per time unit. Rust will already reach the maximum scalability you can get out of one server with an in-memory queue (if written well).
I think the problems are mostly:
- Server operators can barely tell what's happening (partially because the code doesn't report much about what it's doing).
- The code still has low-hanging fruit that should be picked before adding more new solutions. Many of the things server operators are reporting can't be explained just by the server hardware and general architecture being limiting. You can easily do 100+ inserts per second into a single table in PostgreSQL; if inserts take 0.4 seconds on average something is very wrong somewhere else. Think of the long-duration locks from the hot_rank batch updates that locked all comments at the same time, or some part being single-threaded just because someone forgot to turn off the debug flag for weeks.
- ActivityPub in general just scales horribly, and I don't think there's a real way around this. If you have 100 federated servers, then for a single post with 100 comments, with 100 votes each, you get around 100 * 100 * 100 = 1 million outgoing POST requests!! And 10k incoming requests, with at least one SQL query each, for each of the servers. The main things that could be done here are 1. sending votes as aggregates (which makes faking votes much easier) and 2. prioritizing the outgoing queue so that once it grows past a certain size it starts dropping activities based on priority (dropping vote updates should reduce the load by at least 10x from what I understand; see the sketch after this list).
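To make the "drop by priority once the queue grows too large" idea concrete, here is a minimal sketch, not Lemmy's actual queue: a bounded structure that evicts the lowest-priority entries, votes first, when it passes its size limit. The `Kind` variants and the capacity are illustrative:

```rust
// Hypothetical sketch of a bounded, priority-aware outgoing activity queue.
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Kind {
    Vote = 0,    // cheapest to lose
    Comment = 1,
    Post = 2,    // most important to deliver
}

struct OutgoingQueue {
    // Min-heap on (priority, seq): the lowest-priority, oldest entry pops first.
    heap: BinaryHeap<Reverse<(Kind, u64, String)>>,
    capacity: usize,
    seq: u64,
}

impl OutgoingQueue {
    fn new(capacity: usize) -> Self {
        Self { heap: BinaryHeap::new(), capacity, seq: 0 }
    }

    fn push(&mut self, kind: Kind, inbox_url: String) {
        self.seq += 1;
        self.heap.push(Reverse((kind, self.seq, inbox_url)));
        // Over capacity: drop the lowest-priority activity (votes go first).
        while self.heap.len() > self.capacity {
            let Reverse((dropped, _, url)) = self.heap.pop().unwrap();
            eprintln!("queue full, dropping {:?} for {}", dropped, url);
        }
    }
}

fn main() {
    let mut q = OutgoingQueue::new(2);
    q.push(Kind::Vote, "https://a.example/inbox".into());
    q.push(Kind::Post, "https://b.example/inbox".into());
    q.push(Kind::Comment, "https://c.example/inbox".into());
    // With capacity 2, the Vote entry is the one that gets dropped.
}
```

A real queue would also need send ordering and retry/backoff handling; this only demonstrates the eviction policy.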
Just as another example of low hanging fruit that we should pick before looking at exciting and complex new systems, in 0.17 the PEM key was loaded and parsed for every outgoing http request (so basically a million times for a single popular post). The rust code also does a ton of unnecessary sql selects that it then doesn't even use.
So I'd encourage everyone to work on improving the plumbing (and tooling) before thinking about adding new complexity.
if inserts take 0.4 seconds on average something is very wrong somewhere else.
This was a mistake, it's less than 1 ms. The units were read wrong, sorry.
Server operators can barely tell what's happening (partially because the code doesn't report much what it's doing)
Yes. I have so many questions about what is going on inside these busy servers. Is the 10-second HTTP timeout (REQWEST_TIMEOUT) I'm seeing actually being hit? Etc.
The rust code also does a ton of unnecessary sql selects that it then doesn't even use.
Can we get started on this? Where should that happen, here on GitHub?
With incoming federation being the primary activity on my server, and no active local users, this is the most-run query on the system:
SELECT "person"."id", "person"."name", "person"."display_name", "person"."avatar", "person"."banned", "person"."published", "person"."updated", "person"."actor_id", "person"."bio", "person"."local", "person"."private_key", "person"."public_key", "person"."last_refreshed_at", "person"."banner", "person"."deleted", "person"."inbox_url", "person"."shared_inbox_url", "person"."matrix_user_id", "person"."admin", "person"."bot_account", "person"."ban_expires", "person"."instance_id" FROM "person" WHERE (("person"."deleted" = $1) AND ("person"."actor_id" = $2)) LIMIT $3
Has anyone found where in the code incoming federation does this lookup? Thank you.
@RocketDerp I am going to assume it's being called from this code: https://github.com/LemmyNet/lemmy/blob/main/crates/db_views_actor/src/person_view.rs
And I am going to assume that when federation messages come in for new comments/posts/edits/etc., the database needs to look up the actor ID to translate it into a local person ID.
I'd say the best way to mitigate that load is to bundle Redis with Lemmy and use it to cache actor_ids and other IDs, taking that load off the database.
Especially for instances processing hundreds of thousands of federation messages in a short time. Redis scales well and could excel at that. It's very easy to get up and running too.
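As a rough sketch of that caching idea (not a proposal for Lemmy's actual code), here is what the lookup path could look like with the redis crate; the key layout, the TTL, and the `lookup_person_id_in_db` stand-in are all made up:

```rust
// Hypothetical sketch of caching actor_id -> local person id lookups in Redis.
use redis::Commands;

/// Stand-in for the SELECT ... FROM "person" WHERE actor_id = $2 query quoted above.
fn lookup_person_id_in_db(actor_id: &str) -> Option<i64> {
    let _ = actor_id;
    Some(42) // pretend the database found the person
}

fn person_id_for_actor(
    con: &mut redis::Connection,
    actor_id: &str,
) -> redis::RedisResult<Option<i64>> {
    let key = format!("person_id_by_actor:{actor_id}");

    // Fast path: answer straight from the cache, skipping PostgreSQL entirely.
    if let Some(id) = con.get::<_, Option<i64>>(&key)? {
        return Ok(Some(id));
    }

    // Slow path: hit the database once, then cache the result for an hour.
    let id = lookup_person_id_in_db(actor_id);
    if let Some(id) = id {
        let _: () = con.set_ex(&key, id, 3600)?;
    }
    Ok(id)
}

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;
    let id = person_id_for_actor(&mut con, "https://example.com/u/alice")?;
    println!("resolved person id: {id:?}");
    Ok(())
}
```

The same pattern would also work with a simple in-process cache if pulling in Redis feels too heavy, which is roughly the point made in the next comment.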
I don't know if we'd necessarily need Redis at this point, since the query should be fairly light for now... probably six figures' worth of person ID records, which Postgres should be able to handle easily. However, long term, Redis would be very useful as a query cache layer for all frequent queries, especially busy threads on public instances.
Back to the topic of the post title: federation queues.
I recommend that the project immediately add an admin API exposing some of the internal state and statistics. It's currently getting dumped into the system log, but I think it should be put in a form that is accessible to the web apps for server admins, and some (if not all?) of it could even be exposed to peers/anyone.
Starting with this data structure: https://github.com/LemmyNet/activitypub-federation-rust/blob/325f66ba324037a4f1d330a0dbea6e062ba34f50/src/activity_queue.rs#L265
I would add a timestamp to that structure recording exactly when the hourly scheduled purge task ran. In that hourly task, the data could also be written to an admin_monitor table in the database, so it lives somewhere other than just the system log.
I think "accessible logging" is important, even just keeping statistics on how many outbound and inbound federation activities are taking place per peer instance. Write to sequential text files? A second database like SQLite to offload from PostgreSQL? Something that server operators can study for long-term capacity planning and debugging, plus some webpage/API client screens so they can see what the server is doing.
Personally I like the SQLite option, as we can query it and build a /server_api endpoint that outputs JSON from queries about queue status, federation activity to peers, etc.
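As a sketch of the "write the hourly stats somewhere queryable" idea, here is what the SQLite option could look like with the rusqlite crate; the table name, columns, and values are hypothetical:

```rust
// Hypothetical sketch: persist hourly federation stats to a local SQLite file
// so operators can query history without touching PostgreSQL.
use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open("federation_stats.db")?;

    // One row per peer instance per hourly snapshot.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS federation_stats (
            recorded_at TEXT    NOT NULL,
            peer        TEXT    NOT NULL,
            outbound    INTEGER NOT NULL,
            inbound     INTEGER NOT NULL
        )",
        [],
    )?;

    // In Lemmy this would run inside the hourly scheduled task, once per peer.
    conn.execute(
        "INSERT INTO federation_stats (recorded_at, peer, outbound, inbound)
         VALUES (datetime('now'), ?1, ?2, ?3)",
        params!["example-peer.social", 1234, 987],
    )?;

    Ok(())
}
```

A /server_api style endpoint could then just run SELECTs against this file and serialize the rows as JSON.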
I think since the main devs are pretty overloaded, your best bet will be to start with small, "inoffensive" PRs and go from there, e.g. just add a really minimal admin API endpoint with the stats you mentioned; I think they would merge that.
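As a sketch of what such a minimal endpoint might look like (Lemmy already uses actix-web, but the route path, field names, and counter plumbing here are all invented, loosely modeled on the Stats struct linked above):

```rust
// Hypothetical sketch of a read-only admin stats endpoint for the federation queue.
use std::sync::atomic::{AtomicU64, Ordering};

use actix_web::{get, web, App, HttpServer, Responder};
use serde::Serialize;

/// Counters the federation queue would update as it works.
#[derive(Default)]
struct QueueCounters {
    pending: AtomicU64,
    running: AtomicU64,
    retries: AtomicU64,
    dead: AtomicU64,
}

#[derive(Serialize)]
struct QueueStats {
    pending: u64,
    running: u64,
    retries: u64,
    dead: u64,
}

#[get("/api/v3/admin/federation_queue")]
async fn federation_queue(counters: web::Data<QueueCounters>) -> impl Responder {
    // Snapshot the counters and hand them back as JSON for admin tooling.
    web::Json(QueueStats {
        pending: counters.pending.load(Ordering::Relaxed),
        running: counters.running.load(Ordering::Relaxed),
        retries: counters.retries.load(Ordering::Relaxed),
        dead: counters.dead.load(Ordering::Relaxed),
    })
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    let counters = web::Data::new(QueueCounters::default());
    HttpServer::new(move || App::new().app_data(counters.clone()).service(federation_queue))
        .bind(("127.0.0.1", 8536))?
        .run()
        .await
}
```

In the real codebase the counters would of course be populated by the federation queue itself rather than sitting at zero.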
Two version 0.18.0 servers are showing major loss of comments; replication has been failing hard over the past 36 hours. I posted a side-by-side comparison of the two servers. This is a crisis: https://github.com/LemmyNet/lemmy/issues/3101#issuecomment-1610363932
Should be resolved with https://github.com/LemmyNet/lemmy/pull/3605