
[Bug]: Federation throughput in a synchronous manner is limited to network distance

Open ticoombs opened this issue 11 months ago • 59 comments

Requirements

  • [X] Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • [X] Did you check to see if this issue already exists?
  • [X] Is this only a single bug? Do not put multiple bugs in one issue.
  • [X] Do you agree to follow the rules in our Code of Conduct?
  • [X] Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Summary

Problem: Activities are processed sequentially, but they require external data for validation/querying that does not come with the request. Server B -> A says "here is an activity". That activity can be a like, a comment, or a new post. For a new post, Server A queries the linked URL to show the post metadata (such as subtitle or image).

Every one of these outbound requests that the receiving server makes is:

  • Sequential (every request must happen in order: 1, 2, 3, 4, ...)
  • Blocking: Server B, which sent a message to Server A, must wait for Server A to say "I'm finished" before sending the next item in the queue.
  • Subject to network latency (20ms to 600ms)
    • Australia to NL is 278ms (round trip 556ms)
    • NL to LA is 145ms (round trip 290ms)
    • I picked NL because it is geographically, and literally, on the other side of the world from Australia. This is (one of), if not the, longest route between two Lemmy servers.

Actual Problem

So every activity that results in a remote fetch delays subsequent activities. If incoming activities arrive faster than 1 per 0.6s, servers physically cannot and will never be able to catch up. As such, our decentralised solution to a problem requires a low-latency solution. Without intervention this will effectively ensure that every server has to exist in only one region: EU or NA or APAC (etc.) (or nothing will exist in APAC, and that will make me sad). To combat this we need to streamline activities and how Lemmy handles them.
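
As a back-of-the-envelope check of that ceiling, using the AU ↔ NL round trip listed above (before adding any processing time on top):

$$
\text{max sequential throughput} \approx \frac{1}{\text{RTT}} = \frac{1}{0.556\,\text{s}} \approx 1.8\ \text{activities per second}
$$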

Steps to Reproduce

  1. Have a Lemmy server in NL send activities faster than 1 request every 0.6 seconds to a Lemmy server in Australia.
  2. New Post activities affect activity processing the most / take the longest, so sending those helps validate the PoC.

Technical Details

Trace 1:

Lemmy has to verify that a user is valid, so it connects to their server for information. AU -> X (0.6s) + time for the server to respond = 2.28s, and that is all that happened.

- 2.28s receive:verify:verify_person_in_community: activitypub_federation::fetch: Fetching remote object http://server-c/u/user
- request completes and closed connection

Trace 2:

Similar to the previous trace, but after it verified the user, it then had to do another from_json request to the instance itself. (No caching here?) As you can see, the 0.74s ends up being the server on the other end responding very quickly (0.14s), while the handshake + travel time eats up the rest.

- 2.58s receive:verify:verify_person_in_community: activitypub_federation::fetch: Fetching remote object http://server-b/u/user
- 0.74s receive:verify:verify_person_in_community:from_json: activitypub_federation::fetch: Fetching remote object http://server-b/
- request continues

Trace 3:

Fetching external content. I've seen external servers take upwards of 10 seconds to return data, especially because whenever a fediverse link is shared, every server refreshes its own data. As such you basically create a mini-DoS when you post something.

- inside a request already
- 4.27s receive:receive:from_json:fetch_site_data:fetch_site_metadata: lemmy_api_common::request: Fetching site metadata for url: https://example-tech-news-site/bitcoin-is-crashing-sell-sell-sell-yes-im-making-a-joke-here-but-its-still-a-serious-issue-lemmy-that-is-not-bitcoin

Trace 4:

Sometimes a lemmy server takes a while to respond for comments.

- 1.70s receive:community: activitypub_federation::fetch: Fetching remote object http://server-g/comment/09988776

Version

0.19.3

Lemmy Instance URL

No response

ticoombs avatar Mar 13 '24 10:03 ticoombs

Just collecting some further points from discussions on Matrix:

  • The core issue is that even if Lemmy did absolutely no processing for incoming events, a ping of ~300ms still limits us to ~3 incoming activities per second. This is right on the edge of what lemmy.world is generally sending out, so any kind of extra delays on top of this (or growth from lemmy.world) will guarantee falling behind.
  • Sending/receiving activities in parallel without any further logic (as was done before 0.19) can speed things up, but will make federation unreliable once again (as it was before 0.19) - different instances will be very likely to show different versions of posts, comments, votes, etc, as all edits and vote changes will not propagate in the correct order through the network.
  • Splitting up the queue on the sending side (for example, by the community the activity is in) could improve the situation on the receiving side, but would add a lot of overhead on the sending side
  • Batching activities together would be ideal, as it would allow us to maintain correct ordering, but it appears that ActivityPub does not support such a thing at the moment
  • @phiresky mentioned the option of switching to unreliable federation (as in parallelized, without trying to guarantee the correct order of activities) whenever the sender detects they are falling behind - this is perhaps the quickest band-aid solution, with the trade-off of still creating opportunities for instances to go out of sync

sunaurus avatar Mar 13 '24 10:03 sunaurus

Here are some potential solutions.

  • Apparently our HTTP client doesn't keep connections alive. By enabling that and other options we can get rid of some overhead (see the sketch after this list). Relevant code and documentation.
  • Lemmy processes incoming activities immediately, which can be slow when additional data needs to be fetched. Mastodon and other platforms instead put received activities in a queue for later processing. We could consider doing the same.
  • A major fraction of all activities are probably votes. We could mark votes (or only comment votes) as low priority in the outgoing federation queue. Then if federation falls behind, drop some of the vote activities. Similar to https://github.com/LemmyNet/lemmy/issues/4527 but it could probably be automatic.
  • Batching multiple activities into a single request would be ideal, but I'm not aware of any previous work in this direction. It would be good to open an FEP to discuss this and get input from other Fediverse devs.
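
Regarding the first point: if connection reuse really is part of the overhead, the relevant knobs live on reqwest's ClientBuilder. A minimal sketch (the function name and the values are illustrative, not Lemmy's actual configuration):

```rust
use std::time::Duration;
use reqwest::Client;

// Illustrative sketch only: reqwest pools connections by default, so this mainly
// makes the keep-alive/pooling behaviour explicit and a bit more generous.
fn build_federation_client() -> reqwest::Result<Client> {
    Client::builder()
        .pool_idle_timeout(Duration::from_secs(90)) // keep idle connections warm longer
        .pool_max_idle_per_host(4)                  // allow a few pooled connections per instance
        .tcp_keepalive(Duration::from_secs(60))     // keep the underlying TCP socket alive
        .timeout(Duration::from_secs(10))           // avoid slow remote instances stalling the queue
        .build()
}
```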

Nutomic avatar Mar 13 '24 11:03 Nutomic

A solution from brainstorming on Matrix: parallelize sending requests.

  • Pick up the last X requests needing to be sent to the target server, where X = 10 (with 10 adjustable), or perhaps more dynamically X = ceil(queue_size / 1000) (with 1000 adjustable). Currently X == 1.
  • For every request we pick up, split it into a different thread per parent post, or add it to the "parentless queue". So let's say I pick these:
activity                      | parent     | ID
post A                        | post A     | 1
comment A                     | post A     | 2
vote A                        | post A     | 3
vote B                        | post B     | 4
ban user                      | parentless | 5
edit comment A (1st attempt)  | post A     | 6
edit comment A (2nd attempt)  | post A     | 7
edit comment B                | post B     | 8

We end up with the following sending queues

  • Queue A ids: 1, 2, 3, 6, 7
  • Queue B ids: 4, 8
  • Queue C ids: 5

Now each queue, parallel to and independent from the other queues, follows the existing logic to send its activities to the target instance. So queue A will send 1, then 2, then 3, etc. If any send fails, the queue aborts everything subsequent. So if queue A failed to send ID 3, it aborts, so that 3, 6 and 7 remain in the overarching queue to send.

However, the failure of 3 will not stop 4, 8 and 5 from going through.

Now on the next iteration of sending, 1, 2, 4, 8 and 5 are gone, so the next queues will pick up 3, 6 and 7 to send, along with 7 other IDs, and split them into individual queues.

This would allow an instance to send 1-n requests in parallel without ever running the risk of sending them in the wrong chronological order.
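
A rough sketch of the grouping step (the `QueuedActivity` type and field names here are made up for illustration; this is not existing Lemmy code):

```rust
use std::collections::HashMap;

// Hypothetical minimal model of an activity waiting in the outgoing queue.
struct QueuedActivity {
    id: i64,
    parent_post: Option<i64>, // None = "parentless" (bans, instance-level actions, ...)
}

/// Split a batch of activities into per-parent sending queues.
/// Activities sharing a parent post stay in the same queue (preserving their order);
/// parentless activities end up together in the `None` queue.
fn split_into_queues(batch: Vec<QueuedActivity>) -> HashMap<Option<i64>, Vec<QueuedActivity>> {
    let mut queues: HashMap<Option<i64>, Vec<QueuedActivity>> = HashMap::new();
    for activity in batch {
        queues.entry(activity.parent_post).or_default().push(activity);
    }
    queues
}
```

Each resulting queue can then be sent concurrently; a failure only halts its own queue, and the unsent IDs stay in the overarching queue for the next iteration.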

db0 avatar Mar 13 '24 13:03 db0

@db0 In particular we can assign activities to a specific queue by post_id: queue_id = post_id modulo 10 (or alternatively using community_id).

Nutomic avatar Mar 13 '24 14:03 Nutomic

You mean so that you have different posts per queue? Sure, that could work as well. I like the idea of using the exact post ID myself, for easier troubleshooting.

db0 avatar Mar 13 '24 15:03 db0

Apparently our HTTP client doesnt keep connections alive

Do you have a source on that? As far as I'm aware it does keep them alive and thus for replication has one persistently open connection.

Other than that, I'll restate what I wrote on Matrix:

I think we can solve this in a fairly simple way without any robustness issues:

Instead of sending one activity, waiting for the 200 response, then sending the next activity, we can instead send multiple activities (sequentially down the same HTTP connection) and only start waiting for a response when 8 are currently in transit.

The only thing unclear to me is how this can be implemented. On tokio's side using a channel with .buffered(8) on the receiving end should work, but I'm not sure how to make reqwest send multiple requests while still ensuring they are in the same sequential connection.

This way, the internet latency does not matter as long as the parallel number chosen (8) is greater than the internet latency divided by the internal processing time. E.g. if ping is 320ms and internal processing is 10ms, then 32 requests would need to be in the queue at once to not slow things down.

Note that this does not change anything about the internal latency of an instance, that would still be sequential per receiving instance. But it removes all the latency outside of the Rust process.
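
A minimal sketch of the sending side using futures' `buffered` (the function and types are illustrative; note that reqwest manages its own connections, so this limits how many requests are in flight rather than guaranteeing one sequential connection):

```rust
use futures::stream::{self, StreamExt};
use reqwest::Client;

// Illustrative only: keeps up to 8 POSTs in flight to one inbox and checks the
// responses in submission order, so a failure stops everything after it.
async fn send_with_window(client: &Client, inbox: &str, bodies: Vec<String>) -> reqwest::Result<()> {
    let mut responses = stream::iter(bodies)
        .map(|body| client.post(inbox).body(body).send())
        .buffered(8); // at most 8 requests in transit at once

    while let Some(res) = responses.next().await {
        res?.error_for_status()?; // bail on the first failed activity
    }
    Ok(())
}
```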

phiresky avatar Mar 14 '24 11:03 phiresky

Maybe we can use something like this to get notified when the entire activity is sent, and then send out the next one. However, that would require changes inside the federation library and exposing it in the public API. It would be easier and probably sufficient in practice if we simply wait e.g. 100ms between requests.

Nutomic avatar Mar 14 '24 11:03 Nutomic

@phiresky The only thing unclear to me is how this can be implemented. On tokio's side using a channel with .buffered(8) on the receiving end should work, but I'm not sure how to make reqwest send multiple requests while still ensuring they are in the same sequential connection.

Is this HTTP Response Streaming (Transfer-Encoding: chunked) or batching multiple activities into one request [{...}, {...}, ...] ?

For chunked these seem to be relevant in Reqwest: https://docs.rs/reqwest/latest/reqwest/struct.Body.html#method.wrap_stream https://docs.rs/reqwest/latest/reqwest/struct.Response.html#method.bytes_stream

Also a good note is that HTTP/2 always streams natively (it doesn't use chunked encoding), so the above applies only to HTTP/1.1.

wereii avatar Mar 14 '24 11:03 wereii

, but I'm not sure how to make reqwest send multiple requests while still ensuring they are in the same sequential connection.

That is exactly the problem with parallel requests in isolation and it's non-trivial to solve, which is why I suggested a more robust split-queue

db0 avatar Mar 14 '24 12:03 db0

Why is it non-trivial to solve? if the library provides an option to do it, it's trivial to solve. the only problem is that by default it apparently uses HTTP/2 multiplexing, which is not in-order

phiresky avatar Mar 14 '24 12:03 phiresky

if the library provides an option to do it it's trivial to solve

Does the library provide an option then?

db0 avatar Mar 14 '24 12:03 db0

Is this HTTP Response Streaming

It's HTTP pipelining (diagram omitted).

The docs you sent seem to be about responses, but here we really only care about request and request body

phiresky avatar Mar 14 '24 12:03 phiresky

This helps you send multiple requests, not ensure they're processed in order which is the non-trivial part

db0 avatar Mar 14 '24 12:03 db0

they are guaranteed to be sent in order and received in order so processing them in order should not be hard

phiresky avatar Mar 14 '24 12:03 phiresky

so processing them in order should not be hard

famous last words :D

I don't think HTTP pipelining ensures that the response you get will be 200 before sending the next request, only that there will be a response. How does it guarantee, for example, that a request with ID 1 won't take too long to process before ID 2 hits the target?

db0 avatar Mar 14 '24 12:03 db0

on the receiving side the requests can still be processed fully sequentially, no need for parallelism. note this only solves the problem in this ticket and not the one you had with internal latency

phiresky avatar Mar 14 '24 12:03 phiresky

Very important, from the Wikipedia article on HTTP pipelining, and from its linked source (screenshots of the relevant warnings omitted):

If I'm not mistaken, the POSTs to the inbox are not idempotent. A connection loss while sending will cause issues.

db0 avatar Mar 14 '24 12:03 db0

it does look like reqwest doesn't really have options for this, probably since http1.1 pipelining is mostly seen as superseded by http2 multiplexing, which is better in most ways except that it doesn't guarantee any sequentiality. so probably we should just use normal http2 multiplexing (which should just work with .buffered())

I think we can semi-ignore the sequentiality problem (e.g. just add a 10ms delay between sends) though because:

If an instance is catching up, and processing e.g. 100ms per activity, and we have a parallelity of 8, then while the first 8 events will be sent simultaneously and potentially processed out of order, after that each next request will wait for the OK response 8 requests earlier, which means it will have an average delay of 10ms after the previous activity.

phiresky avatar Mar 14 '24 12:03 phiresky

If you don't ensure sequential processing, you can end up with previous edits or votes overriding newer edits/votes if they are in the same pipeline. Likewise, a connection loss in the middle of a request will cause issues which you can't resolve by resending, since the requests are not idempotent.

db0 avatar Mar 14 '24 12:03 db0

Another alternative that's still simpler and has less overhead than adding per-thread or per-community queues: add a field to each JSON activity or HTTP request header with the sequential id. On the receiving side, add (if the number is given) a priority queue to which events are added; an event is processed immediately if its sequential id was incremented by 1, and with a max delay of 10s or so if it was incremented by more than 1.

phiresky avatar Mar 14 '24 12:03 phiresky

Anything which extends the apub protocol will not be handled properly by other software.

db0 avatar Mar 14 '24 12:03 db0

it's a compatible change so it doesn't affect anything else

phiresky avatar Mar 14 '24 12:03 phiresky

What do you mean by "compatible change"?

db0 avatar Mar 14 '24 12:03 db0

it just adds a header or json field that will be ignored if not understood.

pseudo code:

post(/inbox).handle(activity):
     queues[activity.instance].add(priority=activity.sequence_id, activity)

def queue_process():
     queue = queues["lemmy.world"]
     last_id = 0
     while true:
         # process immediately if the next expected sequence id is at the front,
         # otherwise wait up to 10s before processing out of order
         if queue.peek().sequence_id != last_id + 1: sleep 10s
         activity = queue.take()
         process(activity)
         last_id = activity.sequence_id

phiresky avatar Mar 14 '24 12:03 phiresky

That's my point. If you rely on something that can be ignored, then non-Lemmy software (kbin, sublinks, Mastodon, PieFed, etc.) can experience significant issues, such as bad edits or missing content when problems occur.

db0 avatar Mar 14 '24 12:03 db0

but i have to say, since completely ignoring activity sequence in a much much worse way than the thing above is how all of lemmy worked before 0.19, i don't think this minor sequencing issue is really going to cause trouble. on high-volume instances with a parallelity of like 10 it's very unlikely that a conflicting event is <10 events away from another conflicting one, because you don't edit a post 10 times per second, and on less active instances it's also not going to be an issue because the parallelity doesn't actually have to come into effect

phiresky avatar Mar 14 '24 12:03 phiresky

You can't know that though. We can't foresee all potential issues, but we know that this approach has potential pitfalls from simple connection problems. I don't see the point of refactoring to a method which includes known risks and does exactly what the designers of HTTP pipelining tell you not to do.

PS: I am known to hit the wrong vote and correct it within the same second. Sometimes I even set whole posts as stickies through fat fingers.

db0 avatar Mar 14 '24 12:03 db0

i see what you're saying, though what i wrote is now unrelated to http pipelining. imo we can do the simple solution and if it actually causes problems we can still add the sequence enforcing later. the first is a prerequisite for the second anyways, so implementing it is not a waste even if it'll cause (unlikely) issues

phiresky avatar Mar 14 '24 12:03 phiresky

splitting by community or post sounds neat but it's pretty complicated and sounds like the kind of thing that's very hard to tune to balance the overhead, and the kind of thing where instance admins have no idea how to tune it and complain that lemmy has gotten slower because suddenly it creates 1000 queues per instance instead of 1 and you can't even tell what those config parameters should be without trying it out in production

phiresky avatar Mar 14 '24 13:03 phiresky

I agree we should go for simpler solutions if possible. No point overengineering something which may not even work in the end. The activity sequence id makes sense in this way. It would be good to write a short FEP about it as it will also be relevant for other projects.

Nutomic avatar Mar 14 '24 13:03 Nutomic