lemmy
[Bug]: Federation throughput in a synchronous manner is limited to network distance
Requirements
- [X] Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
- [X] Did you check to see if this issue already exists?
- [X] Is this only a single bug? Do not put multiple bugs in one issue.
- [X] Do you agree to follow the rules in our Code of Conduct?
- [X] Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.
Summary
Problem: Activities are processed sequentially, but they require external data to be validated/queried that doesn't come with the request. Server B sends Server A an activity, which can be a like, comment, or new post. For a new post, Server A must query the post itself in order to show its metadata (such as subtitle or image).
Every one of these outbound requests that the receiving server makes is:
- Sequential (every request must happen in order: 1, 2, 3, 4...)
- Blocking: Server B, which sent a message to Server A, must wait for Server A to say "I'm finished" before sending the next item in the queue.
- Subject to network latency (20ms to 600ms):
- Australia to NL is 278ms (round trip 556ms)
- NL to LA is 145ms (round trip 290ms)
- I picked NL because it is, geographically and literally, on the other side of the world from Australia. This is one of, if not the, longest routes between two lemmy servers.
Actual Problem
So every activity that results in a remote fetch delays all subsequent activities. If activities arrive at a rate of more than 1 per 0.6s, the server physically cannot and will never be able to catch up. As such, our decentralised solution now requires low latency. Without intervention this will effectively force every server to exist in only one region: EU, NA, APAC, etc. (or nothing will exist in APAC, and that will make me sad). To combat this, we need to streamline activities and how lemmy handles them.
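The arithmetic above can be sketched as a quick sanity check (the helper name is illustrative; the round-trip times are the example figures from this issue):

```python
def max_activities_per_second(round_trip_s: float) -> float:
    """Upper bound on throughput of a fully sequential, blocking queue:
    even with zero processing time, at most one activity per round trip."""
    return 1.0 / round_trip_s

# Round trips quoted above
print(f"AU<->NL: {max_activities_per_second(0.556):.2f} activities/s")  # ~1.80
print(f"NL<->LA: {max_activities_per_second(0.290):.2f} activities/s")  # ~3.45
```

Any instance generating activities faster than this bound will cause the receiver to fall behind indefinitely.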
Steps to Reproduce
- Have a lemmy server in NL send activities faster than 1 request every 0.6 seconds to a lemmy server in Australia.
- New Post activities affect activity processing the most / take the longest, which helps validate the PoC.
Technical Details
Trace 1:
Lemmy has to verify that a user is valid, so it connects to their server for information. AU -> X (0.6s) + time for the server to respond = 2.28s, and that is all that happened in that time.
- 2.28s receive:verify:verify_person_in_community: activitypub_federation::fetch: Fetching remote object http://server-c/u/user
- request completes and closed connection
Trace 2:
Similar to the previous trace, but after it verified the user, it then had to make another `from_json` request to the instance itself. (No caching here?) As you can see, 0.74s ends up being the server on the other end responding very quickly (0.14s), while the handshake + travel time eats up the rest.
- 2.58s receive:verify:verify_person_in_community: activitypub_federation::fetch: Fetching remote object http://server-b/u/user
- 0.74s receive:verify:verify_person_in_community:from_json: activitypub_federation::fetch: Fetching remote object http://server-b/
- request continues
Trace 3:
Fetching external content. I've seen external servers take upwards of 10 seconds to report data, especially because whenever a fediverse link is shared, every server refreshes its own copy of the data. As such, you basically create a mini-DoS whenever you post something.
- inside a request already
- 4.27s receive:receive:from_json:fetch_site_data:fetch_site_metadata: lemmy_api_common::request: Fetching site metadata for url: https://example-tech-news-site/bitcoin-is-crashing-sell-sell-sell-yes-im-making-a-joke-here-but-its-still-a-serious-issue-lemmy-that-is-not-bitcoin
Trace 4:
Sometimes a lemmy server takes a while to respond for comments.
- 1.70s receive:community: activitypub_federation::fetch: Fetching remote object http://server-g/comment/09988776
Version
0.19.3
Lemmy Instance URL
No response
Just collecting some further points from discussions on Matrix:
- The core issue is that even if Lemmy did absolutely no processing for incoming events, a ping of ~300ms still limits us to ~3 incoming activities per second. This is right on the edge of what lemmy.world is generally sending out, so any kind of extra delays on top of this (or growth from lemmy.world) will guarantee falling behind.
- Sending/receiving activities in parallel without any further logic (as was done before 0.19) can speed things up, but will make federation unreliable once again (as it was before 0.19) - different instances will be very likely to show different versions of posts, comments, votes, etc, as all edits and vote changes will not propagate in the correct order through the network.
- Splitting up the queue on the sending side (for example, by the community where the activity is in) could improve the situation on the receiving side, but would add a lot of overhead on the sending side
- Batching activities together would be ideal, as it would allow us to maintain correct ordering, but it appears that ActivityPub does not support such a thing at the moment
- @phiresky mentioned the option of switching to unreliable federation (as in parallelized, without trying to guarantee the correct order of activities) whenever the sender detects they are falling behind - this is perhaps the quickest band-aid solution, with the trade-off of still creating opportunities for instances to go out of sync
Here are some potential solutions.
- Apparently our HTTP client doesn't keep connections alive. By enabling that and other options we can get rid of some overhead. Relevant code and documentation.
- Lemmy processes incoming activities immediately, which can be slow when additional data needs to be fetched. Mastodon and other platforms instead put received activities in a queue for later processing; we could consider doing the same.
- A major fraction of all activities are probably votes. We could mark votes (or only comment votes) as low priority in the outgoing federation queue. Then if federation falls behind, drop some of the vote activities. Similar to https://github.com/LemmyNet/lemmy/issues/4527 but it can probably be automatic.
- Batching multiple activities in a single request would be ideal, but I'm not aware of any previous work in this direction. It would be good to open an FEP to discuss this and get input from other Fediverse devs.
Solution from brainstorming on Matrix: parallelize sending requests
- Pick up the last `X` requests needing to be sent to the target server, where `X = 10` (with `10` adjustable), or perhaps more dynamically `X = ceil(queue_size / 1000)` (with `1000` adjustable). Currently `X == 1`.
- For every request we pick up, split it into a different thread per parent post, or add it to the "parentless queue". So let's say I pick these 10:
activity | parent | ID |
---|---|---|
post A | post A | 1 |
comment A | post A | 2 |
vote A | post A | 3 |
vote B | post B | 4 |
ban user | parentless | 5 |
edit comment A (1 attempt) | post A | 6 |
edit comment A (2 attempt) | post A | 7 |
edit comment B | post B | 8 |
We end up with the following sending queues
- Queue A ids: 1, 2, 3, 6, 7
- Queue B ids: 4, 8
- Queue C ids: 5
Now each queue, parallel to and independent from the other queues, follows the existing logic to send its activities to the target instance. So queue A will send 1, then 2, then 3, etc. If any send fails, the queue aborts everything subsequent. So if queue A failed to send ID 3, it will abort, leaving 3, 6 and 7 in the overarching queue to send.
However, the failure of 3 will not stop 4, 8 and 5 from going through.
On the next iteration of sending, 1, 2, 4, 8 and 5 are gone, so the next queues will pick up 3, 6 and 7 to send, along with 7 other IDs, and split them into individual queues.
This would allow an instance to send 1-n requests in parallel without ever running the risk of sending them in the wrong chronological order.
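A minimal sketch of this partitioning step (the activity shape and function name are illustrative, not actual Lemmy code):

```python
from collections import defaultdict

PARENTLESS = "parentless"

def partition_batch(batch):
    """Split a batch of pending activities into independent sending queues,
    keyed by parent post (or a shared parentless queue). Order within each
    queue matches the original queue order, so per-post ordering is kept."""
    queues = defaultdict(list)
    for activity in batch:
        key = activity.get("parent") or PARENTLESS
        queues[key].append(activity["id"])
    return dict(queues)

# The example batch from the table above
batch = [
    {"id": 1, "parent": "post A"},  # post A
    {"id": 2, "parent": "post A"},  # comment A
    {"id": 3, "parent": "post A"},  # vote A
    {"id": 4, "parent": "post B"},  # vote B
    {"id": 5, "parent": None},      # ban user (parentless)
    {"id": 6, "parent": "post A"},  # edit comment A (1st attempt)
    {"id": 7, "parent": "post A"},  # edit comment A (2nd attempt)
    {"id": 8, "parent": "post B"},  # edit comment B
]

print(partition_batch(batch))
# {'post A': [1, 2, 3, 6, 7], 'post B': [4, 8], 'parentless': [5]}
```

Each resulting queue can then be drained by its own sender, aborting only its own tail on failure.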
@db0 In particular we can assign activities to a specific queue by post_id: `queue_id = post_id % 10` (or alternatively using community_id).
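As a sketch, that assignment rule looks like this (hypothetical helper; 10 queues as in the comment):

```python
def queue_id(post_id: int, num_queues: int = 10) -> int:
    # All activities for the same post land in the same queue,
    # so per-post ordering is preserved across parallel queues.
    return post_id % num_queues

print([queue_id(p) for p in (1, 10, 11, 25)])  # [1, 0, 1, 5]
```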
You mean so that you have different posts per queue? Sure, that could work as well. I like the idea of using the exact post ID myself, for easier troubleshooting.
> Apparently our HTTP client doesn't keep connections alive

Do you have a source on that? As far as I'm aware it does keep them alive, and thus for replication has one persistently open connection.
Other than that, I'll restate what I wrote on Matrix:
I think we can solve this in a fairly simple way without any robustness issues:
Instead of sending one activity, waiting for the 200 response, then sending the next activity, we can instead send multiple activities (sequentially down the same HTTP connection) and only start waiting for a response when 8 are currently in transit.
The only thing unclear to me is how this can be implemented. On tokio's side, using a channel with `.buffered(8)` on the receiving end should work, but I'm not sure how to make reqwest send multiple requests while still ensuring they are in the same sequential connection.
This way, the internet latency does not matter as long as the parallel number chosen (8) is greater than the internet latency divided by the internal processing time. E.g. if ping is 320ms and internal processing is 10ms, then 32 requests would need to be in the queue at once to not slow things down.
Note that this does not change anything about the internal latency of an instance, that would still be sequential per receiving instance. But it removes all the latency outside of the Rust process.
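That break-even condition can be written as a small helper (illustrative only; integer milliseconds are used to avoid float rounding):

```python
def min_in_flight(ping_ms: int, processing_ms: int) -> int:
    """Minimum number of requests that must be in flight so that the
    receiver's internal processing time, not the network round trip,
    becomes the bottleneck (ceiling division)."""
    return (ping_ms + processing_ms - 1) // processing_ms

# The example from the comment: 320ms ping, 10ms internal processing
print(min_in_flight(320, 10))  # 32
```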
Maybe we can use something like this to get notified when the entire activity is sent, and then send out the next one. However, that would require changes inside the federation library and exposing it in the public API. It would be easier, and probably sufficient in practice, if we simply wait e.g. 100ms between requests.
> @phiresky: The only thing unclear to me is how this can be implemented. On tokio's side using a channel with `.buffered(8)` on the receiving end should work, but I'm not sure how to make reqwest send multiple requests while still ensuring they are in the same sequential connection.
Is this HTTP response streaming (`Transfer-Encoding: chunked`) or batching multiple activities into one request (`[{...}, {...}, ...]`)?
For `chunked`, these seem to be relevant in reqwest:
https://docs.rs/reqwest/latest/reqwest/struct.Body.html#method.wrap_stream
https://docs.rs/reqwest/latest/reqwest/struct.Response.html#method.bytes_stream
Also a good note: HTTP/2 basically always does chunked encoding, so the above applies only to HTTP/1.1.
> ...but I'm not sure how to make reqwest send multiple requests while still ensuring they are in the same sequential connection.
That is exactly the problem with parallel requests in isolation, and it's non-trivial to solve, which is why I suggested the more robust split queue.
Why is it non-trivial to solve? If the library provides an option to do it, it's trivial. The only problem is that by default it apparently uses HTTP/2 multiplexing, which is not in-order.
> if the library provides an option to do it it's trivial to solve

Does the library provide an option, then?
> Is this HTTP Response Streaming

It's HTTP pipelining.
The docs you sent seem to be about responses, but here we really only care about the request and request body.
This helps you send multiple requests, but it doesn't ensure they're processed in order, which is the non-trivial part.
They are guaranteed to be sent in order and received in order, so processing them in order should not be hard.
> so processing them in order should not be hard

Famous last words :D
I don't think HTTP pipelining ensures that the response you get will be 200 before the next request is sent, only that there will be a response. How does it guarantee, for example, that a request with ID 1 won't take too long to process before ID 2 hits the target?
On the receiving side the requests can still be processed fully sequentially; no need for parallelism. Note this only solves the problem in this ticket, and not the one you had with internal latency.
Very important, from the article on Wikipedia:
From the linked source:
If I'm not mistaken, the POSTs to the inbox are not idempotent. A connection loss while sending will cause issues.
It does look like reqwest doesn't really have options for this, probably since HTTP/1.1 pipelining is mostly seen as superseded by HTTP/2 multiplexing, which is better in most ways except that it doesn't guarantee any sequentiality. So probably we should just use normal HTTP/2 multiplexing (which should just work with `.buffered()`).
I think we can semi-ignore the sequentiality problem (e.g. just add a 10ms delay between sends), because:
If an instance is catching up and processing e.g. 100ms per activity, and we have a parallelism of 8, then while the first 8 events will be sent simultaneously and potentially processed out of order, after that each next request will wait for the OK response from 8 requests earlier, which means it will have an average delay of 10ms after the previous activity.
If you don't ensure sequential processing, you can end up with situations where previous edits or votes override newer edits/votes if they are in the same pipeline. Likewise, a connection loss in the middle of the request will cause issues which you can't resolve by resending, since you're not idempotent.
Another alternative, still simpler and with less overhead than adding per-thread or per-community queues, would be to add a field to each JSON activity or an HTTP request header with a sequential id. On the receiving side, add (if the number is given) a priority queue to which events are added; an event is processed immediately if the sequential id was incremented by 1, and with a max delay of 10s or so if it was incremented by more than 1.
Anything which extends the ActivityPub protocol will not be handled properly by other software.
It's a compatible change, so it doesn't affect anything else.
What do you mean by "compatible change"?
It just adds a header or JSON field that will be ignored if not understood.
Pseudo code:
post(/inbox).handle(activity):
    queues[activity.instance].add(priority=activity.sequence_id, activity)

def queue_process():
    queue = queues["lemmy.world"]
    last_id = 0
    while true:
        # if the next sequence id skips ahead, wait up to 10s for the gap to fill
        if queue.peek().sequence_id != last_id + 1: sleep(10s)
        activity = queue.take()
        last_id = activity.sequence_id
        process(activity)
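The pseudo code above can be fleshed out into a runnable sketch, using Python's heapq as the priority queue (the per-instance queues and the give-up timeout follow the comment; the activity shape and names are illustrative):

```python
import heapq
import time
from collections import defaultdict

# One priority queue per sending instance, ordered by the proposed sequence id
queues = defaultdict(list)

def handle_inbox(activity):
    """POST /inbox handler: park the activity in its instance's queue."""
    heapq.heappush(queues[activity["instance"]],
                   (activity["sequence_id"], activity))

def queue_process(instance, processed, max_wait_s=10.0):
    """Drain one instance's queue in sequence-id order. If the next id
    skips ahead, wait up to max_wait_s for the gap to fill before giving
    up and processing out of order."""
    queue = queues[instance]
    last_id = 0
    deadline = None
    while queue:
        next_id = queue[0][0]
        if next_id != last_id + 1 and deadline is None:
            deadline = time.monotonic() + max_wait_s
        if next_id != last_id + 1 and time.monotonic() < deadline:
            time.sleep(0.01)  # gap in sequence ids: wait for the missing one
            continue
        deadline = None
        last_id, activity = heapq.heappop(queue)
        processed.append(activity["sequence_id"])

# Activities arriving out of order are still processed in order
for seq in (2, 1, 3):
    handle_inbox({"instance": "lemmy.world", "sequence_id": seq})
order = []
queue_process("lemmy.world", order)
print(order)  # [1, 2, 3]
```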
That's my point. If you rely on something that can be ignored, then non-lemmy software (kbin, Sublinks, Mastodon, PieFed, etc.) can experience significant issues, such as bad edits or missing content when problems occur.
But I have to say: since completely ignoring activity sequence, in a much, much worse way than the above, is how all of lemmy worked before 0.19, I don't think this minor sequencing issue is really going to cause trouble. On high-volume instances with a parallelism of around 10, it's very unlikely that a conflicting event is within 10 events of another conflicting one, because you don't edit a post 10 times per second; and on less active instances it's also not going to be an issue, because the parallelism doesn't actually come into effect.
You can't know that, though. We can't foresee all potential issues, but we know that this approach has potential pitfalls from simple connection problems. I don't see the point of refactoring to a method which includes known risks and does exactly what the designers of HTTP pipelining tell you not to do.
PS: I am known to hit the wrong vote and correct it within the same second. Sometimes I even sticky whole posts through fat fingers.
I see what you're saying, though what I wrote above is now unrelated to HTTP pipelining. IMO we can do the simple solution, and if it actually causes problems we can still add the sequence enforcement later. The first is a prerequisite for the second anyway, so implementing it is not a waste even if it causes (unlikely) issues.
Splitting by community or post sounds neat, but it's pretty complicated, and it sounds like the kind of thing that's very hard to tune to balance the overhead; the kind of thing where instance admins have no idea how to tune it and complain that lemmy has gotten slower, because suddenly it creates 1000 queues per instance instead of 1, and you can't even tell what those config parameters should be without trying it out in production.
I agree we should go for simpler solutions if possible. No point over-engineering something which may not even work in the end. The activity sequence id makes sense in this way. It would be good to write a short FEP about it, as it will also be relevant for other projects.