Lots of CPU time spent sorting broadcast bytes
I'm not sure exactly where the time is being spent, but most of the CPU time in my project is currently going into sorting something related to broadcast messages.
Flamegraph: corro.svg.zip
Looks like it goes like this:
Foca::broadcast → Foca::send_message → Broadcasts::fill → core::slice::sort::recurse → core::slice::sort::partial_insertion_sort → PartialOrd::lt
I have a loop that calls Foca::broadcast every 200ms to ensure maximum gossip propagation of broadcast messages.
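For context, a minimal sketch of such a loop on tokio; `trigger_broadcast` is a made-up closure standing in for however the application actually calls Foca::broadcast, since the exact call depends on the foca version and how the runtime/socket plumbing is wired up:

```rust
use std::time::Duration;

// Periodically trigger gossip dissemination, roughly every 200ms.
// `trigger_broadcast` stands in for the application's call into foca.
async fn gossip_loop(mut trigger_broadcast: impl FnMut()) {
    let mut interval = tokio::time::interval(Duration::from_millis(200));
    loop {
        interval.tick().await;
        trigger_broadcast();
    }
}
```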
Hah I half expected this to come up :grimacing:
The problem is that custom broadcasts use the same mechanism used for cluster broadcasts: it's a pretty dumb sorted array.
SWIM can get by with this inefficient thing because broadcasts are bounded by the cluster size and are tiny (it's just cluster updates, and invalidation is based on member identity). When I was hacking on it, what I wanted was a binary heap with retain capabilities, but I didn't want to jump on nightly and was too lazy to roll my own, so I ended up with what we have now heh
One way forward is making Broadcasts better. The API surface is small and the requirements are clear:
- add_or_replace must be able to remove items based on some invalidation logic
- fill must fill the given byte slice starting with the least-transmitted, largest broadcast (this is what made me want a heap: num_transmissions as the objective, len(broadcast) to break ties; see the sketch below)
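Not foca's actual internals, just a rough sketch of what a heap-backed Broadcasts could look like under those two requirements. The Entry/Broadcasts types and the `invalidates` closure are made up here; the real invalidation logic would come from the BroadcastHandler implementation, and BinaryHeap::retain is stable since Rust 1.70:

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// One queued broadcast: the encoded payload plus how many times it has
// already been transmitted.
struct Entry {
    payload: Vec<u8>,
    transmissions: u32,
}

impl Ord for Entry {
    fn cmp(&self, other: &Self) -> Ordering {
        // BinaryHeap is a max-heap, so "greater" means "send first":
        // fewest transmissions wins, largest payload breaks ties.
        other
            .transmissions
            .cmp(&self.transmissions)
            .then_with(|| self.payload.len().cmp(&other.payload.len()))
    }
}
impl PartialOrd for Entry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Entry {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Entry {}

struct Broadcasts {
    queue: BinaryHeap<Entry>,
}

impl Broadcasts {
    fn new() -> Self {
        Self { queue: BinaryHeap::new() }
    }

    // `invalidates` stands in for whatever invalidation logic the real
    // BroadcastHandler provides (e.g. "same key, newer version").
    fn add_or_replace(&mut self, payload: Vec<u8>, invalidates: impl Fn(&[u8]) -> bool) {
        self.queue.retain(|entry| !invalidates(&entry.payload));
        self.queue.push(Entry { payload, transmissions: 0 });
    }

    // Pack broadcasts into `buf`, highest priority first, bumping each
    // packed entry's transmission counter. Stops at the first entry that
    // doesn't fit (a smarter version would keep scanning for smaller ones).
    fn fill(&mut self, buf: &mut Vec<u8>, max_len: usize) {
        let mut packed = Vec::new();
        while let Some(mut entry) = self.queue.pop() {
            if buf.len() + entry.payload.len() > max_len {
                packed.push(entry);
                break;
            }
            buf.extend_from_slice(&entry.payload);
            entry.transmissions += 1;
            packed.push(entry);
        }
        // Everything goes back into the heap with updated counters.
        for entry in packed {
            self.queue.push(entry);
        }
    }
}
```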
Another would be to disconnect the custom broadcasts from the main interaction (I think I mentioned this one before).
The first alternative is a lot easier, I think.
I don't think I'll be looking at this until at least the holiday season in December, but I'm happy to receive/review pull requests if you wanna take a stab at it :grin:
That makes sense.
I might've noticed this wasn't a problem when I was using a larger max payload size, transmitted via TCP. Would that have made a difference?
For reference: my cluster size is now ~300 nodes.
Another would be to disconnect the custom broadcasts from the main interaction (I think I mentioned this one before).
This might be easier on our end: I could just not use the Broadcast stuff from foca at all and send messages to randomly chosen nodes myself.
I don't think I'll be looking at this until at least the holiday season in December, but I'm happy to receive/review pull requests if you wanna take a stab at it 😁
All good. I just might take a stab at it, yes! I'll have more questions if I do.
I ended up doing my own broadcasts, handling messages with logic similar to the BroadcastHandler trait. I'm using the config's number of probes to know how many I should broadcast to and I use rand's choose_multiple to pick from all active members.
That solved my CPU issues!
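For reference, a minimal sketch of that member selection, assuming rand 0.8 and that the application keeps its own list of active member addresses; `pick_broadcast_targets` and `num_probes` are made-up names here, with `num_probes` mirroring foca's Config::num_indirect_probes:

```rust
use std::net::SocketAddr;

use rand::seq::SliceRandom;

// Pick `num_probes` random targets from the application's own view of
// the active members (assumes rand 0.8).
fn pick_broadcast_targets(members: &[SocketAddr], num_probes: usize) -> Vec<SocketAddr> {
    let mut rng = rand::thread_rng();
    members
        .choose_multiple(&mut rng, num_probes)
        .copied()
        .collect()
}
```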
Sweet! I think I over-promised a bit on the general purpose broadcasting feature :sweat_smile:
I think when optimizing this sort of broadcast you might end up wanting a fancier way of prioritizing which packets to send first, maybe even some way of feeding back information about the currently known global state so you can prune/reorder the queue... That's why I think the way to go is disconnecting broadcasts from foca and just exposing some functionality to enable the use case.
Off the top of my head, I could expose choose_active_members (without the unneeded &mut self) to make picking members easier; but as you've noticed, rand's choose_multiple on the active members does the job just as well... Let me know if there's more stuff foca could expose to make this easier for you.
Yeah, we ended up implementing logic for some broadcast messages to go out immediately instead of buffering a few of them first.
Did foca do anything about re-broadcasting when a member joined after the initial buffer had already been broadcast? I guess it would behave like that if it didn't have anybody to broadcast to? In my new logic, I'm not even retrying broadcasts if there are no members to send them to. I assume foca kept trying for a while until it could send its data to as many members as num_indirect_probes?
That's right, foca doesn't try to do anything smart. It simply buffers a broadcast and tries to send it Config::max_transmissions times; it doesn't keep track of who it sent to, so if there's a single active member in the cluster, they will keep receiving the same broadcast over and over until the counter reaches zero.
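Roughly, the counter behavior described above looks like this (illustration only, not foca's real internals):

```rust
// Each queued broadcast carries a countdown that starts at
// Config::max_transmissions. It gets re-sent to whoever is picked each
// round, with no memory of past recipients, and is dropped once the
// countdown reaches zero.
struct QueuedBroadcast {
    payload: Vec<u8>,
    remaining: u32, // initialized from config.max_transmissions
}

// Called after every send of this broadcast.
fn after_send(mut b: QueuedBroadcast) -> Option<QueuedBroadcast> {
    b.remaining -= 1;
    (b.remaining > 0).then_some(b) // keep only while transmissions remain
}
```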
closing stale issues that seem resolved. feel free to reopen