
Feature proposition: Fastd Load-Balancer/Controller

BarbarossaTM opened this issue 3 years ago · 4 comments

Hi,

I'd like to propose an idea for load-balancing fastd connections over a set of known end-points.

We (Freifunk Hochstift) are setting up more POPs with different upstreams and I would like to steer client connections from CPEs to the nearest POP (for whatever definition of near, see strategies below).

The general idea is to set up (a number of) fastd LBs/controllers which the fastd "client" will connect to, if configured to do so, and which will point it to its nearest server. For security reasons the server will be specified as a string which has to be part of the client's fastd.conf (FQDN + port or IP + port). (Maybe the check against the config can be deactivated with a config option in the client, if people really want that.) The fastd client will then connect to the hinted server (if found within its config) or to any peer if there was no match.

The only role of the LB/controller will be to hint a client to a fastd peer; it will never see any traffic. Redundancy will be achieved by having multiple controllers configured in the client config and by allowing multiple DNS RRs for one FQDN. The controller should allow implementing multiple strategies. The first one will be "same AS", where for example CPEs with an IP from $ISP will be directed to a peer within the same ASN. It would probably be cool to allow using multiple strategies in a configured order (first match wins). There might be other ideas for useful strategies, like "bandwidth available", "CPU usage", etc. (wherever this information comes from is left as an exercise for the operator).
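A first-match-wins strategy chain could be sketched roughly like this (the strategy names, peer attributes, and data layout are all illustrative assumptions, not part of the proposal):

```python
# Illustrative first-match-wins strategy chain for a controller.

def same_as_strategy(client, peers):
    # "same AS": prefer peers within the client's ASN.
    return [p for p in peers if p["asn"] == client["asn"]]

def lowest_load_strategy(client, peers):
    # Fallback: prefer the least-loaded peer.
    return sorted(peers, key=lambda p: p["load"])[:1]

def pick_peer(client, peers, strategies):
    # Try strategies in configured order; first one with a match wins.
    for strategy in strategies:
        matches = strategy(client, peers)
        if matches:
            return matches[0]
    return None  # no match: client falls back to any configured peer

peers = [
    {"name": "gw01", "asn": 65001, "load": 0.7},
    {"name": "gw02", "asn": 65002, "load": 0.2},
]
client = {"asn": 65002}
print(pick_peer(client, peers, [same_as_strategy, lowest_load_strategy])["name"])
# → gw02 (same-AS match wins before the load-based fallback is tried)
```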

What I'm unsure about is how to handle multiple "sites", "domains" or how ever those are called, as in a Freifunk setup not all peers might have all sites/domains configued for various reasons. So the client has to specify the site/domain or the controller has to listen on multiple ports and deduce this information from there (we use different ports for different sites/domains).

I propose a text-only protocol like `getBestPeer [SITE]`, which will return `peerId IDENTIFIER` and/or `peerInfo FQDN PORT`, one per line.
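A controller-side handler for this text protocol might look like the following sketch. The command and reply keywords (`getBestPeer`, `peerId`, `peerInfo`) follow the proposal; the site names, peer names, and the static lookup behind `choose_peer()` are placeholder assumptions:

```python
# Hypothetical controller-side handler for the proposed text protocol.

def choose_peer(site):
    # Placeholder strategy: a static site -> (id, fqdn, port) map.
    peers = {
        "city-a": ("gw01", "gw01.example.org", 10000),
        "city-b": ("gw02", "gw02.example.org", 10001),
    }
    return peers.get(site)

def handle_request(line):
    parts = line.strip().split()
    if not parts or parts[0] != "getBestPeer":
        return "error unknown command"
    site = parts[1] if len(parts) > 1 else None
    match = choose_peer(site)
    if match is None:
        return "error no peer"
    peer_id, fqdn, port = match
    # One reply item per line, as proposed.
    return f"peerId {peer_id}\npeerInfo {fqdn} {port}"

print(handle_request("getBestPeer city-a"))
# → peerId gw01
#   peerInfo gw01.example.org 10000
```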

This whole endeavor will obviously increase the time needed to set up a connection, but will most likely improve performance and latency afterwards. I intend to hack together a PoC for the LB in Python shortly, and if that turns out to be anything like I hope, I guess I will implement a more production-ready one in Golang and contribute patches for fastd.

I would welcome feedback on this, especially on the site/domain part :)

BarbarossaTM · Oct 17 '20 14:10

For the purpose of load-balancing fastd connections, a very simple way to make a connection less likely is to delay the handshake response: as long as the response is received within 15s, the handshake will still succeed. As fastd will try to connect to all peers of a peer group at the same time [1], but only the first n handshake replies (for a peer limit n) will actually establish a connection, statistically the peers with the shortest delay will be chosen.
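A toy model of this selection rule: given per-peer handshake delays, only the first n replies win. The peer names and delay values below are made up for illustration:

```python
# Toy model: only the first `limit` handshake replies establish a
# connection, so peers that delay their reply the least are chosen.

def select_peers(delays, limit):
    """delays: peer name -> handshake reply delay in seconds.
    Returns the peers whose replies arrive first, up to the peer limit."""
    ordered = sorted(delays, key=delays.get)
    return ordered[:limit]

# A loaded server adds 2s, an idle one replies immediately.
delays = {"peer-idle": 0.0, "peer-medium": 0.8, "peer-loaded": 2.0}
print(select_peers(delays, limit=1))
# → ['peer-idle']
```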

The chosen delay could be handled by an external hook; I actually started implementing such a feature a while ago (in 2016), but I never finished it. The logic of the hook would be out of scope for fastd itself, but both local (based on system load, peer AS, ...) and global (exchanging information with other servers or a global controller) strategies seem feasible.

An advantage of this scheme is that it doesn't require any changes for existing fastd "clients" (except for the below note, which is only a nice-to-have improvement).

A slightly more involved solution, requiring a small protocol extension, would be to add a "preference" record to the fastd handshake, allowing a client to make its choice of which connection to accept explicit, and reducing additional delays in connection establishment. Both approaches could be combined (signaling whether a "preference" is supported using the "flags" handshake record, and using delay-based steering for connections using an old version of fastd).

I realize that this is a completely different design than what you have in mind, but I don't think that layering a completely separate protocol around fastd connection establishing, with additional infrastructure for the controller that needs to be set up and maintained, is the right way forward for a tool like fastd that aims for small size and simplicity.

Notes: [1] This is not entirely true at the moment - initial handshakes will be delayed by DNS resolution, and after the initial handshakes, the handshakes to different peers will desynchronize due to random delays. I believe that these issues can be addressed without too much trouble.

neocturne · Oct 28 '20 22:10

I have drafted an implementation of @NeoRaider's suggestion.

We (Freifunk Stuttgart) will experiment with this approach and, if it works, I will make a PR.

nrbffs · Apr 23 '21 21:04

We have since implemented the delay-based approach and have documented some details in our wiki (German).

We have two main findings:

  • fastd implements a random delay between 0 and 3 seconds when sending the initial handshake. For the load balancing, this is undesirable. We have a Gluon patch to remove the delay. This could become a main configuration option like `randomize peer choice on|off`. What do you think?
  • We now have an `on verify` script to implement the delay. This unfortunately means the script needs to re-implement the logic which looks up the key in the file system. We have a draft patch to add a hook specifically for delaying the handshake.

These two patches make the random delay a viable approach to improve load balancing.

nrbffs · May 15 '21 20:05

> * fastd implements a random delay between 0 and 3 seconds when sending the initial handshake. For the load balancing, this is undesirable. We have a [Gluon patch to remove the delay](https://gitlab.freifunk-stuttgart.de/firmware/ffs-pipeline-nightly/-/blob/6fcf96d1e30710300dd84fffaba4839389d16712/patches/0002-add-patch-to-remove-fastd-random-delay-on-inital-han.patch). This could become a main configuration option like `randomize peer choice on|off`. What do you think?

Makes sense. I'll have to think about a good name for the setting.

> * We now have an `on verify` script to implement the delay. This unfortunately means the script needs to re-implement the logic which looks up the key in the file system. We have a [draft patch to add a hook specifically for delaying the handshake](https://github.com/nrbffs/fastd/commit/b11a18025f5fab2a35bf5f64dc9ff04d122d6625).

Both options seem okay to me:

  • An additional option could be added so that on verify is run for known peers as well (with the script being told through the environment whether the peer is known). This would allow adding a delay in on verify without having to reimplement the key matching. It would also make on verify more versatile, so this might be the way to go.
  • A separate command would be possible, but it would require a more complex implementation (like the on verify one) to execute the command asynchronously and run a callback that continues the handshake when the command finishes. Most of fastd is single-threaded, so running a command synchronously would block all operation of fastd, including packet forwarding for existing connections.
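The first option could look roughly like the sketch below. Note that the `PEER_KNOWN` variable name is purely hypothetical; fastd does not currently set it, and how the known/unknown status would be exported is exactly what such a patch would have to define:

```python
#!/usr/bin/env python3
# Sketch of the first option: fastd would run "on verify" for known
# peers too and export whether the peer matched a configured key.
# PEER_KNOWN is a hypothetical variable name.

import os
import sys
import time

def delay_for(known, load, max_delay=3.0):
    # Delay only known peers (load-proportional, capped); unknown
    # peers get no delay since they are rejected immediately anyway.
    if not known:
        return 0.0
    return min(max(load, 0.0), max_delay)

if "PEER_KEY" in os.environ:  # only act when actually invoked by fastd
    known = os.environ.get("PEER_KNOWN") == "1"
    time.sleep(delay_for(known, os.getloadavg()[0]))
    sys.exit(0 if known else 1)
```

This avoids reimplementing the key lookup in the script entirely: the script only decides on the delay, while fastd keeps doing the matching.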

neocturne · Jun 19 '21 07:06