cassabon
Improve cluster stability
Right now, if we lose a node, we lose the stats for the buckets that node handles until it comes back up. Let's make it more robust so that we can lose part of a cluster and still retain stats.
Design Brief
- Introduce the concept of "my paths" vs "guest paths" in the internal data representation
- Note when forwarding a path to its home server fails
- On failure to forward a path to its home server, notify all remaining servers of the failure for synchronization purposes
- Recalculate the hash index based on the number of remaining servers, to determine which server will guest host each path until the dead server comes back up
- When the dead server comes back up, broadcast that information to the set
- On receipt of the broadcast, all servers with guest rollups will transfer them to the restored server
That's the rough sketch; details will be worked out when implementing.
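The steps in the brief can be illustrated with a minimal Go sketch. Everything in it is an assumption made for illustration, not cassabon's actual code: the `Cluster` type, its method names, the FNV-1a hash, and the sorted server list used so that every node computes the same index. It only shows the home/guest decision and which guest paths go back to a restored server; failure detection and the broadcasts themselves are left out.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Cluster is a hypothetical view of the server set, as held by each node.
type Cluster struct {
	servers []string        // all configured servers, sorted for a stable index
	dead    map[string]bool // servers currently believed to be down
}

// NewCluster copies and sorts the server list; it assumes at least one server.
func NewCluster(servers []string) *Cluster {
	s := append([]string(nil), servers...)
	sort.Strings(s)
	return &Cluster{servers: s, dead: make(map[string]bool)}
}

// hashPath is an assumed hash function (FNV-1a), chosen only for the sketch.
func hashPath(path string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(path))
	return h.Sum32()
}

// Home returns the server that owns path when every server is up.
func (c *Cluster) Home(path string) string {
	return c.servers[hashPath(path)%uint32(len(c.servers))]
}

// alive returns the servers not marked dead, in sorted order.
func (c *Cluster) alive() []string {
	var out []string
	for _, s := range c.servers {
		if !c.dead[s] {
			out = append(out, s)
		}
	}
	return out
}

// Host returns the server that should hold path right now. If the home
// server is dead, the index is recalculated over the remaining servers
// and guest is true. It returns "" when no server is alive.
func (c *Cluster) Host(path string) (server string, guest bool) {
	home := c.Home(path)
	if !c.dead[home] {
		return home, false
	}
	alive := c.alive()
	if len(alive) == 0 {
		return "", false
	}
	return alive[hashPath(path)%uint32(len(alive))], true
}

// MarkDead records a failed forward; MarkAlive records a restore broadcast.
func (c *Cluster) MarkDead(server string)  { c.dead[server] = true }
func (c *Cluster) MarkAlive(server string) { delete(c.dead, server) }

// GuestsFor returns the paths among held whose home is the restored
// server, i.e. the guest rollups that should be transferred back to it.
func (c *Cluster) GuestsFor(restored string, held []string) []string {
	var out []string
	for _, p := range held {
		if c.Home(p) == restored {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	c := NewCluster([]string{"stats1", "stats2", "stats3"})
	path := "app.web.requests"
	home := c.Home(path)

	c.MarkDead(home) // a forward to the home server failed
	host, guest := c.Host(path)
	fmt.Println("home:", home, "current host:", host, "guest:", guest)

	c.MarkAlive(home) // the restore broadcast was received
	fmt.Println("transfer back:", c.GuestsFor(home, []string{path}))
}
```

Rehashing over only the surviving servers moves just the dead server's paths; paths whose home is still alive stay where they are.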
@mredivo #100 addresses all of this, and this can be closed, right?
The preliminary work is complete with #100; now the implementation can begin.
Gotcha.
@mredivo: What's left as far as implementation is concerned? Will you be able to get to this over the break, or should I take a stab at it?
Go ahead and give it a shot; feel free to ping me for details. It will be a few weeks before my plate isn't completely full again.
The design brief a few comments up covers the general idea.
@mredivo Question...
What if we used a gossip implementation instead? Seems like that could solve quite a few problems.
There's Serf as well, which seems to be a more robust implementation of memberlist. It's a standalone agent, though.
Looks worth studying, will take a detailed look later.