Improve cluster stability

Open jeffpierce opened this issue 9 years ago • 9 comments

Right now, if we lose a node, we lose the stats that node handles the buckets for until it comes back up. Let's make it more robust so that we can lose some of a cluster and still retain stats.

jeffpierce avatar Nov 04 '15 20:11 jeffpierce

Design Brief

  • Introduce the concept of "my paths" vs "guest paths" in the internal data representation
  • Note when forwarding a path to its home server fails
  • On failure to forward a path to its home server, notify all remaining servers of the failure for synchronization purposes
  • Recalculate the hash index based on the number of remaining servers, to determine which server will guest host each path until the dead server comes back up
  • When the dead server comes back up, broadcast that information to the set
  • On receipt of the broadcast, all servers with guest rollups will transfer them to the restored server

That's the rough sketch; details will be worked out when implementing.
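
To make the brief concrete, here is a minimal Go sketch of the "home" vs "guest" lookup, written under stated assumptions and not taken from the cassabon code base. The names `Ring`, `Home`, `Owner`, `MarkDown`, and `MarkUp` are hypothetical, and FNV-1a stands in for whatever hash the real index uses. It also assumes that only paths whose home server is down get re-hashed over the remaining servers; paths with a live home stay where they are.

```go
package main

import "hash/fnv"

// Ring is a hypothetical view of the cluster: the full, fixed-order server
// list plus the set of servers currently believed to be down.
type Ring struct {
	servers []string
	down    map[string]bool
}

// NewRing builds a Ring in which every server is considered up.
func NewRing(servers []string) *Ring {
	return &Ring{servers: servers, down: make(map[string]bool)}
}

// hashPath maps a stat path to a 32-bit value (FNV-1a is an assumption).
func hashPath(path string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(path))
	return h.Sum32()
}

// Home returns the server that owns the path when the whole cluster is up.
func (r *Ring) Home(path string) string {
	return r.servers[hashPath(path)%uint32(len(r.servers))]
}

// Owner returns the server that should hold the path right now. If the home
// server is down, the index is recalculated over the remaining servers and
// guest is true. An empty server name means no server is left.
func (r *Ring) Owner(path string) (server string, guest bool) {
	home := r.Home(path)
	if !r.down[home] {
		return home, false
	}
	var alive []string
	for _, s := range r.servers {
		if !r.down[s] {
			alive = append(alive, s)
		}
	}
	if len(alive) == 0 {
		return "", false
	}
	return alive[hashPath(path)%uint32(len(alive))], true
}

// MarkDown records a failed forward to s (the "notify all servers" step).
func (r *Ring) MarkDown(s string) { r.down[s] = true }

// MarkUp records that s has returned (the "broadcast" step); guest rollups
// for paths whose Home is s would then be transferred back to it.
func (r *Ring) MarkUp(s string) { delete(r.down, s) }
```

In this sketch, every server that applies the same `MarkDown` and `MarkUp` events in its own `Ring` computes the same guest host for a path, which is why the brief calls for notifying all remaining servers. After `MarkUp`, `Owner` returns the home server again, so any server still holding a guest rollup for that path knows where to send it.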

mredivo avatar Nov 09 '15 18:11 mredivo

@mredivo #100 addresses all of this, and this can be closed, right?

jeffpierce avatar Nov 23 '15 23:11 jeffpierce

The preliminary work is complete with #100; now the implementation can begin.

mredivo avatar Nov 25 '15 00:11 mredivo

Gotcha.

jeffpierce avatar Nov 25 '15 06:11 jeffpierce

@mredivo: What's left as far as implementation is concerned? Will you be able to get to this over the break, or should I take a stab at it?

jeffpierce avatar Dec 17 '15 20:12 jeffpierce

Go ahead and give it a shot; feel free to ping me for details. It will be a few weeks before my plate isn't completely full again.

The design brief a few comments up covers the general idea.

mredivo avatar Dec 17 '15 23:12 mredivo

@mredivo Question...

What if we used a gossip implementation instead? Seems like that could solve quite a few problems.

jeffpierce avatar Apr 22 '16 20:04 jeffpierce

There's Serf as well, which seems to be a more robust implementation of memberlist. It's a standalone agent, though.

jeffpierce avatar Apr 22 '16 20:04 jeffpierce

Looks worth studying, will take a detailed look later.

mredivo avatar Apr 22 '16 22:04 mredivo