Not possible to resize down cluster with replication factor 1
For feature requests, please provide the following:
Description
Cannot resize the cluster down after being forced to over-provision it to handle import load.
```
$ curl coordinator:10101/cluster/resize/remove-node -X POST -d '{"id": "905214dc-d01f-4182-9328-93bc2ca4584b"}'
removing node: calling node leave: generating job: getting sources: not enough data to perform resize (replica factor may need to be increased)
```
Success criteria (What criteria will consider this ticket closeable?)
It should be possible to move the shards owned by the node being removed to other nodes and then perform the resize.
Thanks @dmibor. I assume your replication is set to 1. That is not the problem itself, but it does expose it. The remove-node function excludes the node being removed from the list of sources, here: https://github.com/pilosa/pilosa/blob/master/cluster.go#L780
The thinking there was that one would be removing a node from the cluster because the node was no longer available, and therefore couldn't act as a source. But for down-sizing a cluster, that's not necessarily the case; in fact it's likely that the node being removed is still available as a source.
So we'll need to modify this logic a bit to take the node's health into consideration before excluding it as a source. (The reason this doesn't matter when replication > 1 is that in that case at least one other node can act as a source for the fragments on the node being removed.)
We'll take a look at this and try to have that node-state logic considered.
It's possible to handle this use case. If remove-node is called on a cluster with replication 1, we just need to check that all nodes are available. If they are, we can skip the check that prevents the removed node from being considered as a source of fragment data, and it is safe to proceed with the node removal.