
Add fault domain awareness to storage

Open plangdale-roblox opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe

We are moving to a new infrastructure architecture that will define formal hardware fault domains. To operate properly in this environment, and particularly because maintenance operations will be aligned to these fault domains, we need to be able to deploy our storage nodes in a way that is also aligned, to avoid loss of data or loss of availability.

Describe the solution you'd like

We would like to see:

  • vm storage gain the ability to place storage nodes into groups that align to fault domains
  • vminsert be aware of these groups so that data replicas are always placed in different groups
  • support for more groups than the replication factor.
    • e.g.: if the replication factor is 3 and there are 4 groups, data will be placed randomly in 3 out of the 4 groups (so that the overall distribution of data stays balanced). If one of the 4 groups is unavailable, the data will be placed in the other three; only if 2 groups were unavailable would inserts fail.
    • similarly, vmselect can succeed if any two groups are available

Describe alternatives you've considered

  • We discussed using multi-level vminsert/vmselect on top of one cluster per fault domain, but there are known performance bottlenecks in this architecture, and it will not work correctly today because each of these clusters would need replicationFactor=1. There is also increased operational overhead from running multiple clusters this way (and we are running multiple clusters to begin with), as well as resource-provisioning complexity in appropriately sizing the vminsert/vmselect deployments.
  • We also discussed separate clusters per fault domain using vmagent multi-write but this doesn't have a good solution for queries, and has the same multi-cluster operational complexity as above.

Additional information

  • #3531 describes fault domain alignment at a high level in a way that's consistent with the ask here, but there appear to be notable differences in the intended use case, so I do not want to simply point to that issue.

plangdale-roblox avatar Apr 02 '24 19:04 plangdale-roblox

vminsert be aware of these groups so that data replicas are always placed in different groups

I'm afraid this could be a no-go feature for vminsert. If one vmstorage in any group stops responding, vminsert will have to re-route its traffic to another replica. Traffic rerouting is a resource-intensive process that puts additional pressure on the remaining replicas, making them process more data. These replicas may struggle for a short period while registering the new series, and during this period the whole ingestion path might be blocked or degraded, as vminserts aren't supposed to buffer data. This is why flags like -disableReroutingOnUnavailable or -storage.vminsertConnsShutdownDuration exist.
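For context, here is a sketch of how the rerouting knobs mentioned above are set (hostnames and node count are hypothetical; check the cluster docs for defaults):

```shell
# Hypothetical vminsert invocation: with -disableReroutingOnUnavailable,
# data destined for an unavailable vmstorage node is not re-routed to the
# remaining nodes, trading ingestion availability for stable load on the
# healthy replicas.
/path/to/vminsert \
  -disableReroutingOnUnavailable \
  -storageNode=vmstorage-1:8400 \
  -storageNode=vmstorage-2:8400 \
  -storageNode=vmstorage-3:8400
```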

Let's try to put this on paper: [image]

On the image above, vminsert gains the concept of storage groups. However, if gr1/storage-1 goes offline, the whole cluster will "feel" it through ingestion delay and a resource-usage spike caused by re-routing.

Recent versions of VictoriaMetrics have significantly improved re-routing performance. However, its impact is still significant under high workloads, so the best way to avoid the re-routing penalty is to not have re-routing at all. Let's see how "separate clusters per fault domain using vmagent multi-write" could look: [image]

Here, instead of top-level vminserts we have vmagents, each configured with 2 remote-write URLs. Each vmagent writes 2 copies of the data and maintains an independent queue Q for each destination. Each destination is a vminsert in a specific group; that vminsert is responsible for sharding and, optionally, replication within the group. On the read path, vmselect reads from all vmstorages, merges the results and responds to the user.
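As a sketch, one vmagent from the diagram could be started like this (hostnames and paths are hypothetical):

```shell
# By default vmagent replicates all incoming data to every configured
# -remoteWrite.url. Each URL gets its own persistent queue under
# -remoteWrite.tmpDataPath, so an unavailable group only causes local
# buffering, never re-routing between groups.
/path/to/vmagent \
  -remoteWrite.tmpDataPath=/var/lib/vmagent-queues \
  -remoteWrite.url=http://vminsert-gr1:8480/insert/0/prometheus/api/v1/write \
  -remoteWrite.url=http://vminsert-gr2:8480/insert/0/prometheus/api/v1/write
```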

Let's see what happens if we unplug Gr1 from the power outlet: [image]

The ingestion path will remain fine. Thanks to the persistent queue, vmagent will start buffering data for Q1 on local disk. No re-routing happens, and the other groups remain unaffected.

On the read path, we have an issue with the replication factor: vmselect will not respond with success because more vmstorages than the replication factor are unavailable. Here, I propose to introduce a new flag -replicationFactorGroup which would completely ignore errors within one or more groups. Basically, it implements the "vmselect can succeed if any two groups are available" behavior described above. In addition, vmselect could be configured with a storage-groups setting to define what it means for a group to fail.

I think this option with vmagents instead of vminserts is more viable, as it excludes the chance of facing re-routing issues. Once we turn Group1 back on, everything continues to work as before: Q1 starts to drain and traffic flows. Even though Group1 will not have all the data until the queue is drained, read queries should remain consistent, as vmselect merges results across all groups and fills the gaps. However, it is important not to lose or turn off the other storage group until the queue is drained and consistency is restored.
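Before doing maintenance on another group, the drain progress can be observed on vmagent's metrics endpoint (a sketch; the hostname is hypothetical):

```shell
# vmagent exposes the size of its persistent queues per remote-write URL.
# Wait until the pending bytes for the recovered group's queue drop back
# to near zero before taking any other group offline.
curl -s http://vmagent:8429/metrics | grep vmagent_remotewrite_pending_data_bytes
```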


I'd like to propose another alternative for an HA setup here. From my understanding, the "replication factor is 3 and there are 4 groups" setup is supposed to store 3 copies of the data across 4 groups. It means none of the groups contains 100% of the data or can handle 100% of the ingestion. But maybe it is worth having fewer groups, where each group can handle 100% of the workload? Let's consider re-distributing the 2 groups into 2 independent VM clusters, each containing 100% of the data. This topology is described here. It provides the following benefits:

  1. Each cluster (AZ) can be disabled without impacting the ingestion or read path
  2. vmselect needs to process less data, as it queries only one cluster (AZ) at a time. This is expected to reduce query latency by at least 2x, and vmselect resource requirements along with it.
  3. vmstorage nodes need to serve fewer reads, as each one stores fewer copies of the data.

This topology of interchangeable, independent VM clusters seems the most stable and predictable in terms of resource usage. It also opens up good maintainability opportunities, as any maintenance work can be performed on the disabled cluster and its results tested without affecting the production ingestion/read path.
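For the "query only one cluster at a time" read path, one option is to put vmauth in front of the two clusters. This is only a sketch under assumptions (hostnames, paths and the exact vmauth config keys should be checked against the vmauth docs):

```shell
# Hypothetical vmauth setup: list both clusters' vmselect endpoints as
# backends, so a query is served by one cluster and can be retried on the
# other if the first one is unavailable.
cat > /etc/vmauth/auth.yml <<'EOF'
unauthorized_user:
  url_prefix:
    - http://vmselect-az1:8481/select/0/prometheus
    - http://vmselect-az2:8481/select/0/prometheus
EOF
/path/to/vmauth -auth.config=/etc/vmauth/auth.yml
```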

hagen1778 avatar Apr 15 '24 15:04 hagen1778

The commit 90794e84bc75d0524c090cd57d69123f7224cf43 adds support for -globalReplicationFactor command-line flag to vmselect. This flag allows configuring the cross-group replication factor at vmselect, so it could continue returning full responses if up to -globalReplicationFactor - 1 vmstorage groups are unavailable. See these docs for more details.

This commit will be included in the next release. In the meantime, it is possible to test the -globalReplicationFactor command-line flag by building vmselect from this commit according to these docs.
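As a sketch of the new flag in use (hostnames are hypothetical, and the group/address form of -storageNode is assumed from the cluster docs):

```shell
# vmselect with two named vmstorage groups. With -globalReplicationFactor=2,
# responses stay complete even if every vmstorage node in one of the two
# groups is unavailable.
/path/to/vmselect \
  -globalReplicationFactor=2 \
  -storageNode=gr1/vmstorage-1:8401 \
  -storageNode=gr1/vmstorage-2:8401 \
  -storageNode=gr2/vmstorage-3:8401 \
  -storageNode=gr2/vmstorage-4:8401
```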

valyala avatar Apr 18 '24 22:04 valyala

The commit 8f535f9e76c02b52fa1072cc1bce12f480923f4f adds support for the -remoteWrite.shardByURLReplicas command-line flag to vmagent. This flag allows setting up data replication among fault domains, in addition to sharding, when the number of fault domains is bigger than the replication factor. For example, the following command instructs vmagent to shard data among all the 4 provided -remoteWrite.url destinations, sending 3 copies of every outgoing sample to different destinations:

/path/to/vmagent \
  -remoteWrite.shardByURL \
  -remoteWrite.shardByURLReplicas=3 \
  -remoteWrite.url=http://host1/api/v1/write \
  -remoteWrite.url=http://host2/api/v1/write \
  -remoteWrite.url=http://host3/api/v1/write \
  -remoteWrite.url=http://host4/api/v1/write

This allows using a single vmagent for distributing the data among all the fault domains with the replication factor 3 in the scheme above.

This commit will be included in the next release.

valyala avatar Apr 19 '24 09:04 valyala

Cmd-line flags -globalReplicationFactor and -remoteWrite.shardByURLReplicas are available starting from v1.101.0 release.

hagen1778 avatar Apr 26 '24 09:04 hagen1778

@hagen1778 @valyala

As I was going through my checklist, I realised that we also need shardByURLReplicas support in vmalert so that recording rule output samples can be correctly placed in a replica group.

We do not currently require a single vmalert instance to write to multiple groups (in our 'sharded' clusters, one shard is dedicated to recording rule outputs), but for consistency and future-proofing, it seems sensible for vmalert to have the same functionality as vmagent in this regard (#6212).

Thanks!

plangdale-roblox avatar May 23 '24 20:05 plangdale-roblox

I realised that we also need shardByURLReplicas support in vmalert so that recording rule output samples can be correctly placed in a replica group.

vmalert doesn't allow configuring more than one remoteWrite.url or remoteRead.url. Sharding/replicating data is expected to be the responsibility of vmagent.

it seems sensible for vmalert to have the same functionality as vmagent in this regard (https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6212).

I do not agree. This would add unnecessary complexity to vmalert and doesn't align with its purpose. However, it is still possible to route/shard/replicate data via vmagent: vmalert can be configured to send its data to vmagent via the remote-write protocol.
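A sketch of this chaining (hostnames and paths are hypothetical): vmalert writes its recording-rule results to a vmagent, which then shards/replicates them across the fault-domain groups.

```shell
# vmalert evaluates rules against vmselect and remote-writes the results to
# vmagent's remote-write endpoint; vmagent handles sharding/replication.
/path/to/vmalert \
  -rule=/etc/vmalert/rules.yml \
  -datasource.url=http://vmselect:8481/select/0/prometheus \
  -remoteWrite.url=http://vmagent:8429/api/v1/write
```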

hagen1778 avatar May 27 '24 07:05 hagen1778

@hagen1778 I've discussed this more internally, and although we have examples today where we are directly writing from vmalerts to vminserts, we agree that there's no reason why we have to do this, and we will update these configurations to write via vmagents.

plangdale-roblox avatar May 28 '24 17:05 plangdale-roblox