
Service discovery (DNS SD) for cluster version

hekmon opened this issue 5 years ago · 5 comments

We currently manage our VM clusters on plain virtual machines with Puppet: scaling out requires a pass on our Puppet manifests to update the node list, which is then converted into flags for VM and triggers a restart of each impacted service.

But as we are deploying Grafana on EKS/Fargate, I was wondering how VM could be deployed on such a platform (you do provide an official Docker image). The biggest issue I see is auto-scaling: each service needs to be updated with the new node list every time one of them scales out. On ECS this means a new task revision and then a manual update of each service, which is not really elastic.

So I was wondering about service discovery (DNS SD), as AWS provides that for each managed container, but it would also be useful on plain virtual machines (e.g. with Consul). For example, instead of adding several -storageNode flags, one could simply specify -storageNodeDiscovery record_with_SRV_recordtypes.namespace (same on select nodes).
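To make the idea a bit more concrete, here is a minimal Go sketch (just an illustration, not an implementation proposal; the record name vmstorage.service.consul is hypothetical) of how a single SRV lookup could replace the repeated -storageNode flags:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Hypothetical SRV record published by the discovery layer (Consul,
	// AWS Cloud Map, ...), one entry per vmstorage instance. The record
	// name is only an example.
	_, srvs, err := net.LookupSRV("", "", "vmstorage.service.consul")
	if err != nil {
		panic(err)
	}

	// Each SRV entry carries a target host and a port: the same
	// information that is currently passed via repeated -storageNode flags.
	for _, s := range srvs {
		fmt.Printf("-storageNode=%s:%d\n", s.Target, s.Port)
	}
}
```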

This would simplify horizontal scale-out by using Consul (or AWS service discovery when using ECS, for example). Each insert node could become aware of new storage nodes without reconfiguration and restart, and the same goes for select nodes: each service could be independently updated (scaled out) without reconfiguring the others.

I am aware that this could increase the complexity of the current KISS workflow, but I think the feature would still be worth it (it could be an additional working mode, exclusive with the regular -storageNode one).

hekmon avatar Dec 17 '19 14:12 hekmon

Thanks for the detailed feature request!

The following questions must be addressed before proceeding with automatic discovery of vmstorage nodes:

  • vmstorage layer doesn't support downscaling yet, i.e. it is impossible to reduce the number of vmstorage nodes without data loss. How to avoid improper downscaling when the number of storage nodes in DNS SD decreases by accident?

  • The order of vmstorage nodes passed to vminsert via the -storageNode flag must be constant across vminsert nodes in order to keep optimal time series mapping among vmstorage nodes. New vmstorage nodes must be added to the end of the list in order to reduce time series re-shuffling between the new set of vmstorage nodes. How to enforce these properties in DNS SD?

  • When a new vmstorage node is added to the cluster, it is important to add this node to vmselect configs first, so they start querying the newly added vmstorage node. Then the node can be added to vminsert configs, so they start writing data to the newly added vmstorage node. If the order of config updates is reversed or randomized, then vmselect nodes could return incomplete results until they are restarted with the new config containing the new vmstorage node. How to enforce the order of config updates in the DNS SD case?

  • Currently all the configs for the cluster components are explicitly set via command-line flags. This allows saving these configs in version control systems such as git, so config changes could be easily tracked, audited and rolled back if needed. It is unclear how to achieve the same properties for vmstorage lists managed via DNS SD.

  • DNS SD introduces an additional component to the system. It may require additional effort to set up and operate. It could also become an additional point of failure, which could reduce the availability and reliability of the whole system. For instance, the whole system may become unusable if DNS SD stops working. How to deal with these cases?

  • Currently it is possible to pass different vmstorage lists to vminsert and vmselect nodes. This can be useful when you need to stop writing new data to certain vmstorage nodes while keeping the ability to query historical data from these nodes. How to achieve this in the DNS SD case?

valyala avatar Dec 17 '19 16:12 valyala

Wow, I did not think this would open Pandora's box :)

I don't pretend to have answers to all of these questions, but here are some thoughts:

vmstorage layer doesn't support downscaling yet, i.e. it is impossible to reduce the number of vmstorage nodes without data loss. How to avoid improper downscaling when the number of storage nodes in DNS SD decreases by accident?

In a classic DNS SD environment, having fewer entries does not necessarily mean a scale-down: one host might currently be failing its health check and has therefore been removed from the list, which means it could come back up again. Could you consider it an offline node and wait for it to come back? When a storage node is down, VM continues to work: new data is not sent to it and selects are partial but still executed (unless the no-partial flag is used). Could you just stick to this behavior?

The order of vmstorage nodes passed to vminsert via the -storageNode flag must be constant across vminsert nodes in order to keep optimal time series mapping among vmstorage nodes. New vmstorage nodes must be added to the end of the list in order to reduce time series re-shuffling between the new set of vmstorage nodes. How to enforce these properties in DNS SD?

SRV records do have priority & weight values which could be used for this, but that would increase the complexity of a workflow that is supposed to simplify the whole thing (and it does sound like a workaround). Sorting by target value could work in a classic environment, but won't work in an environment like ECS where SD uses a UID as the target. TL;DR: I don't see how... Is time series re-shuffling absolutely out of the question?

When a new vmstorage node is added to the cluster, it is important to add this node to vmselect configs first, so they start querying the newly added vmstorage node. Then the node can be added to vminsert configs, so they start writing data to the newly added vmstorage node. If the order of config updates is reversed or randomized, then vmselect nodes could return incomplete results until they are restarted with the new config containing the new vmstorage node. How to enforce the order of config updates in the DNS SD case?

OK, you got me, DNS SD won't be enough here. The only way I see it: insert nodes should take their storage list from a validated list used by the select nodes. But clustering + a source of truth means something like etcd, which is far away from the VM KISS implementation. This is clearly orchestration, which DNS SD can't provide on its own.

Currently all the configs for the cluster components are explicitly set via command-line flags. This allows saving these configs in version control systems such as git, so config changes could be easily tracked, audited and rolled back if needed. It is unclear how to achieve the same properties for vmstorage lists managed via DNS SD.

I don't see an issue here: when versioning you want to keep configuration values, but you could want your N-1 configuration on your actual cluster (meaning more nodes). Also, configuring a Docker image is best done through ENV vars. On ECS, task revisions are versioned for either the ENV or the CMD, and neither should include the cluster size (which is managed on the service, not the task). I think the two should not be correlated and should be tweaked/reverted independently.

DNS SD introduces an additional component to the system. It may require additional effort to set up and operate. It could also become an additional point of failure, which could reduce the availability and reliability of the whole system. For instance, the whole system may become unusable if DNS SD stops working. How to deal with these cases?

True, but when you choose to work with DNS SD you accept it as a critical component :) You could however mitigate certain issues: losing access to your DNS SD record could fire a warning log, but not reset the actual node list: keeping the last valid configuration is often enough to give you time to fix the DNS SD service. But even this safety might not be enough: if auto-scaling kicks in at just that moment, new nodes won't have a previous valid list. At this point I think a new node configured to use DNS SD should exit/fatal if the DNS SD service is unreachable, or at least be in a stale state indicated on its health endpoint (keeping it out of the load balancer).
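To illustrate the mitigation above, here is a minimal Go sketch (assumptions: a hypothetical record name and a 30s refresh period; this is not how VictoriaMetrics actually manages its node list) that keeps the last valid node list when DNS SD is unreachable and fails fast when a brand-new node has nothing to fall back to:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"sort"
	"time"
)

// resolveStorageNodes returns the vmstorage node list from an SRV record
// ("vmstorage.sd.example.com" is a hypothetical name). When the lookup
// fails it keeps the last valid list instead of resetting it; a node that
// has never seen a valid list fails fast so it never joins the load
// balancer in an unusable state.
func resolveStorageNodes(record string, last []string) []string {
	_, srvs, err := net.LookupSRV("", "", record)
	if err != nil {
		if last == nil {
			log.Fatalf("DNS SD unreachable and no cached node list: %v", err)
		}
		log.Printf("warning: DNS SD lookup failed, keeping %d cached nodes: %v", len(last), err)
		return last
	}
	nodes := make([]string, 0, len(srvs))
	for _, s := range srvs {
		nodes = append(nodes, net.JoinHostPort(s.Target, fmt.Sprint(s.Port)))
	}
	sort.Strings(nodes) // naive stable ordering; see the ordering concern above
	return nodes
}

func main() {
	var nodes []string
	for {
		nodes = resolveStorageNodes("vmstorage.sd.example.com", nodes)
		log.Printf("current vmstorage nodes: %v", nodes)
		time.Sleep(30 * time.Second) // periodic refresh
	}
}
```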

Currently it is possible to pass different vmstorage lists to vminsert and vmselect nodes. This can be useful when you need to stop writing new data to certain vmstorage nodes while keeping the ability to query historical data from these nodes. How to achieve this in the DNS SD case?

For this I can only see one solution: different service names, one for the select nodes & one for the insert nodes. This means different health checks, and it might not be doable on the AWS SD / ECS platform where you get one DNS SD entry (even if you can specify the health check, you only have one).

hekmon avatar Dec 17 '19 17:12 hekmon

@hekmon , thanks for the answers! Such a conversation really helps to better understand possible solutions to the original use case!

In a classic DNS SD environment, having fewer entries does not necessarily mean a scale-down: one host might currently be failing its health check and has therefore been removed from the list, which means it could come back up again. Could you consider it an offline node and wait for it to come back?

It is unclear how to determine the full list of vmstorage nodes, including temporarily unavailable nodes, from DNS SD. The full list is needed for consistent sharding of incoming time series among vmstorage nodes. Consistent sharding means that all the data for the same time series goes to a single vmstorage node. This optimizes performance and reduces disk space usage compared to the case when the data for each time series is spread among multiple vmstorage nodes. Additionally, the order of vmstorage nodes in this list shouldn't change. For instance, if the list contains s1, s2 and s3 nodes and s2 temporarily goes offline, then the order must remain the same while s2 is offline and after s2 returns to the cluster.
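To make the dependency on the list's length and order concrete, here is a generic Go sketch (this is not VictoriaMetrics' actual sharding algorithm; node names and ports are made up) showing why dropping a temporarily offline node from the list re-maps time series:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// nodeForSeries maps a time series (identified by its canonical label string)
// to one entry in the vmstorage node list. This is a generic illustration:
// the result depends on both the length and the order of the list, so every
// vminsert must see the same full list, with offline nodes kept in place
// rather than dropped.
func nodeForSeries(labels string, storageNodes []string) string {
	h := fnv.New64a()
	h.Write([]byte(labels))
	return storageNodes[h.Sum64()%uint64(len(storageNodes))]
}

func main() {
	nodes := []string{"s1:8400", "s2:8400", "s3:8400"}
	series := `{__name__="http_requests_total",job="api",instance="a1"}`

	fmt.Println(nodeForSeries(series, nodes))
	// Dropping the temporarily offline s2 from the list re-maps many series:
	fmt.Println(nodeForSeries(series, []string{"s1:8400", "s3:8400"}))
}
```

With this kind of naive modulo hashing, most series change nodes whenever the list shrinks or grows by one, which is exactly the re-shuffling cost discussed in this thread.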

When a storage node is down, VM continues to work: new data is not sent to it and selects are partial but still executed (unless the no-partial flag is used). Could you just stick to this behavior?

This is how VictoriaMetrics cluster works now.

Is time series re-shuffling absolutely out of the question?

It hurts cluster performance and increases CPU usage, RAM usage and disk space usage compared to the case when the order of vmstorage nodes remains the same.

valyala avatar Dec 18 '19 22:12 valyala

Could it possibly use file-based SD like Prometheus/Thanos?

hanjm avatar Sep 11 '21 13:09 hanjm

I think there is a compelling use case for service discovery support in the cluster version, and I hope to help move this conversation along:

It is unclear how to determine the full list of vmstorage nodes, including temporarily unavailable nodes, from DNS SD.

One option would be to specify server ordering with the priority value in the SRV record, which could be used as a (potentially sparse) array index for the list of vmstorage nodes. This would ensure that the nodes are always sorted consistently, even as nodes get added or removed from the list.

Another option is for the SRV record to return different weight values for healthy vs temporarily unavailable nodes, which could be used to provide the full list of nodes while also distinguishing those which are currently unavailable.
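A minimal Go sketch of the first option (assuming the operator publishes the intended list position in the SRV Priority field; the record name is hypothetical), where a missing node simply leaves a gap so the remaining nodes keep their positions:

```go
package main

import (
	"fmt"
	"net"
)

// storageNodesByPriority resolves an SRV record and uses each entry's
// Priority field as an operator-assigned, stable position in the vmstorage
// node list. Positions of nodes that are temporarily missing from DNS stay
// empty, so the ordering survives additions and removals.
func storageNodesByPriority(record string) ([]string, error) {
	_, srvs, err := net.LookupSRV("", "", record)
	if err != nil {
		return nil, err
	}
	maxIdx := 0
	for _, s := range srvs {
		if int(s.Priority) > maxIdx {
			maxIdx = int(s.Priority)
		}
	}
	nodes := make([]string, maxIdx+1) // potentially sparse list
	for _, s := range srvs {
		nodes[s.Priority] = fmt.Sprintf("%s:%d", s.Target, s.Port)
	}
	return nodes, nil
}

func main() {
	nodes, err := storageNodesByPriority("vmstorage.sd.example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(nodes)
}
```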

wjordan avatar Feb 01 '22 17:02 wjordan

FYI, VictoriaMetrics gained support for automatic discovery of vmstorage nodes at vmselect and vminsert via SRV records starting from v1.83.0. See these docs. Closing this feature request as done.

valyala avatar Oct 29 '22 01:10 valyala

Hello @valyala, I was wondering if there are any considerations or intentions to open source this particular feature?

aierui avatar Dec 07 '23 08:12 aierui

Hello @valyala , we are migrating our setup from the Thanos + Prometheus stack to VictoriaMetrics. This feature is supported in Thanos. Could you please consider open sourcing this feature?

abhishekaj avatar Jan 29 '24 06:01 abhishekaj