
[Clustering] Filter the label "instance" from the hash computation

Open wildum opened this issue 1 year ago • 8 comments

PR Description

In clustering mode, the targets exposed by exporters are distributed among the agents. To decide whether it should scrape a given target, each agent computes a hash based on the target's labels.

We noticed that the agent adds an "instance" label to the targets, which defaults to the host where the agent runs. This is a problem: if the agents are not running on the same host, they will have different values for the "instance" label. This means they will compute different hashes, and the targets won't be distributed correctly.
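To illustrate the failure mode, here is a minimal sketch (not the actual agent code; the hash function and label set are illustrative) of hashing a target's labels to decide ownership:

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashLabels computes a stable hash over a target's labels. If any label
// value differs between peers (e.g. "instance" defaulting to the local
// hostname), the peers disagree on the hash and thus on ownership.
func hashLabels(labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable ordering so the hash is deterministic
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0})
		h.Write([]byte(labels[k]))
		h.Write([]byte{0})
	}
	return h.Sum64()
}

func main() {
	target := map[string]string{"job": "integrations/mysql", "instance": "host-a:12345"}
	onPeerA := hashLabels(target)

	target["instance"] = "host-b:12345" // the same target as seen from another peer
	onPeerB := hashLabels(target)

	fmt.Println(onPeerA == onPeerB) // false: the peers disagree on ownership
}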

Notes to the Reviewer

Is there a scenario where a target can only be uniquely identified by the instance label? If so, the agent would compute the same hash for different targets, which would result in an unbalanced cluster.

PR Checklist

  • [x] CHANGELOG.md updated
  • [x] Tests updated

wildum avatar Mar 28 '24 16:03 wildum

Converting to draft for now because we think it might be a risky change and we don't want it in 1.0

wildum avatar Mar 28 '24 17:03 wildum

I'm not convinced this is the right solution; it feels too arbitrary. One of clustering's main assumptions is that all nodes run with the same configuration files and that the upstream APIs (e.g. service discovery) behave the same way, so that the fully-local consistent hashing works properly. The prometheus.exporter.* components may set a custom instance key related to the system they're running on, but this is settable.

IMHO this could be part of the documentation for running clustering together with exporter components. Nice find nonetheless!

tpaschalis avatar Apr 01 '24 10:04 tpaschalis

In this example the agents have the same config files; they just run on different hosts. Do you think we should constrain the agents to run on the same host if they use clustering + prometheus exporters, or should we fix the exporters so that we don't set the instance label to the host where the agent is running? If possible, I think this "instance" label should be removed: its behavior is not consistent amongst exporters, and the agent's host should not be part of the targets' labels. But I guess removing it would be a breaking change if it's used in some pipeline or dashboard for filtering.

wildum avatar Apr 02 '24 07:04 wildum

We noticed that the agent adds an "instance" label to the targets, which defaults to the host where the agent runs.

Is this correct? According to this doc page and this blog post, the instance label should normally be set to the address of the scrape target.

ptodev avatar Apr 04 '24 11:04 ptodev

In the agent we set it by default to the host of the agent if I understand it correctly here: https://github.com/grafana/agent/blob/main/internal/component/prometheus/exporter/exporter.go#L179
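Roughly, the fallback pattern behind that line looks like this (a paraphrased sketch with hypothetical names, not the exact code at the link):

package exporter

import "os"

// defaultInstance sketches the fallback described above: when an exporter
// does not supply its own instance key, the label defaults to the hostname
// of the machine the agent runs on, which differs per cluster peer and
// therefore changes the target hash.
func defaultInstance(instanceKey func() (string, error)) (string, error) {
	if instanceKey != nil {
		return instanceKey() // exporter-specific, e.g. the scraped system's address
	}
	return os.Hostname() // per-agent value: breaks hash agreement across peers
}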

wildum avatar Apr 04 '24 12:04 wildum

I think the bug in the clustering can be fixed if the SNMP exporter's InstanceKey function is changed to return the addresses of the targets which are actually being scraped, similarly to the MSSQL exporter's InstanceKey function. I could raise a PR for this, since I'm on-call? And then we could close this one?
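For illustration, a change along these lines (simplified, hypothetical types; the real component API differs) would derive one instance value per scraped target instead of using the agent's host:

package snmp

// SNMPTarget is a simplified stand-in for the component's target type.
type SNMPTarget struct {
	Name    string // e.g. "network_switch_1"
	Address string // e.g. "192.168.1.2"
}

// instanceKeys returns one instance label value per scraped target, derived
// from the target itself rather than from the host the agent runs on, so
// every cluster peer computes the same label set.
func instanceKeys(targets []SNMPTarget) []string {
	keys := make([]string, 0, len(targets))
	for _, t := range targets {
		keys = append(keys, t.Address)
	}
	return keys
}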

ptodev avatar Apr 04 '24 12:04 ptodev

That would be a start, but if we want the same behavior as this PR, we need to make sure that all exporters (except ones like self, process, and unix, maybe others?) set the instance to the target's host and not to the agent's.

wildum avatar Apr 04 '24 13:04 wildum

This PR has not had any activity in the past 30 days, so the needs-attention label has been added to it. If you do not have enough time to follow up on this PR or you think it's no longer relevant, consider closing it. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your PR will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!

github-actions[bot] avatar May 10 '24 00:05 github-actions[bot]

After taking a closer look, I'm also not in favour of this solution, but I 100% agree we should do something to address this problem.

Thinking about this, it seems to me that instance is a problem because we run embedded exporters in a cluster of N collectors. This means we run multiple instances of these exporters, which may also be undesirable due to the increased load on the systems from which metrics are exported. So I think we should provide a means to run (or scrape) a given exporter on only one instance at a time; then including the instance label would be harmless, as it merely denotes which Alloy instance did the scraping. This would be accomplished by more fine-grained component scheduling.

In case we don't want the instance label to denote which Alloy instance in a cluster did the scraping, we could opt to not set the instance label at all and let Prometheus set it automatically. I wouldn't recommend this option, as AFAICT that would be a localhost address with a random port on which the given exporter component exposes its metrics.

To make this less error-prone, we could also check in the exporter components whether we are running in a cluster, and warn the user?

thampiotr avatar Jun 06 '24 10:06 thampiotr

@thampiotr If a prometheus.exporter only does work when it's scraped, then running it in clustered mode won't be a performance issue. I don't know how most exporters are designed in practice, but I suppose most would only do work when they are scraped. If there is only one collector which scrapes them, all will be good and there won't be a performance hit.

I think we need to go through all prometheus.exporter components and make sure that the instance label is useful, and that it also works with clustering. Maybe we can create a tracker issue for that which lists all exporter components. And we can gradually create component-specific "sub-issues" which look into each component's instance label.

ptodev avatar Jun 07 '24 15:06 ptodev

Thanks @ptodev for the discussion (offline). I'm going to summarise it here:

  • The performance is not the biggest concern, but a nice-to-have.
  • We both prefer to have a solution that is general to all exporters and doesn't require going through all the exporters one-by-one.
  • With a standalone exporter, and with Agent Flow Mode and Alloy without clustering, the instance label will typically be set to the host where the exporter runs, which may be different from where the thing being exported runs (for example, MySQL).
  • We want to focus on fixing the clustering issue here - making the instance label more useful (for example, be set to the hostname of the thing being exported) can be done separately.
  • Since fine-grained component scheduling is something we will need for other use-cases, such as kubernetes events, we agree that it's a good idea to use that planned feature to address this problem.
  • In the meantime, users can work around the problem quite easily when using clustering by relabelling the instance label to the same value on all instances in the cluster, for example:
discovery.relabel "replace_instance" {
  targets = discovery.file.targets.targets

  rule {
    // Rewrite the per-host "instance" label to a constant so that every
    // cluster peer computes the same hash for the same target.
    action        = "replace"
    source_labels = ["instance"]
    target_label  = "instance"
    replacement   = "alloy-cluster"
  }
}

With this, I will close this PR for now and add a note to the fine-grained scheduling issue.

thampiotr avatar Jun 10 '24 09:06 thampiotr