Grafana Agent Operator replication doesn't work for metrics
Hello team, please help with the issue below.
Summary: Grafana Agent replication doesn't work.
Description:
I made a change at the Kubernetes level that caused the Grafana Agent pods to restart. When the first replica went down, we saw a loss of metrics, and the second replica did not pick up the work.
To test this, I added a node selector to the Grafana Agent so that one pod stays down for a longer period of time, and I made sure that the node group doesn't have enough memory.
What happened is that replica-1 went down while replica-0 kept running. When I checked in Grafana, I saw that we are losing metrics because replica-0 is not scraping them.
Below are the queries I used to verify this.
The query below returns the jobs that are missing now but were present 12 hours earlier. I also verified the same with absent() queries (see the example after the query).
(sum by(job) (scrape_samples_scraped offset 12h)) unless (sum by(job) (scrape_samples_scraped))
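For example, a per-job absent() check looks like the following (the job name is only a placeholder for one of the jobs the query above reports as missing):
absent(scrape_samples_scraped{job="my-namespace/my-servicemonitor"})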
Steps to reproduce:
- Spin up the Grafana Agent with 2 replicas and scrape some metrics (a rough sketch of the operator manifest is shown below).
- Bring down one pod and then query Grafana with the query above. You will see that some metrics are missing even though the other replica is running.
Grafana Agent image: grafana/agent:v0.30.2. Note: we are using the Grafana Agent Operator and all the CRDs (MetricsInstance, etc.).
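For reference, a minimal sketch of the kind of GrafanaAgent resource used for the reproduction (names and namespaces are placeholders, and the field names follow the operator CRDs as I understand them for v0.30.x, so they may differ in other versions):

```yaml
# Hypothetical GrafanaAgent resource for the reproduction above.
# Names/namespaces are placeholders; verify field names against your operator version.
apiVersion: monitoring.grafana.com/v1alpha1
kind: GrafanaAgent
metadata:
  name: grafana-agent
  namespace: monitoring
spec:
  image: grafana/agent:v0.30.2
  metrics:
    replicas: 2                      # two agent pods scraping the same targets
    instanceSelector:
      matchLabels:
        agent: grafana-agent-metrics # selects the MetricsInstance resources
```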
hey @amanpruthi !
It is unclear to me whether your grafana-agent instances purposed for metrics are actually setting the __replica__ label that the HA tracker needs in order to fail over to the other replica. You can check whether that is the case via the /distributor/ha_tracker endpoint on the distributors (port-forwarding, or using an ingress if you have access to one).
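For illustration, this is the kind of external label pair the Cortex/Mimir HA tracker keys on (values are only examples; in a static-mode agent config these would sit under metrics.global.external_labels, while how to get the operator to set them is exactly what is in question here):

```yaml
# Illustrative external labels on the series written by each agent replica.
cluster: prod            # same value on both replicas of the HA pair
__replica__: replica-0   # differs per replica; the HA tracker dedups on it and drops it
```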
We suffered a similar problem, but in our case, even after enabling HA tracking properly, the fact that a single cluster label is used for all the shards turns all HA shards into one (when you actually want as many HA pairs as there are shards, so as not to have any metric loss / data gap). We are considering using a different, "meta" cluster label (say __cluster__, to be dropped after ingestion) that we can use to identify each of the remote-write HA pairs we want to establish (for the cases where a cluster needs more than one shard in order to scale). A sketch of the idea follows.
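To make the idea concrete, a rough sketch of the labels one shard's HA pair would carry (the __cluster__ meta-label name and all values here are only illustrative):

```yaml
# Replica A of shard 0 (replica B would be identical except for __replica__).
cluster: prod              # normal topology label, kept for querying
__cluster__: prod/shard-0  # meta label the HA tracker would key on, dropped after ingestion
__replica__: replica-a     # distinguishes the two members of the pair
```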
@dgonzalezruiz honestly, I believe HA isn't supported right now by Grafana Agent. What you would need is support for running multiple replicas (even when the deployment is set to DaemonSet), and each replica should set a replica label so the backend (Cortex/Mimir) can properly handle HA.
When enabling clustering, it only divides the load within a single replica...
What we would need is support similar to the Prometheus fields prometheusExternalLabelName and replicaExternalLabelName:
FIELD: prometheusExternalLabelName <string>
DESCRIPTION:
Name of Prometheus external label used to denote the Prometheus instance
name. The external label will _not_ be added when the field is set to the
empty string (`""`).
Default: "prometheus"
FIELD: replicaExternalLabelName <string>
DESCRIPTION:
Name of Prometheus external label used to denote the replica name. The
external label will _not_ be added when the field is set to the empty string
(`""`).
Default: "prometheus_replica"
For what it is worth, I ended up being able to solve this issue by changing the cluster label used by the remote Cortex/Mimir cluster for HA tracking to another, hidden "meta" label, which I named __cluster__ and which I also set the distributors to remove before ingestion (keeping the existing, normal cluster label for metric topology/querying).
Then, I set each Grafana Agent shard to use its shard value as the replica; this allowed any HA pair (a replica pair of shards, or a normal single Prometheus) remote-writing in my org to be received as a unique pair, hence allowing failover to the other replica of each shard's metrics in the normal HA manner.
I would say that unless replication for Grafana Agent shards is set up that way, it is completely useless for HA purposes. Hope this helps someone.