grafana-agent does not use the DNS TTL for the remote_write endpoint
Hi,
I'm not sure whether this issue relates to grafana-agent itself or to the underlying Prometheus remote_write library.
I have a (non-persistent) load balancer between grafana-agent and Mimir.
My relevant agent config is:
remote_write:
  - url: https://lb.example.com/api/v1/push
    headers:
      X-Scope-OrgID: example
    queue_config:
      min_backoff: 5s
      max_backoff: 30s
I tried setting the scrape_interval to 10s for testing (it's normally 60s).
My workflow enforces periodic load balancer recreation (I can't do anything about that), and when that happens the IP address of lb.example.com changes. The workflow is along these lines: a new LB is created, the DNS record for lb.example.com is updated to point at it, and the old LB is left in place for a while in case a rollback is needed. However, grafana-agent keeps connecting to the old IP indefinitely (the TTL for lb.example.com is 60 seconds). Even if the old load balancer comes back with 503s, the agent continues to try to use it.
Not even HUPing the process helps: after sending SIGHUP I can see in the agent logs that a config reload was requested, but tcpdump still shows connections to the old IPs. The source port on the agent side is the same for each push, which makes me believe the connection could be persistent (kept open for performance?).
The only way to force the agent to use the new load balancer is to restart the agent completely.
The documentation doesn't seem to mention any way to tweak this behaviour. Would it be possible to force a DNS refresh without restarting the agent?
Thank you.
Some further notes:
If I delete the old load balancer, it returns 503s for a few minutes and the agent retries. When the load balancer eventually goes away completely (it drops off DNS and its IPs are returned), the agent refreshes DNS and starts using the new load balancer.
I'm not sure whether a TCP RST is what triggers the DNS refresh, but it seems a plausible explanation.
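This would be consistent with how Go's HTTP client behaves in general: DNS is only looked up when a new connection is dialed, and a pooled keep-alive connection is reused without any further lookups. A small standalone sketch (nothing agent-specific; example.com is just a stand-in) that makes this visible with httptrace:

package main

import (
	"fmt"
	"net/http"
	"net/http/httptrace"
)

func main() {
	// Log DNS lookups and whether each request reused a pooled connection.
	trace := &httptrace.ClientTrace{
		DNSStart: func(info httptrace.DNSStartInfo) {
			fmt.Println("DNS lookup for", info.Host)
		},
		GotConn: func(info httptrace.GotConnInfo) {
			fmt.Println("connection reused:", info.Reused)
		},
	}

	client := &http.Client{}
	for i := 0; i < 3; i++ {
		req, err := http.NewRequest("GET", "https://example.com/", nil)
		if err != nil {
			panic(err)
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := client.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
	}
	// Typically only the first request reports a DNS lookup; the later
	// requests report a reused connection, so the resolver (and any updated
	// DNS record) is never consulted until that connection is closed.
}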
A potential workaround I just found: I'm using AWS, so I tried setting the connection idle timeout on a classic ELB to 1s, and this appears to have fixed the problem (the ELB presumably closes the connection one second after the metrics are sent, forcing the agent to open a new connection and re-resolve DNS).
As a side effect, each connection is now a new one (shown by the changing source port on the client side). This will definitely have some performance implications, but I can't really quantify them just yet.
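In case it helps, the same change can be scripted rather than clicked through the console; a rough sketch using the classic ELB API from aws-sdk-go (the load balancer name is a placeholder, and the 1-second timeout mirrors what I tested):

package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elb"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := elb.New(sess)

	// Drop the classic ELB's idle timeout to 1 second so it closes the
	// keep-alive connection shortly after each push, forcing the agent
	// to dial (and resolve DNS) again on the next one.
	_, err := svc.ModifyLoadBalancerAttributes(&elb.ModifyLoadBalancerAttributesInput{
		LoadBalancerName: aws.String("my-classic-elb"), // placeholder
		LoadBalancerAttributes: &elb.LoadBalancerAttributes{
			ConnectionSettings: &elb.ConnectionSettings{
				IdleTimeout: aws.Int64(1),
			},
		},
	})
	if err != nil {
		panic(err)
	}
}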
Right, currently remote_write from Prometheus keeps the connection to the server alive to avoid needing to do handshakes too frequently, which is especially useful when the throughput is really high.
It'd be technically possible for us to expose a setting to disable keep-alives for all remote_write connections, but there'd have to be an upstream change to Prometheus for it to be configurable on an endpoint-by-endpoint basis. Doing it globally feels a bit strange, so I'd personally prefer this to be an upstream change.
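For illustration only, a minimal sketch (not the agent's or Prometheus's actual code) of what such a setting would toggle at the Go net/http layer; newRemoteWriteClient and its disableKeepAlives parameter are hypothetical names:

package main

import (
	"net/http"
	"time"
)

// newRemoteWriteClient is a hypothetical helper showing the transport knobs
// involved. With DisableKeepAlives set, every push opens a fresh TCP
// connection, which also forces a fresh DNS lookup; a short IdleConnTimeout
// is a softer variant that drops pooled connections soon after they go idle.
func newRemoteWriteClient(disableKeepAlives bool) *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			DisableKeepAlives: disableKeepAlives,
			IdleConnTimeout:   90 * time.Second, // Go's default; lower it to recycle idle connections sooner
		},
	}
}

func main() {
	client := newRemoteWriteClient(true)
	// Each push now dials (and resolves DNS for) the endpoint anew,
	// at the cost of a TCP/TLS handshake per request.
	resp, err := client.Post("https://lb.example.com/api/v1/push", "application/x-protobuf", nil)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}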