Failing Loki remote backend prevents working backends from receiving data regularly when forwarding logs to multiple Loki clients at once
What's wrong?
I've noticed that in a setup forwarding logs to multiple Loki clients, if one of the clients starts failing for some reason (e.g. no process listening on the specified port), it starves the other, working Loki endpoints of data until the failing client exhausts all of its max_retries (default = 10). Once that retry loop resets, the same issue repeats itself.
In the end, the working clients only receive data every 6 minutes or so, depending on what max_period is set to (default = 5m). This also leads to "gaps" in the Grafana dashboards when looking at data for those clients.
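To put a rough number on how long a single failing client can hold things up, here is a small sketch of the retry schedule as I understand it. The min_period of 500ms is my assumption; max_retries = 10 and max_period = 5m are the defaults mentioned above, and jitter is ignored:
```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch of the client's exponential backoff schedule, assuming
// min_period = 500ms (my assumption) together with the default max_retries = 10
// and max_period = 5m, and ignoring jitter.
func main() {
	const (
		minPeriod  = 500 * time.Millisecond
		maxPeriod  = 5 * time.Minute
		maxRetries = 10
	)

	total := time.Duration(0)
	wait := minPeriod
	for i := 1; i <= maxRetries; i++ {
		fmt.Printf("retry %2d: wait ~%v\n", i, wait)
		total += wait
		wait *= 2
		if wait > maxPeriod {
			wait = maxPeriod
		}
	}
	// Without jitter this sums to roughly 8.5 minutes for one failed batch,
	// i.e. the same order of magnitude as the multi-minute gaps described above.
	fmt.Printf("total time spent retrying one batch: ~%v\n", total)
}
```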
Steps to reproduce
Take a look at this nominal config -
./agent-local-config.yaml
```yaml
server:
  log_level: info
logs:
  configs:
    - clients:
        - tls_config:
            insecure_skip_verify: true
          basic_auth:
            password: xxxx
            username: loki
          url: https://logs.my-loki-instance.net/loki/api/v1/push
        - tls_config:
            insecure_skip_verify: true
          url: https://localhost:13100/loki/api/v1/push
          # backoff_config:
          #   # max_retries: 10
          #   max_period: 10s
      name: default
      positions:
        filename: /data/grafana_agent/log-positions.yml
      scrape_configs:
        - job_name: nginx
          pipeline_stages:
            - regex:
                expression: (?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^]]+)\]
                  "(?P<request_method>[A-Z]+) (?P<request_url>[^? ]+)[?]*(?P<request_url_params>\S*)
                  (?P<request_http_version>[^"]+)" (?P<status_code>\d+) (?P<body_bytes_sent>\d+)
                  "(?P<http_referer>[^"]+)" "(?P<http_user_agent>[^"]+)" "(?P<http_x_forwarded_for>[^"]+)"
            - labels:
                remote_user: null
                request_http_version: null
                request_method: null
                request_url: null
                status_code: null
            - timestamp:
                format: 02/Jan/2006:15:04:05 -0700
                source: time_local
          static_configs:
            - labels:
                __path__: /var/log/nginx.log
                instance: dist1.foobar.com
                job: nginx
              targets:
                - dist1.foobar.com
```
Start the agent as
```text
# /tmp/agent: ./grafana-agent --config.file ./agent-local-config.yaml
```
Now, let's assume that the localhost:13100 instance is down for some reason. In that case I expected the other endpoint (logs.my-loki-instance) to keep receiving data at the configured scrape interval (60s), but that doesn't happen, as explained above.
System information
Linux 6.5.0-15-generic
Software version
Grafana Agent 0.35.0 and current master at the time of writing
Configuration
```yaml
server:
  log_level: info
logs:
  configs:
    - clients:
        - tls_config:
            insecure_skip_verify: true
          basic_auth:
            password: xxxx
            username: loki
          url: https://logs.my-loki-instance.net/loki/api/v1/push
        - tls_config:
            insecure_skip_verify: true
          url: https://localhost:13100/loki/api/v1/push
          # backoff_config:
          #   # max_retries: 10
          #   max_period: 10s
      name: default
      positions:
        filename: /data/grafana_agent/log-positions.yml
      scrape_configs:
        - job_name: nginx
          pipeline_stages:
            - regex:
                expression: (?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^]]+)\]
                  "(?P<request_method>[A-Z]+) (?P<request_url>[^? ]+)[?]*(?P<request_url_params>\S*)
                  (?P<request_http_version>[^"]+)" (?P<status_code>\d+) (?P<body_bytes_sent>\d+)
                  "(?P<http_referer>[^"]+)" "(?P<http_user_agent>[^"]+)" "(?P<http_x_forwarded_for>[^"]+)"
            - labels:
                remote_user: null
                request_http_version: null
                request_method: null
                request_url: null
                status_code: null
            - timestamp:
                format: 02/Jan/2006:15:04:05 -0700
                source: time_local
          static_configs:
            - labels:
                __path__: /var/log/nginx.log
                instance: dist1.foobar.com
                job: nginx
              targets:
                - dist1.foobar.com
```
Logs
```text
Mar 18 20:28:22 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:22.36522416Z caller=client.go:430 level=error component=logs logs_config=default component=client host=localhost:13100 msg="final error sending batch" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:22 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:22.507835563Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:23 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:23.271720016Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:25 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:25.123445134Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:28 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:28.795872338Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:35 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:35.337596441Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:51 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:51.028375765Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:29:08 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:29:08.033159675Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:29:40 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:29:40.383066904Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:31:09 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:31:09.086003766Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
```
From what I can tell with my limited knowledge of Go and channels, it appears that there are two goroutines in this case (one for localhost:13100, the other for logs.my-loki-instance.net) in the grafana-agent process. Both of them read from the same channel (api.Entry), which is populated in the promtail package, in the readLines() function of grafana/clients/pkg/promtail/targets/file/tailer.go. As the localhost:13100 goroutine gets blocked by falling into retries and exponential backoff, it delays the other (my-loki) goroutine from receiving data too; at least my tests confirm this.
Is this because the underlying api.Entry channel is "full" while one of the two receivers is tied up elsewhere? My tests show that as soon as the failing goroutine unblocks after exhausting its retries, both receivers receive data almost immediately.
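To make the failure mode concrete, here is a minimal, self-contained Go model of what I think is happening; this is not the agent's actual code, and the channel-per-client fan-out here is an assumption on my part. A producer hands every entry to both clients in turn; once the failing client stops draining its channel and that channel's buffer fills, the fan-out send blocks and the healthy client is starved as well:
```go
package main

import (
	"fmt"
	"time"
)

// Simplified model of the suspected behaviour, not the agent's actual code:
// a single producer fans every entry out to one channel per client. A client
// stuck in retry/backoff stops draining its channel; once that bounded channel
// fills up, the fan-out send blocks, and the healthy client stops receiving
// new entries as well.
func main() {
	healthy := make(chan int, 2) // small buffers so the stall shows up quickly
	failing := make(chan int, 2)

	// Healthy client: drains and "pushes" entries immediately.
	go func() {
		for e := range healthy {
			fmt.Println("healthy client pushed entry", e)
		}
	}()

	// Failing client: takes one entry, then gets stuck retrying a dead endpoint.
	go func() {
		<-failing
		time.Sleep(time.Hour) // stand-in for exponential backoff on "connection refused"
	}()

	// Fan-out producer: every entry must be handed to both clients in turn.
	go func() {
		for i := 0; ; i++ {
			healthy <- i
			failing <- i // blocks for good once the failing client's buffer is full
		}
	}()

	// After a moment the failing client's channel is full, the producer is
	// blocked on it, and the healthy client no longer receives anything even
	// though its own endpoint is perfectly reachable.
	time.Sleep(2 * time.Second)
	fmt.Println("producer is now blocked on the failing client; the healthy client is starved")
}
```
Whether the agent really uses one shared channel or one channel per client, the observable effect would be the same: a receiver stuck in backoff eventually applies backpressure to the producer, which matches how everything flows again the moment the failing client exhausts its retries.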
Hi there :wave:
On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.
To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)
This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!