Rotated log files promtail error
Hello, I have the following problem and I wanted to know whether it is possible to solve it. I found this entry https://github.com/grafana/loki/issues/3108 but it is not clear to me whether it was fixed or not. I am attaching my logs and promtail settings.
Jul 2 12:37:31 mendieta promtail: level=info ts=2021-07-02T15:37:31.741999535Z caller=tailer.go:173 component=tailer msg="skipping update of position for a file which does not currently exist"
Jul 2 12:37:32 mendieta promtail: level=info ts=2021-07-02T15:37:32.115303193Z caller=tailer.go:173 component=tailer msg="skipping update of position for a file which does not currently exist"
Jul 2 12:37:32 mendieta promtail: level=info ts=2021-07-02T15:37:32.358223961Z caller=tailer.go:173 component=tailer msg="skipping update of position for a file which does not currently exist"
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://grafana.pert.io:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - labels:
          job: logstashlogs # A job label is fairly standard in prometheus and useful for linking metrics and logs.
          host: ${HOSTNAME}
          __path__: /var/log/logstash/log*.log

promtail, version 2.2.1
Regards.
We see this happen consistently as well with our promtail instance, same version.
Are there any updates on this issue? We are seeing the same thing
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
We are also experiencing this issue, any updates?
Believe I'm hitting this as well, using the Helm charts for promtail and loki. I've only noticed this on one particular pod, with the same error "skipping update of position...", then followed by "position timer: exited".
I've tried looking through the config, but I'm still new to loki/promtail and nothing obvious is jumping out to me.
Edit: I tried killing the pod in question along with promtail, and I think that might have resolved it. After removing the problematic pod, promtail detects the change with a "seeked ..." log line.
Did some quick investigation.
The message
msg="skipping update of position for a file which does not currently exist"
comes from the promtail file tailer when the underlying file doesn't exist (which is odd), mainly because its dependency tail package (forked from hpcloud/tail) closes the file and sets its file handle to nil.
My guess is that during log rotation there may be a race between reopening the file and tailing it?
I think I can try reproducing the issue sometime this week if I have the bandwidth!
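To make that mechanism concrete, here is a minimal, self-contained Go sketch of the pattern described above. It is illustrative only, not the actual promtail or tail source; fakeTailer, size, and markPosition are hypothetical names. The idea is that a position update gets skipped when the tailed file's handle has been dropped, e.g. around a rotation.

```go
// Illustrative sketch only (not the actual promtail source): a position
// update is skipped when the file handle has been dropped mid-rotation.
package main

import (
	"errors"
	"fmt"
	"os"
)

// fakeTailer stands in for promtail's per-file tailer; file may be nil
// while the underlying file is being rotated or reopened.
type fakeTailer struct {
	path string
	file *os.File
}

// size reports os.ErrNotExist when the handle is gone, instead of a size.
func (t *fakeTailer) size() (int64, error) {
	if t.file == nil {
		return 0, os.ErrNotExist
	}
	fi, err := t.file.Stat()
	if err != nil {
		return 0, err
	}
	return fi.Size(), nil
}

// markPosition skips persisting a position for a file that is currently
// gone; this is where a "skipping update of position for a file which
// does not currently exist" style message would be logged.
func (t *fakeTailer) markPosition() error {
	sz, err := t.size()
	if err != nil {
		if errors.Is(err, os.ErrNotExist) {
			fmt.Printf("skipping update of position for %s which does not currently exist\n", t.path)
			return nil
		}
		return err
	}
	fmt.Printf("would persist position for %s (size=%d)\n", t.path, sz)
	return nil
}

func main() {
	// Handle dropped mid-rotation: the update is skipped rather than failing.
	t := &fakeTailer{path: "/var/log/logstash/log-example.log"}
	_ = t.markPosition()
}
```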
https://github.com/grafana/loki/issues/3108#issuecomment-1054916428
@cyriltovena I have the same issue.
After 18:33:31.887, no logs have been reported, and the promtail logs indicate it is waiting for logs to appear, but the application's log output is normal.
I got the same error in the log, "skipping update of position for a file which does not currently exist", because of abandoned log files under Windows (promtail, version 2.5.0 (branch: HEAD, revision: 2d9d0ee23)).
I can confirm this is regularly an issue with promtail 2.5.0 on Windows.
We have the same issue here; indeed, the files are renamed by another application.
We are seeing this on two promtail pods (out of ten) on one of our clusters, and those two promtail pods also have really high CPU usage (compared to the other promtail pods and to what we'd expect). Does anyone else seeing this issue notice a correlation with CPU usage?
Edited to add:
Promtail Helm chart version: 5.1.0
Promtail version: 2.5.0
We are on Linux nodes.
Today I updated our Promtail pods to version 2.6.1 using the latest version of the Helm chart and looked again at this issue.
While looking at our logs, I noticed our loki-write pods (in a StatefulSet) were getting overloaded by requests from the promtail clients. It turns out I had too few loki-write nodes, so I scaled up the StatefulSet. After that, so far this morning both loki-write and Promtail look to be performing better.
I have a guess (a complete shot in the dark, so take it for what it's worth) that our promtail pods were not making progress because there weren't enough loki-write nodes, and that may be why we were seeing these log messages repeated for long stretches.
Now that our Loki pods are looking healthier, I see the log messages discussed here whenever a pod terminates, followed by messages that look to be removing this tailer from the promtail tailer manager:
level=info ts=2022-08-09T18:31:04.293828288Z caller=filetarget.go:161 msg="filetarget: watcher closed, tailer stopped, positions saved" path=/var/log/pods/*c81337cd-312b-4d45-a759-eb10775c843b/linkerd-proxy/*.log
level=info ts=2022-08-09T18:31:04.293801388Z caller=tailer.go:240 component=tailer msg="stopped tailing file" path=/var/log/pods/finwise_finwise-celery-7cf9d46c9d-sj5nr_c81337cd-312b-4d45-a759-eb10775c843b/linkerd-proxy/0.log
level=info ts=2022-08-09T18:31:04.293778888Z caller=tailer.go:152 component=tailer msg="tail routine: exited" path=/var/log/pods/finwise_finwise-celery-7cf9d46c9d-sj5nr_c81337cd-312b-4d45-a759-eb10775c843b/linkerd-proxy/0.log
level=info ts=2022-08-09T18:31:04.293758088Z caller=tailer.go:159 component=tailer msg="tail routine: tail channel closed, stopping tailer" path=/var/log/pods/finwise_finwise-celery-7cf9d46c9d-sj5nr_c81337cd-312b-4d45-a759-eb10775c843b/linkerd-proxy/0.log reason=null
I will keep looking at it. Apologies for the uninformed conjecture but I figured it may be worthwhile to someone?
@kavirajk Any progress on this? It still happens in our setup.
Don't know. Maybe this is related? https://github.com/grafana/tail/issues/9 and the closing pull request https://github.com/grafana/tail/pull/10.
In our setup promtail consumes logs from an NFS mount.
Same here, on EKS with Linux nodes. The agent stops sending new logs when this happens; it must be forcibly terminated in order for a new pod to get rescheduled.
I am using 2.7.0. Sometimes when a log file is removed and then recreated, promtail gets stuck and no logs are read afterward.
The promtail log shows the following sequence:
ts=2022-12-20T06:39:51.126377231Z caller=log.go:168 level=info msg="Waiting for ****.log to appear..."
...
level=info ts=2022-12-20T06:39:53.846332941Z caller=filetargetmanager.go:175 msg="received file watcher event" name=****.log op=CREATE
It felt unbelievable to find this code used by promtail when I dug into https://github.com/grafana/loki/blob/88f04beb4ef39fa57db20ea21d546fe0a2675fb8/vendor/github.com/grafana/tail/tail.go#L254-L268
The retry counter never counts down to zero...
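For illustration, here is a hedged Go sketch of the kind of countdown the commenter describes; names such as reopenWithCountdown and sameFileAsBefore are hypothetical, and this is not the actual grafana/tail code. Without the retries decrement, the loop can spin forever when the reopened file keeps matching the old handle; with it, the tailer eventually gives up with an error like the one in the logs below.

```go
// Hedged sketch of a reopen loop with a working retry countdown,
// not the actual grafana/tail implementation.
package main

import (
	"errors"
	"fmt"
	"time"
)

// sameFileAsBefore is a hypothetical predicate standing in for the
// check that the reopened file is still the same as the pre-rotation one.
func reopenWithCountdown(sameFileAsBefore func() bool, pollInterval time.Duration) error {
	retries := 20
	for {
		if !sameFileAsBefore() {
			// A genuinely new file appeared; tailing can resume.
			return nil
		}
		retries-- // the decrement that was reportedly missing in the buggy version
		if retries <= 0 {
			return errors.New("gave up trying to reopen log file with a different handle")
		}
		time.Sleep(pollInterval)
	}
}

func main() {
	// Simulate a rotation that never produces a new file handle: the
	// countdown makes the call return instead of looping forever.
	err := reopenWithCountdown(func() bool { return true }, 10*time.Millisecond)
	fmt.Println("reopen result:", err)
}
```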
After fixing the silly retry-countdown bug, the following logs show up:
Dec 21 06:00:22 iZuf6h6buntpmoj3kqqdrgZ promtail[17833]: level=info ts=2022-12-20T22:00:22.259371564Z caller=tailer.go:159 component=tailer msg="tail routine: tail channel closed, stopping tailer" path=****.log reason="gave up trying to reopen log file with a different handle"
Dec 21 06:00:22 iZuf6h6buntpmoj3kqqdrgZ promtail[17833]: level=info ts=2022-12-20T22:00:22.383426439Z caller=tailer.go:159 component=tailer msg="tail routine: tail channel closed, stopping tailer" path=****.log reason="gave up trying to reopen log file with a different handle"
Meaning that somehow promtail could be stuck in a loop without the fix, and the countdown/timeout is necessary for promtail to give up and retry. I have run the fixed code for more than 24 hours, with the log file deleted and recreated several times, and every log stream seems OK.
@kavirajk any chance you could help fix this by getting https://github.com/grafana/tail/pull/16 merged?
https://github.com/grafana/tail/pull/16 solved some failure modes in this code, but there is also an unrelated goroutine leak, which is addressed by https://github.com/grafana/loki/pull/9495. Please let us know if you are still experiencing the issues after these fixes.
Is this still a problem? It sounds like the PRs @thampiotr mentioned fixed it?
The fixes above have resolved the same symptoms in our internal cluster, so I believe it can be closed.