
Rotated log files promtail error

Open fernandovillamaria opened this issue 2 years ago • 13 comments

Hello, I have the following problem and I wanted to know if it is possible to solve it. I found this entry https://github.com/grafana/loki/issues/3108, but I am not clear whether it was ever fixed. I am attaching my logs and promtail settings.

Jul 2 12:37:31 mendieta promtail: level=info ts=2021-07-02T15:37:31.741999535Z caller=tailer.go:173 component=tailer msg="skipping update of position for a file which does not currently exist"
Jul 2 12:37:32 mendieta promtail: level=info ts=2021-07-02T15:37:32.115303193Z caller=tailer.go:173 component=tailer msg="skipping update of position for a file which does not currently exist"
Jul 2 12:37:32 mendieta promtail: level=info ts=2021-07-02T15:37:32.358223961Z caller=tailer.go:173 component=tailer msg="skipping update of position for a file which does not currently exist"

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://grafana.pert.io:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - labels:
          job: logstashlogs # A job label is fairly standard in prometheus and useful for linking metrics and logs.
          host: ${HOSTNAME}
          __path__: /var/log/logstash/log*.log

promtail, version 2.2.1

Regards.

fernandovillamaria avatar Jul 12 '21 17:07 fernandovillamaria

We see this happen consistently as well with our promtail instance, same version.

Kampe avatar Jul 15 '21 17:07 Kampe

Are there any updates on this issue? We are seeing the same thing

hwshea18 avatar Jul 19 '21 13:07 hwshea18

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 21 '21 03:08 stale[bot]

We are also experiencing this issue, any updates?

cesarmesones avatar Sep 27 '21 16:09 cesarmesones

Believe I'm hitting this as well, using the Helm charts for promtail and loki. I've only noticed this on one particular pod, with the same error "skipping update of position...", then followed by "position timer: exited".

I've tried looking through the config, but I'm still new to loki/promtail and nothing obvious is jumping out to me.

Edit: I tried killing the pod in question along with promtail, and I think that might have resolved it. After removing the problematic pod, promtail detects the change with "seeked offset:0", followed by "tail routine: started". I thought I tried this before, but it's hard to be sure; I'll reply back if it pops up again.

taintedkernel avatar Oct 10 '21 04:10 taintedkernel

Did some quick study.

msg="skipping update of position for a file which does not currently exist" comes from the promtail file tailer when the underlying file doesn't exist (which is weird), mainly because its dependency `tail` package closes the file and sets its file handle to nil.
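For illustration, a minimal sketch (hypothetical names, not the actual promtail code) of that kind of guard over a simple in-memory positions store:

```go
package main

import (
	"fmt"
	"os"
)

// positions is a hypothetical stand-in for promtail's positions store.
type positions map[string]int64

// updatePosition skips the position update when the tailed file no longer
// exists, e.g. because it was rotated away and the handle was closed.
func updatePosition(pos positions, path string, offset int64) {
	if _, err := os.Stat(path); os.IsNotExist(err) {
		fmt.Printf("skipping update of position for a file which does not currently exist: %s\n", path)
		return
	}
	pos[path] = offset
}

func main() {
	pos := positions{}
	updatePosition(pos, "/var/log/does-not-exist.log", 42)
}
```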

My guess is that during log rotation there may be a race between the file reopen and the tailing?

I think I can try reproducing the issue sometime this week if I have the bandwidth!
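In case it helps with reproduction, a rough sketch of rename-style rotation that might exercise such a race, assuming promtail is pointed at the hypothetical path /tmp/rotate-test/app.log:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	const dir = "/tmp/rotate-test"
	const path = dir + "/app.log"
	if err := os.MkdirAll(dir, 0o755); err != nil {
		panic(err)
	}
	for i := 0; ; i++ {
		f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
		if err != nil {
			panic(err)
		}
		// Write a burst of lines so promtail has something to tail.
		for j := 0; j < 100; j++ {
			fmt.Fprintf(f, "line %d-%d\n", i, j)
			time.Sleep(10 * time.Millisecond)
		}
		f.Close()
		// Rename-style rotation: the file promtail has open disappears from
		// its original path and a fresh file replaces it on the next
		// iteration, which is where a reopen race could bite.
		if err := os.Rename(path, fmt.Sprintf("%s.%d", path, i)); err != nil {
			panic(err)
		}
	}
}
```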

kavirajk avatar Oct 11 '21 09:10 kavirajk

https://github.com/grafana/loki/issues/3108#issuecomment-1054916428

yangvipguang avatar Mar 01 '22 02:03 yangvipguang

@cyriltovena I have the same issue.

After 18:33:31.887 no logs have been reported, and the promtail logs indicate it is waiting for logs to appear, but the application's log output is normal.

yangvipguang avatar Mar 01 '22 02:03 yangvipguang

I got the same error in the log, "skipping update of position for a file which does not currently exist", because of abandoned log files under Windows (promtail, version 2.5.0 (branch: HEAD, revision: 2d9d0ee23)).

pehon avatar May 24 '22 15:05 pehon

> I got the same error in the log, "skipping update of position for a file which does not currently exist", because of abandoned log files under Windows (promtail, version 2.5.0 (branch: HEAD, revision: 2d9d0ee))

I can confirm this is regularly an issue with promtail 2.5.0 on Windows.

Mario-Hofstaetter avatar May 25 '22 07:05 Mario-Hofstaetter

We have the same issue here; indeed, the files are renamed by another application.

iris-xie avatar Jun 15 '22 01:06 iris-xie

We are seeing this on two promtail pods (out of ten) on one of our clusters, and those two promtail pods also have really high CPU usage (compared to other promtail pods and to what we'd expect). Does anyone else seeing this issue notice any correlation with CPU usage?

Edited to add:

Promtail Helm chart meta:
- Chart version: 5.1.0
- Promtail version: 2.5.0

We are on linux nodes.

erewok avatar Jul 06 '22 15:07 erewok

Today I updated our Promtail pods to version 2.6.1 using the latest version of the Helm chart and looked again at this issue.

While looking at our logs, I noticed our loki-write pods (in a statefulset) were getting overloaded by requests from the promtail clients. It turns out I had too few loki-write nodes, so I scaled up the statefulset. After that, so far this morning both Loki-write and Promtail look to be performing better.

I have a guess (a complete shot in the dark, so take it for what it's worth) that our promtail pods were not proceeding because there weren't enough loki-write nodes, and that may be why we were seeing these log messages repeated for long stretches.

Now that our Loki pods are looking more healthy, I see the log messages discussed here whenever a pod terminates followed by messages that look to be removing this tailer from the promtail tailer manager:

level=info ts=2022-08-09T18:31:04.293828288Z caller=filetarget.go:161 msg="filetarget: watcher closed, tailer stopped, positions saved" path=/var/log/pods/*c81337cd-312b-4d45-a759-eb10775c843b/linkerd-proxy/*.log
level=info ts=2022-08-09T18:31:04.293801388Z caller=tailer.go:240 component=tailer msg="stopped tailing file" path=/var/log/pods/finwise_finwise-celery-7cf9d46c9d-sj5nr_c81337cd-312b-4d45-a759-eb10775c843b/linkerd-proxy/0.log
level=info ts=2022-08-09T18:31:04.293778888Z caller=tailer.go:152 component=tailer msg="tail routine: exited" path=/var/log/pods/finwise_finwise-celery-7cf9d46c9d-sj5nr_c81337cd-312b-4d45-a759-eb10775c843b/linkerd-proxy/0.log
level=info ts=2022-08-09T18:31:04.293758088Z caller=tailer.go:159 component=tailer msg="tail routine: tail channel closed, stopping tailer" path=/var/log/pods/finwise_finwise-celery-7cf9d46c9d-sj5nr_c81337cd-312b-4d45-a759-eb10775c843b/linkerd-proxy/0.log reason=null

I will keep looking at it. Apologies for the uninformed conjecture, but I figured it might be worthwhile to someone.

erewok avatar Aug 09 '22 19:08 erewok

@kavirajk Any progress on this? It still happens in our setup.

michaelkebe avatar Nov 08 '22 10:11 michaelkebe

I don't know. Maybe this is related: https://github.com/grafana/tail/issues/9 and the closing pull request https://github.com/grafana/tail/pull/10.

In our setup promtail consumes logs from an NFS mount.

michaelkebe avatar Nov 16 '22 10:11 michaelkebe

Same here on EKS with Linux nodes. The agent stops sending new logs when this happens, and it must be forcibly terminated in order for a new pod to get rescheduled.

daper avatar Dec 14 '22 12:12 daper

I am using 2.7.0; sometimes when a log file is removed and then recreated, promtail gets stuck and no logs are read afterward.

Promtail log shows following sequence:

ts=2022-12-20T06:39:51.126377231Z caller=log.go:168 level=info msg="Waiting for ****.log to appear..."
...
level=info ts=2022-12-20T06:39:53.846332941Z caller=filetargetmanager.go:175 msg="received file watcher event" name=****.log op=CREATE

It felt unbelievable to find this code used by promtail when I tried to dig into https://github.com/grafana/loki/blob/88f04beb4ef39fa57db20ea21d546fe0a2675fb8/vendor/github.com/grafana/tail/tail.go#L254-L268

The retry counter never counts down to zero...
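To illustrate the shape of the bug, a small sketch (not the vendored tail.go code): a retry loop whose counter is never decremented can never give up, while the fixed variant counts down and eventually bails out with the "gave up trying to reopen" error:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// reopenWithRetries sketches the fixed behaviour: each failed reopen attempt
// consumes one retry, so the loop eventually gives up. Dropping the
// `retries--` line reproduces the described bug, where the counter never
// reaches zero and the loop spins forever.
func reopenWithRetries(retries int, reopen func() error) error {
	for {
		if err := reopen(); err == nil {
			return nil
		}
		if retries <= 0 {
			return errors.New("gave up trying to reopen log file with a different handle")
		}
		retries-- // the countdown that was missing in the buggy variant
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	err := reopenWithRetries(3, func() error { return errors.New("file still missing") })
	fmt.Println(err) // gave up trying to reopen log file with a different handle
}
```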

zq-david-wang avatar Dec 20 '22 07:12 zq-david-wang

After fixing the silly retry countdown bug, the following logs show up:

Dec 21 06:00:22 iZuf6h6buntpmoj3kqqdrgZ promtail[17833]: level=info ts=2022-12-20T22:00:22.259371564Z caller=tailer.go:159 component=tailer msg="tail routine: tail channel closed, stopping tailer" path=****.log reason="gave up trying to reopen log file with a different handle"
Dec 21 06:00:22 iZuf6h6buntpmoj3kqqdrgZ promtail[17833]: level=info ts=2022-12-20T22:00:22.383426439Z caller=tailer.go:159 component=tailer msg="tail routine: tail channel closed, stopping tailer" path=****.log reason="gave up trying to reopen log file with a different handle"

Meaning that without the fix promtail could somehow get stuck in a loop, and the countdown/timeout is necessary for promtail to give up and retry. I have run the fixed code for more than 24 hours, with the log file deleted and recreated several times, and every log stream seems OK.

zq-david-wang avatar Dec 21 '22 09:12 zq-david-wang

@kavirajk any chance you could help fix this by getting https://github.com/grafana/tail/pull/16 merged?

natalya-onramp avatar May 10 '23 13:05 natalya-onramp

https://github.com/grafana/tail/pull/16 solved some failure modes in this code, but there is also an unrelated goroutine leak, which is addressed by https://github.com/grafana/loki/pull/9495. Please let us know if you are still experiencing the issues after these fixes.

thampiotr avatar May 23 '23 09:05 thampiotr

Is this still a problem? It sounds like the PRs @thampiotr mentioned fixed it.

cstyan avatar Nov 08 '23 01:11 cstyan

The fixes above have resolved the same symptoms in our internal cluster, so I believe it can be closed.

thampiotr avatar Nov 13 '23 12:11 thampiotr