
Promtail stops shipping logs when an error is encountered in journald

Open tifling85 opened this issue 4 years ago • 30 comments

Describe the bug This problem is very similar to #2812. We have configured promtail, running as a systemd unit, to forward logs from journald. Promtail can start writing errors to journald while silently stopping the transfer of logs to Loki.

To Reproduce Steps to reproduce the behavior:

  1. Run promtail as a systemd unit, streaming and forwarding journald.
  2. Promtail logs "received error during sdjournal follow" (we are still trying to figure out what causes this error; "bad message" seems to indicate the stream encountered a non-text character).
  3. New log entries no longer appear in Loki until the promtail systemd unit is restarted.

Expected behavior Promtail continues to stream and forward logs, or exits so it can be restarted by systemd.

Environment: Infrastructure:

  • Linux core-1.novalocal 4.18.0-305.19.1.el8_4.x86_64, AlmaLinux release 8.4 (Electric Cheetah)
  • systemd unit for promtail 2.3.0

Screenshots, Promtail config, or terminal output Log from promtail:

promtail[7497]: level=error ts=2021-10-07T04:48:15.338399632Z caller=journaltarget.go:186 msg="received error during sdjournal follow" err="failed to iterate journal: bad message"

There is a huge number of similar lines in the Promtail log.

Promtail config:

server:
  disable: true
  http_listen_address: 192.168.199.173
  http_listen_port: 9080
  grpc_listen_address: 192.168.199.173

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://192.168.199.173:3100/loki/api/v1/push

scrape_configs:

- job_name: journal
  journal:
    max_age: 12h
    labels:
      job: systemd-journal
      host: 192.168.199.173
  relabel_configs:
    - source_labels: ['__journal__systemd_unit']
      target_label: 'unit'
    - source_labels: ['__journal__hostname']
      target_label: 'hostname'
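The "bad message" error above is EBADMSG surfacing from sd_journal_next(), which in practice usually points at a corrupt journal file. A quick way to check on the affected host (a sketch using standard systemd tooling; exact privileges depend on the distro):

```shell
# Verify all on-disk journal files; corrupt files are reported explicitly.
journalctl --verify

# If corruption is found, rotating the journal archives the bad files so
# that readers such as promtail continue from fresh ones.
sudo journalctl --rotate
```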

tifling85 avatar Oct 07 '21 05:10 tifling85

The same problem

level=error ts=2021-10-25T02:58:46.562434288Z caller=journaltarget.go:187 msg="received error during sdjournal follow" err="Timeout expired"
level=error ts=2021-10-25T02:58:46.562514578Z caller=journaltarget.go:190 msg="unable to follow journal" err="Timeout expired"
- job_name: journal
  journal:
    max_age: 1h
    labels:
      _app_: journal
      _env_: dev
  relabel_configs:
  - source_labels:
    - __journal__systemd_unit
    target_label: unit

wang1219 avatar Oct 25 '21 03:10 wang1219

Same here

tomjohnburton avatar Dec 10 '21 13:12 tomjohnburton

Also facing this problem using v2.4.1

ecusnir avatar Jan 03 '22 14:01 ecusnir

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar Mar 02 '22 20:03 stale[bot]

Bump!

tifling85 avatar Mar 03 '22 02:03 tifling85

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

stale[bot] avatar Apr 17 '22 06:04 stale[bot]

bump please

tifling85 avatar Apr 17 '22 13:04 tifling85

bump.

zilahu avatar May 23 '22 17:05 zilahu

bump.

fluential avatar Jun 27 '22 14:06 fluential

up

rmalenko avatar Aug 05 '22 09:08 rmalenko

same problem here...

yurividal avatar Aug 21 '22 07:08 yurividal

+1, promtail 2.6.1 binary release (currently running in a console, without a systemd unit), host CentOS 8 Stream, systemd-239-58.el8.x86_64

level=error ts=2022-09-05T11:31:13.498145588Z caller=journaltarget.go:192 msg="received error during sdjournal follow" err="Timeout expired"
level=error ts=2022-09-05T11:31:13.498291736Z caller=journaltarget.go:195 msg="unable to follow journal" err="Timeout expired"

cervajs avatar Sep 05 '22 14:09 cervajs

same problem here (fc36)

jmcfarlane avatar Oct 07 '22 01:10 jmcfarlane

Actually mine was not the same problem; I was missing:

--volume=/etc/machine-id:/etc/machine-id:ro

:+1:
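For anyone hitting this in a container: the journal target needs both the journal directory and the host's /etc/machine-id mounted inside the container. A sketch of a full invocation (the image tag and config path are assumptions, not from this thread):

```shell
docker run -d --name promtail \
  --volume=/var/log/journal:/var/log/journal:ro \
  --volume=/run/log/journal:/run/log/journal:ro \
  --volume=/etc/machine-id:/etc/machine-id:ro \
  --volume="$(pwd)/promtail-config.yaml:/etc/promtail/config.yaml:ro" \
  grafana/promtail:2.6.1 -config.file=/etc/promtail/config.yaml
```

Without the machine-id mount, promtail cannot locate the host's journal directory, which is named after the machine-id.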

jmcfarlane avatar Oct 07 '22 05:10 jmcfarlane

UP

Pikw avatar Oct 14 '22 14:10 Pikw

Same for me. Does anyone have a workaround for it?

esthon avatar Oct 23 '22 14:10 esthon

Hey,

The original issue posted seems to be different from some of the discussion threads, but maybe this will help those who see this error: msg="unable to follow journal" err="Timeout expired"

I had the same error; the reason was that I do not have persistent storage enabled for the journal (i.e. I have a /run/log/journal path, but no /var/log/journal path).

The docs suggest promtail will check both paths, but in my case I had to explicitly add the path option and limit it to the runtime journal, e.g.:

  - job_name: journal
    journal:
      path: /run/log/journal
      max_age: 1h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'

As soon as I added that path everything started working. I'm running promtail v2.6.1 on CentOS 7.
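Whether /var/log/journal (persistent) or /run/log/journal (runtime-only) exists depends on journald's Storage= setting, so the right value for journal.path varies per host. A small sketch of that decision as shell (pick_journal_path is a hypothetical helper, not part of promtail):

```shell
# Return the first journal directory that actually exists on this host,
# i.e. the directory promtail's `journal.path` option should point at.
pick_journal_path() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# On a real host you would call:
#   pick_journal_path /var/log/journal /run/log/journal
```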

ashdavid avatar Oct 28 '22 08:10 ashdavid

I am also having the "unable to follow journal" "Timeout expired" issue. It applies to all my journald scrapers and thereby fully undermines confidence in journald scraping.

Here is the line of code where this error is thrown.

Schmoho avatar Dec 06 '22 11:12 Schmoho

I had the same issue but fixed it by replacing path: /run/log/journal with path: /var/log/journal.

In my Ubuntu 22.04 VM the journal logs are under /var/log/journal.

ipHeaders avatar Dec 27 '22 19:12 ipHeaders

I've noticed this same problem sometimes occurs when the network connection between promtail and Loki is severed. I'm not able to reproduce it reliably, but it's worth noting as a data point here.

Logicwax avatar Jan 06 '23 00:01 Logicwax

Actually, I'm able to reliably reproduce this now. The issue is that I have promtail connecting to a remote host via an SSH tunnel. If I take the tunnel down, severing it in any way, and bring it back up, promtail never reconnects properly; the same errors keep appearing until I restart promtail.

Logicwax avatar Feb 03 '23 06:02 Logicwax

Running into this too:

level=error ts=2023-04-25T19:01:05.140074112Z caller=journaltarget.go:205 msg="received error during sdjournal follow" err="failed to iterate journal: bad message"

Is there a workaround, short of deleting the entire systemd journal?

remram44 avatar Apr 25 '23 19:04 remram44

@remram44 yes, just restart promtail (sudo systemctl restart promtail) and it should sync back up.
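Until a release contains a proper fix, one way to automate that restart is a systemd drop-in so that systemd restarts promtail whenever it exits (a sketch; the unit name and drop-in path are assumptions):

```ini
# /etc/systemd/system/promtail.service.d/restart.conf
[Service]
Restart=always
RestartSec=5
```

Run systemctl daemon-reload after adding the drop-in. Note this only helps once promtail actually exits on the error rather than hanging.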

Logicwax avatar Apr 25 '23 21:04 Logicwax

https://github.com/grafana/loki/pull/9155 this should fix the issue by letting promtail restart automatically

farodin91 avatar Apr 28 '23 08:04 farodin91

#9155 this should fix the issue by letting promtail restart automatically

I just observed a constant stream of these errors:

promtail-linux-amd64[3829221]: level=error ts=2023-07-19T08:51:50.694001179Z caller=journaltarget.go:205 msg="received error during sdjournal follow" err="failed to iterate journal: bad message"

with Promtail 2.8.2.

A restart of Promtail then "fixed" it. Isn't this supposed to be covered by #9155 or likely a new issue @dannykopping @farodin91?

frittentheke avatar Jul 19 '23 08:07 frittentheke

As far as I can tell, the fix was backported May 5 (#9403) and 2.8.2 was May 3, so it will be in the next release.

remram44 avatar Jul 19 '23 16:07 remram44

Is anyone still experiencing this issue on the latest 2.8.x versions?

cstyan avatar Nov 08 '23 01:11 cstyan

version 2.9.2 the problem is still there

tifling85 avatar Nov 19 '23 09:11 tifling85

Can confirm that the 2.9.2 is affected. I have reports that some 2.9.2 promtails are experiencing this behaviour more often than on 2.6.4 before.

aabramov-ionos avatar Feb 13 '24 13:02 aabramov-ionos