
in_tail: Fluentd process goes unresponsive/hangs and stops tailing any of the log files

Open SriramDuvvuri opened this issue 1 year ago • 15 comments

Describe the bug

As commented earlier in https://github.com/fluent/fluentd/issues/3614#issuecomment-1231158274, Fluentd goes unresponsive or hangs.

We tried upgrading Fluentd from v1.11.5 to v1.15.2 on one of the servers where we had the issue, but we still face it with the latest build.

[root@aau-lgiu06fms1 06:36:28]:/var/log/td-agent> ls -ltr
total 10024
drwxr-xr-x 2 td-agent td-agent       6 Aug 22 23:54 buffer
-rw-r--r-- 1 td-agent td-agent   59796 Aug 25 03:17 td-agent.log-20220825.gz
-rw-r----- 1 td-agent td-agent   35586 Aug 25 11:00 td-agent.log-20220826.gz
-rw-r----- 1 td-agent td-agent  129378 Aug 26 19:33 td-agent.log-20220827.gz
-rw-r----- 1 td-agent td-agent  878069 Aug 29 20:38 td-agent.log-20220830.gz
-rw-r----- 1 td-agent td-agent   53933 Aug 31 03:47 td-agent.log-20220831.gz
-rw-r--r-- 1 td-agent td-agent     306 Aug 31 15:05 t1.pos
-rw-r--r-- 1 td-agent td-agent      89 Aug 31 15:13 o1.pos
-rw-r--r-- 1 td-agent td-agent      73 Aug 31 15:14 s3.pos
-rw-r--r-- 1 td-agent td-agent      73 Aug 31 15:14 s2.pos
-rw-r--r-- 1 td-agent td-agent      96 Aug 31 15:14 b1.pos
-rw-r--r-- 1 td-agent td-agent     198 Aug 31 15:14 a1.pos
-rw-r--r-- 1 td-agent td-agent   19583 Aug 31 15:14 a2.pos
-rw-r--r-- 1 td-agent td-agent   13033 Aug 31 15:14 a3.pos
-rw-r--r-- 1 td-agent td-agent      68 Aug 31 15:14 s1.pos
-rw-r----- 1 td-agent td-agent 9027630 Aug 31 18:25 td-agent.log-20220901
-rw-r----- 1 td-agent td-agent       0 Sep  1 03:23 td-agent.log

To Reproduce

No error log messages were found in the logs.

Expected behavior

Logs should continue to be pushed as usual after file rotation, with Fluentd tailing the rotated files.

Your Environment

- Fluentd version: 1.11.5 / 1.15.2
- TD Agent version: 3.8.1 / 4.4.1
- Operating system: RHEL 7.9
- Kernel version: 3.10.0-1160.36.2.el7.x86_64

Your Configuration

<source>
  @type tail
  path /home/server/xx*.log
  exclude_path ["/home/server/xxyy*.log"]
  pos_file /var/log/td-agent/xxlog.pos
  tag xx.log
  enable_stat_watcher false
  read_from_head true
  follow_inodes true
  refresh_interval 5
  rotate_wait 5
  skip_refresh_on_startup true
  pos_file_compaction_interval 30m
  <parse>
    @type multiline
    format_firstline /<20/
    format1 /<(?<logDate>.*?)>-<(?<logThread>.*?)>-<(?<level>.*?)>-<(?<logClass>.*?)>-(?<logMessage>.*)/
    time_key logDate
    time_format %Y/%m/%d-%H:%M:%S.%N
  </parse>
</source>

Your Error Log

No specific error was identified apart from the "pattern not matched" warnings:

2022-09-01 06:49:17 +0100 [info]: #0 adding forwarding server 'XXXXXXXX:24224' host="XXXXXXX" port=24224 weight=60 plugin_id="object:94c"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: #0 starting fluentd worker pid=10319 ppid=10316 worker=0
2022-09-01 06:49:17 +0100 [info]: #0 delayed_commit_timeout is overwritten by ack_response_timeout
2022-09-01 06:49:17 +0100 [info]: #0 following tail of /usr/ops/e1.log
2022-09-01 06:49:17 +0100 [warn]: #0 pattern not matched: "2022-08-31 15:15:01 [/usr/ops] /usr/ops/e1.log  {}  -- "
2022-09-01 06:49:17 +0100 [warn]: #0 pattern not matched: "2022-08-31 15:15:01 [/usr/ops] :: total        used        free      shared  buff/cache   available"
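(For context, the two warnings above are expected: the <parse> section only treats lines beginning with <20 as the start of a record, and these system-stats lines do not. A hypothetical line that would match format1 looks like the following; the field values are illustrative and not taken from the actual logs.)

<2022/08/31-15:14:59.123456789>-<main>-<INFO>-<com.example.Service>-Request completed in 120 ms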

Additional context

We are seeing this issue on both 1.11.5 and 1.15.2 versions.

SriramDuvvuri avatar Sep 01 '22 06:09 SriramDuvvuri

@ashie please let me know if any workaround or remediation is available for this td-agent unresponsiveness issue.

Thanks in advance.

SriramDuvvuri avatar Sep 06 '22 05:09 SriramDuvvuri

Could you try to get a stack trace of Fluentd's worker process by sending the CONT signal?

  • https://docs.fluentd.org/deployment/trouble-shooting#dump-fluentds-internal-information
  • https://github.com/fluent/sigdump#usage

ashie avatar Sep 09 '22 09:09 ashie

$ kill -CONT <pid of fluentd worker>
$ cat /tmp/sigdump-<pid of fluentd worker>.log
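
If it is unclear which PID belongs to the worker, one way to list the Fluentd/td-agent processes with their parent PIDs is shown below (a sketch; the grep pattern assumes the command lines contain "fluentd" or "td-agent"):

$ ps -eo pid,ppid,cmd | grep -E '[f]luentd|[t]d-agent'
# the supervisor's parent is the service manager (PPID 1 or systemd);
# the worker's PPID is the supervisor's PID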

ashie avatar Sep 09 '22 09:09 ashie

@ashie Thanks for the response. Here are the extracted sigdump files from two instances: sigdmp1.txt sigdmp2.txt. Please verify and let me know if you find any issues.

Please let me know if any information needed for further investigation.

SriramDuvvuri avatar Sep 09 '22 12:09 SriramDuvvuri

@ashie did you get a chance to go through the above sigdump files? Please let me know if you need any further details to identify the issue.

SriramDuvvuri avatar Sep 14 '22 09:09 SriramDuvvuri

It seems that your dump is the supervisor's, not the worker's. Fluentd launches one parent process (the supervisor) and one child process (the worker) by default.

[Fluentd Supervisor]
            |
            +-- [Fluentd Worker]

We need the worker's dump. If you don't know which process is the worker, please get dumps for all Fluentd processes on your system.
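
A minimal sketch for dumping every Fluentd process at once (assuming the process command lines contain "fluentd" and sigdump writes to its default path under /tmp):

$ pkill -CONT -f fluentd
$ ls /tmp/sigdump-*.log    # one dump per process, named by PID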

ashie avatar Sep 14 '22 09:09 ashie

sigdump-L2G.log sigdump-N3S.log @ashie Please find the attached sigdumps of the worker processes. Please verify and let me know if any further information is needed.

SriramDuvvuri avatar Sep 16 '22 12:09 SriramDuvvuri

@ashie Can you verify these files and let me know if you find any trace of the issue?

Thanks in advance.

SriramDuvvuri avatar Sep 19 '22 14:09 SriramDuvvuri

It seems that the Fluentd process isn't stuck; it's probably just waiting for the next line. I've noticed that you are using the multiline parser. In this case, in_tail won't emit an event until the next line starting with <20 arrives. How about using multiline_flush_interval?
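
For reference, a minimal sketch of where this parameter would go; multiline_flush_interval is an in_tail option set at the <source> level rather than inside <parse> (the 5s value is illustrative):

<source>
  @type tail
  # ... existing path / pos_file / <parse> settings ...
  multiline_flush_interval 5s
</source>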

ashie avatar Sep 20 '22 12:09 ashie

@ashie I tried setting multiline_flush_interval 5s, but the unresponsiveness of td-agent continues, as shown in these snippets:

[screenshot]

The log file shows this one line:

[screenshot]

Do I need to modify anything in the output config as well?

<match *.log>
  @type forward
  require_ack_response true
  expire_dns_cache 300
  dns_round_robin true
  transport tls
  tls_cert_path /etc/td-agent/ca-bundle.crt
  keepalive true
  keepalive_timeout 300s
  #include_timestamp true
  host XXXXXXXXXXX
  port XXXXX
  flush_interval 5s
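
For reference, newer out_forward configurations usually declare the destination in a <server> block and flush settings in a <buffer> block; a sketch with the host and port still masked (not a verified drop-in replacement for the exact config above):

<match *.log>
  @type forward
  require_ack_response true
  transport tls
  tls_cert_path /etc/td-agent/ca-bundle.crt
  <server>
    host XXXXXXXXXXX
    port XXXXX
  </server>
  <buffer>
    flush_interval 5s
  </buffer>
</match>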

SriramDuvvuri avatar Sep 21 '22 10:09 SriramDuvvuri

@ashie is there anything further I can try to see if that resolves the issue?

SriramDuvvuri avatar Sep 28 '22 11:09 SriramDuvvuri

@ashie Any update on this to mitigate the issue?

SriramDuvvuri avatar Jan 25 '23 19:01 SriramDuvvuri

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 7 days

github-actions[bot] avatar Apr 11 '23 10:04 github-actions[bot]

@ashie can we get any update on this? Is there any fix from the product end? We still face this issue, and it appears related to log file rotation.

SriramDuvvuri avatar Jan 08 '24 16:01 SriramDuvvuri

Do you mean that in_tail does not collect the contents of the file at all? Does restarting Fluentd not recover it? If so, #3614 may not be related. It may be that the config is wrong or that the file format is not supported. Could you give us some sample logs?

daipom avatar Jan 09 '24 08:01 daipom

It's been a while, so I'm closing this. If this issue still occurs, please reopen it.

daipom avatar Apr 10 '24 07:04 daipom