
in_exec with out_exec in one config creates a file descriptor leak

Open LukasJerabek opened this issue 2 years ago • 22 comments

Describe the bug

Hello, I think I have found the core issue behind my problems with my fluentd configuration; could someone have a look at this, please? I have set everything up according to https://docs.fluentd.org/installation/before-install. Using the configuration and the two scripts below, I notice that if I keep doing this for a while:

cd /proc/<fluentd PID>/ 
find fd | wc -l

Then I can see that the number of file descriptors rises indefinitely, eventually resulting in memory problems and a "too many open files" error when the fd limit (ulimit -n) is reached. This only occurs if both in_exec and out_exec are used: if I make either of the following swaps, the number of file descriptors stabilizes and the problem disappears.
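To watch the growth over time, a small sampling loop like this can help (a sketch: the PID argument and the 1-second interval are illustrative, and it defaults to the shell's own PID so it runs standalone; pass the fluentd worker PID instead):

```shell
#!/bin/bash
# Print the fd count of a process a few times; with the leaking config
# the number keeps climbing instead of stabilizing.
pid=${1:-$$}                      # pass the fluentd worker PID here
for _ in 1 2 3; do                # in practice, loop until interrupted
  printf '%s fds=%d\n' "$(date +%T)" "$(find "/proc/$pid/fd" | wc -l)"
  sleep 1
done
```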

The file descriptor count stabilizes if I swap the match section for this:

<match hardware.memory>
  @type file
  append true
  path {{ home }}/fluentd/test_bug2.log
  add_path_suffix false
  <buffer []>
    @type memory
    flush_mode immediate
  </buffer>
  <format>
    @type json
  </format>
</match>

or if I swap the source and filter sections for this:

<source>
  @type sample
  sample {"mem_available":"1.5","swap_total":"1.5","swap_free":"1.2","time_local":"2022-02-09T10:00:00+0100"}
  tag hardware.memory
</source>

To Reproduce

  1. Install fluentd: gem install fluentd --version 1.14.5 --no-doc
  2. Create a folder "fluentd" in your home folder
  3. Replace {{ home }} or {{ user }} with your real values in every file below
  4. Put fluentd.conf (with the configuration below), get_memory_metrics.sh and psql_memory.sh into {{ home }}/fluentd/
  5. chmod +x get_memory_metrics.sh psql_memory.sh
  6. Run {{ home }}/bin/fluentd 1.14.5 -c {{ home }}/fluentd/fluentd.conf -vv

Now start checking the file descriptors in /proc/<fluentd PID>/ with: find fd | wc -l. Keep checking it for some time; you'll see that the count of file descriptors never stops rising.

Hopefully I did not miss anything.

Expected behavior

The number of file descriptors does not rise indefinitely and does not cause memory leaks.

Your Environment

- Fluentd version: 1.14.5 (installed as gem)
- TD Agent version: -
- Operating system: Rocky Linux 8
- Kernel version: 4.18.0-348.12.2.el8_5.x86_64
- Ruby version: 2.5.9

Your Configuration

<system>
  process_name fluentd_monitor
</system>

<source>
  @type exec
  tag hardware.memory
  run_interval 10s
  command {{ home }}/fluentd/get_memory_metrics.sh
  <parse>
    @type json
    types mem_available:float,swap_total:float,swap_free:float
  </parse>
</source>

<filter hardware.memory>
  @type record_transformer
  enable_ruby
  <record>
    time_local ${time.strftime('%Y-%m-%dT%H:%M:%S%z')}
  </record>
</filter>

<match hardware.memory>
  @type exec
  command {{ home }}/fluentd/psql_memory.sh first_argument second_argument
  <format>
    @type csv
    force_quotes false
    delimiter |
    fields time_local,mem_available,swap_total,swap_free
  </format>
  <buffer time>
    @type file
    path {{ home }}/fluentd/hardware.memory.buff
    timekey 50s
    timekey_wait 0s
    flush_mode lazy
    total_limit_size 200MB
    retry_type periodic  # exponential backoff is broken https://github.com/fluent/fluentd/issues/3609
    retry_max_times 0  # need to go to secondary as soon as possible https://github.com/fluent/fluentd/issues/3610
  </buffer>
  <secondary>
    @type secondary_file
    directory {{ home }}/fluentd/secondary/
    basename dump.${chunk_id}
  </secondary>
</match>

Your Error Log

No errors in fluentd.log using -vv

Additional context

get_memory_metrics.sh

#!/bin/bash
mem_info=$(awk '{ \
    if (/MemAvailable:/) {mem_available=$2}; \
    if (/SwapTotal:/) {swap_total=$2}; \
    if (/SwapFree:/) {swap_free=$2}; \
    if (mem_available && swap_total && swap_free) \
      {printf "%.3f %.3f %.3f", mem_available/1024/1024, swap_total/1024/1024, swap_free/1024/1024;exit} \
  }' /proc/meminfo)
read -r mem_available swap_total swap_free<<<"${mem_info}"

printf '{"mem_available":"%s","swap_total":"%s","swap_free":"%s"}' \
"$mem_available" \
"$swap_total" \
"$swap_free"

psql_memory.sh

#!/bin/sh
echo $1 $2 $3

LukasJerabek avatar Feb 09 '22 14:02 LukasJerabek

It seems to be reproducible with a simpler in_exec-only example. It may also be reproducible on the out_exec side.

<system>
  process_name fluentd_monitor
</system>

<source>
  @type exec
  tag hardware.memory
  run_interval 10s
  command PATH_TO_COMMAND
  <parse>
    @type json
    types mem_available:float,swap_total:float,swap_free:float
  </parse>
</source>

<match hardware.memory>
  @type stdout
</match>

kenhys avatar Feb 15 '22 06:02 kenhys

I fixed get_memory_metrics.sh in my example, which did not work; the printf part was missing. I am going to try your example, but I was not able to reproduce this with only in_exec or only out_exec in the config; both had to be present.

LukasJerabek avatar Feb 15 '22 07:02 LukasJerabek

Yeah, I can confirm that your example causes this too. So it's not the combination of both that causes this.

EDIT: wrong, it was too soon to judge

LukasJerabek avatar Feb 15 '22 08:02 LukasJerabek

I am sorry, it was too soon to judge; it has stabilized at around 174 FDs. So my first thought still applies: I observe the infinite rise of FDs only with the combination of both in_exec and out_exec.

LukasJerabek avatar Feb 15 '22 08:02 LukasJerabek

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days

github-actions[bot] avatar Jun 13 '22 10:06 github-actions[bot]

Bump.

LukasJerabek avatar Jun 13 '22 11:06 LukasJerabek

I checked it but couldn't reproduce it. I modified your config a little because it didn't work properly for me:

fluent.conf

<system>
  process_name fluentd_monitor
</system>

<source>
  @type exec
  tag hardware.memory
  run_interval 10s
  command /home/aho/Projects/Fluentd/fluentd/issues/3627-exec-memleak/get_memory_metrics.sh
  <parse>
    @type json
    types mem_available:float,swap_total:float,swap_free:float
  </parse>
</source>

<filter hardware.memory>
  @type record_transformer
  enable_ruby
  <record>
    time_local ${time.strftime('%Y-%m-%dT%H:%M:%S%z')}
  </record>
</filter>

<match hardware.memory>
  @type exec
  command /home/aho/Projects/Fluentd/fluentd/issues/3627-exec-memleak/psql_memory.sh
  <format>
    @type csv
    force_quotes false
    delimiter |
    fields time_local,mem_available,swap_total,swap_free
  </format>
  <buffer time>
    @type file
    path /home/aho/Projects/Fluentd/fluentd/issues/3627-exec-memleak/hardware.memory.buff
    timekey 50s
    timekey_wait 0s
    flush_mode lazy
    total_limit_size 200MB
    retry_type periodic  # exponential backoff is broken https://github.com/fluent/fluentd/issues/3609
    retry_max_times 0  # need to go to secondary as soon as possible https://github.com/fluent/fluentd/issues/3610
  </buffer>
</match>

get_memory_metrics.sh:

#!/bin/bash
mem_info=$(awk '{ \
    if (/MemAvailable:/) {mem_available=$2}; \
    if (/SwapTotal:/) {swap_total=$2}; \
    if (/SwapFree:/) {swap_free=$2}; \
    if (mem_available && swap_total && swap_free) \
      {printf "%.3f %.3f %.3f", mem_available/1024/1024, swap_total/1024/1024, swap_free/1024/1024;exit} \
  }' /proc/meminfo)
read -r mem_available swap_total swap_free<<<"${mem_info}"

printf '{"mem_available":"%s","swap_total":"%s","swap_free":"%s"}' \
"$mem_available" \
"$swap_total" \
"$swap_free"

(<<< doesn't work for me so I modified the shebang to use bash explicitly.)

psql_memory.sh:

#!/bin/sh
cat $1 >> /home/aho/Projects/Fluentd/fluentd/issues/3627-exec-memleak/memusage.txt
  • The argument of the command is a path of buffer chunk file.
  • Dump the contents to a file to confirm the command is called expectedly.

Environment:

  • Ubuntu 22.04
  • Ruby 3.0.2 & 2.7.5
  • Fluentd 1.15.1

Results:

$ sudo find /proc/2591/fd | wc -l
13
$ cat /home/aho/Projects/Fluentd/fluentd/issues/3627-exec-memleak/memusage.txt
2022-07-28T21:43:28+0900|22.107|15.999|15.983
2022-07-28T21:43:38+0900|22.107|15.999|15.983
2022-07-28T21:43:48+0900|22.107|15.999|15.983
...

The count of fds is always 13.

ashie avatar Jul 28 '22 12:07 ashie

It might depend on installed gems (low possibility though), please try updating them.

ashie avatar Jul 28 '22 13:07 ashie

(<<< doesn't work for me so I modified the shebang to use bash explicitly.)

I don't see any difference in the script you posted? There is a bash shebang in my version of the script as well.

The only difference I see is that you are missing the secondary section. Could you try it with that? The {{ home }} was meant to be rewritten according to your environment.

I will try again later and post if I still observe the problem.

LukasJerabek avatar Jul 28 '22 13:07 LukasJerabek

Also, did you try sudo find /proc/2591/fd | wc -l about 15 minutes after fluentd was started, or only at the beginning?

LukasJerabek avatar Jul 28 '22 13:07 LukasJerabek

The only difference I see is that you are missing the secondary section. Could you try it with that?

I also tried secondary but got almost same result.

Also, did you try sudo find /proc/2591/fd | wc -l about 15 minutes after fluentd was started, or only at the beginning?

I tried it every 1 sec for 2 hours.

ashie avatar Jul 28 '22 13:07 ashie

I still see the problem even with version 1.15.1. Roughly 10 minutes in there are 90 fds in my case, after 20 minutes 160, after 30 minutes 233, and so on...

  1. install vagrant https://www.vagrantup.com/downloads
  2. create VagrantFile:
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "rockylinux/8"
  config.vm.box_check_update = true
  config.vm.hostname = "rocky8"
  config.vm.network "private_network", ip: "192.168.56.121"
  config.vm.define "rocky8"
  config.vm.provider "virtualbox" do |vb|
    vb.name = "rocky8"
    vb.gui = false
    vb.memory = 1024
    vb.cpus = 1
    vb.customize [ "guestproperty", "set", :id, "/VirtualBox/GuestAdd/VBoxService/--timesync-set-threshold", 1000 ]
  end
  config.vm.network :forwarded_port, guest: 22, host: 4444, id: 'ssh'
  config.ssh.extra_args = ["-L", "8000:localhost:8000"]
end
  3. vagrant up
  4. vagrant ssh
  5. sudo dnf -y module enable ruby:3.0
  6. sudo dnf install gcc make cmake ruby ruby-devel
  7. gem install fluentd --version 1.15.1 --no-doc
  8. cd ~
  9. mkdir fluentd
  10. cd fluentd
  11. create fluentd.conf, get_memory_metrics.sh, psql_memory.sh from my first post above
  12. switch {{ home }} to /home/vagrant in all files from my first post above
  13. chmod +x on both .sh files
  14. cd ~/fluentd
  15. ~/bin/fluentd -c fluentd.conf
  16. start another shell (vagrant ssh from the vagrant folder again)
  17. ps aux | grep fluentd
  18. take the PID of worker:fluentd_monitor
  19. cd /proc/<PID>
  20. find fd | wc -l
  21. observe the forever-rising number of fds
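The PID lookup and fd count at the end can also be collapsed into one command (a sketch: 'worker:fluentd_monitor' comes from the process_name in the config, pgrep is assumed to be available, and it falls back to the current shell's PID so the snippet runs even without fluentd present):

```shell
#!/bin/bash
# Locate the fluentd worker via its process_name and count its fds.
pid=$(pgrep -f 'worker:fluentd_monitor' | head -n1)
pid=${pid:-$$}                    # fallback so the sketch runs standalone
find "/proc/$pid/fd" | wc -l
```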

LukasJerabek avatar Jul 28 '22 19:07 LukasJerabek

I use Vagrant in this example, but I have seen this on a rented KVM cloud as well.

LukasJerabek avatar Jul 28 '22 19:07 LukasJerabek

Thanks, it reproduced on Rocky Linux 8:

[vagrant@localhost ~]$ ls -l /proc/5844/fd/
total 0
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 0 -> /dev/pts/1
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 1 -> /dev/pts/1
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 10 -> 'pipe:[58579]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 12 -> 'pipe:[58126]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 13 -> 'pipe:[58620]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 14 -> 'pipe:[58589]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 15 -> 'pipe:[58600]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 16 -> 'pipe:[58608]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 17 -> 'pipe:[58273]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 18 -> 'pipe:[58630]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 19 -> 'pipe:[58184]'
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 2 -> /dev/pts/1
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 20 -> 'pipe:[58642]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 21 -> 'pipe:[58654]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 22 -> 'pipe:[58674]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 23 -> 'pipe:[58336]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 24 -> 'pipe:[58697]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 25 -> 'pipe:[58679]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 26 -> 'pipe:[58297]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 27 -> 'pipe:[58729]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 28 -> 'pipe:[58739]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 29 -> 'pipe:[58817]'
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 3 -> 'anon_inode:[eventfd]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:22 30 -> 'pipe:[58346]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 31 -> 'pipe:[58322]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:22 32 -> 'pipe:[58776]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:22 33 -> 'pipe:[58784]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:22 34 -> 'pipe:[58793]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:23 36 -> 'pipe:[58826]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:23 37 -> 'pipe:[58835]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:22 38 -> 'pipe:[58802]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:23 39 -> 'pipe:[58846]'
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 4 -> 'anon_inode:[eventfd]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:23 40 -> 'pipe:[58852]'
l-wx------. 1 vagrant vagrant 64 Jul 29 12:23 41 -> /vagrant/3627-exec-memleak/hardware.memory.buff/buffer.b5e4f0b73baa41208d74aab41afc22048.log.meta
lr-x------. 1 vagrant vagrant 64 Jul 29 12:23 43 -> 'pipe:[58861]'
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 5 -> 'anon_inode:[eventpoll]'
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 6 -> 'anon_inode:[eventpoll]'
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 8 -> /vagrant/3627-exec-memleak/hardware.memory.buff/buffer.b5e4f0b73baa41208d74aab41afc22048.log
l-wx------. 1 vagrant vagrant 64 Jul 29 12:20 9 -> 'pipe:[58532]'
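The listing above is dominated by pipe: entries. To tally the fd targets by kind (with the leak, the pipe bucket is the one that keeps growing), something like this works; /proc/self is used here so the snippet is self-contained, so substitute the fluentd worker PID:

```shell
#!/bin/bash
# Group /proc/<pid>/fd symlink targets by kind (pipe, socket, file, ...),
# stripping the inode suffix so identical kinds collapse into one bucket.
for fd in /proc/self/fd/*; do
  readlink "$fd" 2>/dev/null
done | sed 's/:\[.*\]//' | sort | uniq -c | sort -rn
```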

ashie avatar Jul 29 '22 12:07 ashie

Data analyzed with fluent-plugin-sigdump (fluentd-issue-3627-exec-memory-leak): objects are linearly incremented.

They aren't freed even after a major GC is triggered.

ashie avatar Jul 29 '22 14:07 ashie

With #3844 the file descriptor leak seems to be fixed, but the following objects are still linearly incremented:

fluentd-issue-3627-exec-memory-leak Objects linearly incremented (with PR#3844)

ashie avatar Aug 04 '22 00:08 ashie

Although the file descriptor leak is resolved, the problem described above is not yet resolved. Reopening this.

ashie avatar Aug 10 '22 06:08 ashie

Hi, any luck on this one?

LukasJerabek avatar Nov 22 '22 10:11 LukasJerabek

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 7 days

github-actions[bot] avatar Apr 12 '23 10:04 github-actions[bot]

I think this is still not solved completely...

LukasJerabek avatar Apr 12 '23 10:04 LukasJerabek

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 7 days

github-actions[bot] avatar May 13 '23 10:05 github-actions[bot]

I think this is still not solved completely...

LukasJerabek avatar May 13 '23 10:05 LukasJerabek