
Memory leak

Open Mosibi opened this issue 3 years ago • 31 comments

Describe the bug

I noticed a memory issue that looks to me like a leak. I had a more complex configuration, but boiled it down to this simple config to rule out anything specific to my setup. Even with this config (see below), I observe the memory leak.

[Graph: fluentd container memory usage]

I run fluentd 1.12.3 in a container on a Kubernetes cluster, deployed via Helm.

To Reproduce Deploy Fluentd 1.12.3 via Helm on a Kubernetes cluster with the provided config. Let it run for some time and observe the memory metrics.

Expected behavior No memory leak

Your Environment Fluentd 1.12.3 on Kubernetes 1.17.7

Your Configuration

    <system>
      log_level info
      log_event_verbose true
      root_dir /tmp/fluentd-buffers/
    </system>

    <label @OUTPUT>
      <match **>
        @type file
        path /var/log/fluentd-output
        compress gzip
        <buffer>
          timekey 1d
          timekey_use_utc true
          timekey_wait 10m
        </buffer>
      </match>
    </label>

    ###
    # systemd/journald
    ###
    <source>
      @type systemd
      tag systemd
      path /var/log/journal
      read_from_head true

      <entry>
        fields_strip_underscores true
        fields_lowercase true
      </entry>

      @label @OUTPUT
    </source>

Additional context Using my more complex configuration I observe the same memory growth, but there seems to be a correlation between the growth and the number of logs ingested. In other words, with more inputs (in my case, container logs), the (assumed) memory leak grows faster.
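To correlate the growth with buffer and ingest activity, fluentd's built-in monitor_agent input can expose per-plugin metrics (a minimal sketch; 24220 is the plugin's default port and can be changed):

    <source>
      @type monitor_agent
      bind 0.0.0.0
      port 24220
    </source>

Polling http://localhost:24220/api/plugins.json then reports buffer_queue_length and buffer_total_queued_size for each output, which helps separate a buffer backlog from an actual leak.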

Mosibi avatar May 28 '21 09:05 Mosibi

It seems related to #3342.

FYI: Recently, fluent-plugin-systemd 1.0.5 was released.

It fixes "Plug a memory leaks on every reload" https://github.com/fluent-plugin-systemd/fluent-plugin-systemd/pull/91

Is it still reproducible with fluent-plugin-systemd 1.0.5?

kenhys avatar May 31 '21 01:05 kenhys

I tested with the complete config and I still see the memory leak. Today I will run with only the systemd part to see if that makes a difference.

Mosibi avatar Jun 01 '21 06:06 Mosibi

I've been investigating a similar issue for a few days and observed that on fluentd 1.11.5 a null output with a file buffer also steadily consumes more memory, in contrast to a null output without a buffer section. I can rerun the tests on a newer fluentd.

sumo-drosiek avatar Jun 01 '21 09:06 sumo-drosiek

I've been investigating a similar issue for a few days and observed that on fluentd 1.11.5 a null output with a file buffer also steadily consumes more memory, in contrast to a null output without a buffer section. I can rerun the tests on a newer fluentd.

Ah, that is good information. Please share the results with the latest fluentd version, if you can.

Is it possible to share the configuration you are using?

Mosibi avatar Jun 01 '21 12:06 Mosibi

This is my actual config

2021-06-01 12:30:47 +0000 [info]: using configuration file: <ROOT>
  <match fluentd.pod.healthcheck>
    @type relabel
    @label @FLUENT_LOG
  </match>
  <label @FLUENT_LOG>
    <match **>
      @type null
    </match>
  </label>
  <system>
    log_level info
  </system>
  <source>
    @type forward
    port 24321
    bind "0.0.0.0"
  </source>
  <match containers.**>
    @type relabel
    @label @NORMAL
  </match>
  <label @NORMAL>
    <match containers.**>
      @type copy
      <store>
        @type "null"
        <buffer>
          @type "file"
          path "/fluentd/buffer/logs.containers"
          compress gzip
          flush_interval 5s
          flush_thread_count 8
          chunk_limit_size 1m
          total_limit_size 128m
          queued_chunks_limit_size 128
          overflow_action drop_oldest_chunk
          retry_max_interval 10m
          retry_forever true
        </buffer>
      </store>
    </match>
  </label>
</ROOT>

sumo-drosiek avatar Jun 01 '21 12:06 sumo-drosiek

It seems related to #3342.

FYI: Recently, fluent-plugin-systemd 1.0.5 was released.

It fixes "Plug a memory leaks on every reload" fluent-plugin-systemd/fluent-plugin-systemd#91

Is it still reproducible with fluent-plugin-systemd 1.0.5?

Yes, also with fluent-plugin-systemd version 1.0.5 I see the same behaviour. I can attach a new graph, but it shows exactly the same as the first one.

Mosibi avatar Jun 01 '21 14:06 Mosibi

Thanks to @sumo-drosiek, I can confirm that no memory leak occurs when I do not use a buffer section in the output. I will rerun everything using a new config with a buffer section to be 100% sure.

The config without buffers

    <system>
      log_level info
      log_event_verbose true
      root_dir /tmp/fluentd-buffers/
    </system>

    <label @OUTPUT>
      <match **>
        @type null
      </match>
    </label>

    ###
    # systemd/journald
    ###
    <source>
      @type systemd
      tag systemd
      path /var/log/journal
      read_from_head true

      <entry>
        fields_strip_underscores true
        fields_lowercase true
      </entry>

      @label @OUTPUT
    </source>

[Graph: memory usage with the buffer-less config]

Mosibi avatar Jun 02 '21 06:06 Mosibi

I've confirmed the issue. I think it's not fluentd's issue, it's fluent-plugin-systemd's issue. Please forward the report to https://github.com/fluent-plugin-systemd/fluent-plugin-systemd

Details:

  • I cannot reproduce it by replacing <source> with other plugins such as in_tail (tailing a huge file).
  • The issue is caused by read_from_head of in_systemd. When it's specified, in_systemd tries to read all entries by running a busy loop: https://github.com/fluent-plugin-systemd/fluent-plugin-systemd/blob/9ffe8e14de75ca94e9bfe7428efd5c1a59421511/lib/fluent/plugin/in_systemd.rb#L144
  • Because of the busy loop, the Ruby interpreter cannot run garbage collection until it reaches the end of the journal.
  • I confirmed that it can be solved by forcing GC like this (it's too ad-hoc, though):
diff --git a/lib/fluent/plugin/in_systemd.rb b/lib/fluent/plugin/in_systemd.rb
index adb8b3f..66f4373 100644
--- a/lib/fluent/plugin/in_systemd.rb
+++ b/lib/fluent/plugin/in_systemd.rb
@@ -120,6 +120,7 @@ module Fluent
         init_journal if @journal.wait(0) == :invalidate
         watch do |entry|
           emit(entry)
+          GC.start
         end
       end
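A config-side mitigation worth noting (a sketch that assumes fluent-plugin-systemd's <storage> support; the storage path is an arbitrary example): persisting the journal cursor limits the full read_from_head scan, and with it the busy loop, to the very first start.

    <source>
      @type systemd
      tag systemd
      path /var/log/journal
      read_from_head true

      <storage>
        @type local
        persistent true
        path /var/log/fluentd-journald-cursor.json
      </storage>

      @label @OUTPUT
    </source>

On later restarts the plugin resumes from the stored cursor instead of re-reading the whole journal.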

ashie avatar Jun 03 '21 06:06 ashie

@ashie I just wanted to add the results of my test run with buffering configured in an output: without buffering there is no memory leak, and with buffering a memory leak occurs.

[Graph: memory usage with and without output buffering]

Mosibi avatar Jun 03 '21 06:06 Mosibi

So to be complete:

No memory leak:

    <label @OUTPUT>
      <match **>
        @type null
      </match>
    </label>

Memory leak:

    <label @OUTPUT>
      <match **>
        @type null

        <buffer>
          @type "file"
          path "/var/log/fluentd-buffers/output.buffer"
          compress gzip
          flush_interval 5s
          flush_thread_count 8
          chunk_limit_size 1m
          total_limit_size 128m
          queued_chunks_limit_size 128
          overflow_action drop_oldest_chunk
          retry_max_interval 10m
          retry_forever true
        </buffer>
      </match>
    </label>
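A hedged intermediate test, not one that was run in this thread, would keep a buffer section but switch it to a memory buffer, to check whether the growth follows buffering in general or the file buffer specifically:

    <label @OUTPUT>
      <match **>
        @type null

        <buffer>
          @type memory
          flush_interval 5s
          chunk_limit_size 1m
          total_limit_size 128m
          overflow_action drop_oldest_chunk
        </buffer>
      </match>
    </label>

If memory still grows with the memory buffer, the buffer layer itself (chunking and queuing) is the more likely suspect than the file backend.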

Mosibi avatar Jun 03 '21 06:06 Mosibi

Thanks, further investigation may be needed on the fluentd side. Reopening.

ashie avatar Jun 03 '21 06:06 ashie

The following graph displays the memory usage over almost 7 days. This is my regular config with only the buffer sections in the Elasticsearch outputs disabled. I was already sure that it was the buffer section; this is just more proof :)

[Graph: memory usage over ~7 days with the complete config, buffers disabled]

Mosibi avatar Jun 10 '21 06:06 Mosibi

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days

github-actions[bot] avatar Sep 08 '21 10:09 github-actions[bot]

@ashie any update on this?

sumo-drosiek avatar Sep 08 '21 10:09 sumo-drosiek

I am also facing this issue. Support has suggested setting up a cron job to restart the service until a fix is available: https://stackoverflow.com/questions/69161295/gcp-monitoring-agent-increasing-memory-usage-continuously.

komalpcg avatar Sep 24 '21 05:09 komalpcg

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days

github-actions[bot] avatar Dec 23 '21 10:12 github-actions[bot]

The issue is still here. Please remove the stale label.

levleontiev avatar Jan 20 '22 17:01 levleontiev

Can it be connected with #3634?

kvokka avatar Feb 20 '22 07:02 kvokka

Any update on this? I'm facing the same issue with a file buffer in the output config.

tzulin-chin avatar Mar 15 '22 01:03 tzulin-chin

For me, the salvation was setting

            prefer_oj_serializer true
            http_backend typhoeus
            buffer_type memory

in the elasticsearch match section.
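For context, a sketch of where these options would sit; the host, port, and index name are placeholders rather than values from this thread, and the <buffer> @type memory section is the current-style equivalent of buffer_type memory:

    <match containers.**>
      @type elasticsearch
      # placeholder host/port/index, adjust to your cluster
      host elasticsearch.example.internal
      port 9200
      index_name fluentd
      prefer_oj_serializer true
      http_backend typhoeus
      <buffer>
        @type memory
        flush_interval 5s
        total_limit_size 128m
      </buffer>
    </match>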

The aggregator pod still uses all the RAM you give it, up to the limit, until the RAM is eventually released. I also added

    extraEnvVars:
    # https://brandonhilkert.com/blog/reducing-sidekiq-memory-usage-with-jemalloc/?utm_source=reddit&utm_medium=social&utm_campaign=jemalloc
    - name: LD_PRELOAD
      value: /usr/lib/x86_64-linux-gnu/libjemalloc.so
    # https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html
    - name: MALLOC_ARENA_MAX
      value: "2"

This helped make things usable, but I still have no idea why it is so greedy with RAM.

(By the way, the forwarders behave normally.)

Of course, I had to manually pin the elasticsearch version to 7.17.

I wasted tons of time on this, and I'll be happy if any of these hints turn out to be useful.

kvokka avatar Mar 15 '22 02:03 kvokka

    # https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html
    - name: MALLOC_ARENA_MAX
      value: "2"

When you use jemalloc, MALLOC_ARENA_MAX doesn't make sense, so you can remove it.

ashie avatar Mar 15 '22 02:03 ashie

I'm facing the memory leak as well. Inputs I use:

  • forward
  • http for the liveness and readiness probes
  • systemd

Outputs:

  • null for fluent logs
  • opensearch with file buffer
  • stdout for health checks.

The file buffer is always flushed, but memory keeps increasing.

[Graph: memory usage]

Gem versions:

activesupport (7.0.2.3)
addressable (2.8.0)
aws-eventstream (1.2.0)
aws-partitions (1.579.0)
aws-sdk-core (3.130.2)
aws-sdk-kms (1.56.0)
aws-sdk-s3 (1.113.1)
aws-sdk-sqs (1.51.0)
aws-sigv4 (1.5.0)
benchmark (default: 0.1.0)
bigdecimal (default: 2.0.0)
bundler (2.3.12, 2.3.11)
cgi (default: 0.1.0.1)
concurrent-ruby (1.1.10)
cool.io (1.7.1)
csv (default: 3.1.2)
date (default: 3.0.3)
delegate (default: 0.1.0)
did_you_mean (default: 1.4.0)
digest-crc (0.6.4)
domain_name (0.5.20190701)
elastic-transport (8.0.0)
elasticsearch (8.1.2)
elasticsearch-api (8.1.2)
elasticsearch-xpack (7.17.1)
etc (default: 1.1.0)
excon (0.92.2)
faraday (1.10.0)
faraday-em_http (1.0.0)
faraday-em_synchrony (1.0.0)
faraday-excon (1.1.0)
faraday-httpclient (1.0.1)
faraday-multipart (1.0.3)
faraday-net_http (1.0.1)
faraday-net_http_persistent (1.2.0)
faraday-patron (1.0.0)
faraday-rack (1.0.0)
faraday-retry (1.0.3)
faraday_middleware-aws-sigv4 (0.6.1)
fcntl (default: 1.0.0)
ffi (1.15.5)
ffi-compiler (1.0.1)
fiddle (default: 1.0.0)
fileutils (default: 1.4.1)
fluent-config-regexp-type (1.0.0)
fluent-plugin-concat (2.5.0)
fluent-plugin-dedot_filter (1.0.0)
fluent-plugin-detect-exceptions (0.0.14)
fluent-plugin-elasticsearch (5.2.2)
fluent-plugin-grafana-loki (1.2.18)
fluent-plugin-kafka (0.17.5)
fluent-plugin-kubernetes_metadata_filter (2.10.0)
fluent-plugin-multi-format-parser (1.0.0)
fluent-plugin-opensearch (1.0.2)
fluent-plugin-prometheus (2.0.2)
fluent-plugin-record-modifier (2.1.0)
fluent-plugin-rewrite-tag-filter (2.4.0)
fluent-plugin-s3 (1.6.1)
fluent-plugin-systemd (1.0.5)
fluentd (1.14.6)
forwardable (default: 1.3.1)
getoptlong (default: 0.1.0)
http (4.4.1)
http-accept (1.7.0)
http-cookie (1.0.4)
http-form_data (2.3.0)
http-parser (1.2.3)
http_parser.rb (0.8.0)
i18n (1.10.0)
io-console (default: 0.5.6)
ipaddr (default: 1.2.2)
irb (default: 1.2.6)
jmespath (1.6.1)
json (default: 2.3.0, 2.1.0)
jsonpath (1.1.2)
kubeclient (4.9.3)
logger (default: 1.4.2)
lru_redux (1.1.0)
ltsv (0.1.2)
matrix (default: 0.2.0)
mime-types (3.4.1)
mime-types-data (3.2022.0105)
minitest (5.15.0, 5.13.0)
msgpack (1.5.1)
multi_json (1.15.0)
multipart-post (2.1.1)
mutex_m (default: 0.1.0)
net-pop (default: 0.1.0)
net-smtp (default: 0.1.0)
net-telnet (0.2.0)
netrc (0.11.0)
observer (default: 0.1.0)
oj (3.3.10)
open3 (default: 0.1.0)
opensearch-api (1.0.0)
opensearch-ruby (1.0.0)
opensearch-transport (1.0.0)
openssl (default: 2.1.3)
ostruct (default: 0.2.0)
power_assert (1.1.7)
prime (default: 0.1.1)
prometheus-client (4.0.0)
pstore (default: 0.1.0)
psych (default: 3.1.0)
public_suffix (4.0.7)
racc (default: 1.4.16)
rake (13.0.6, 13.0.1)
rdoc (default: 6.2.1.1)
readline (default: 0.0.2)
readline-ext (default: 0.1.0)
recursive-open-struct (1.1.3)
reline (default: 0.1.5)
rest-client (2.1.0)
rexml (default: 3.2.3.1)
rss (default: 0.2.8)
ruby-kafka (1.4.0)
ruby2_keywords (0.0.5)
rubygems-update (3.3.11)
sdbm (default: 1.0.0)
serverengine (2.2.5)
sigdump (0.2.4)
singleton (default: 0.1.0)
stringio (default: 0.1.0)
strptime (0.2.5)
strscan (default: 1.0.3)
systemd-journal (1.4.2)
test-unit (3.3.4)
timeout (default: 0.1.0)
tracer (default: 0.1.0)
tzinfo (2.0.4)
tzinfo-data (1.2022.1)
unf (0.1.4)
unf_ext (0.0.8.1)
uri (default: 0.10.0)
webrick (1.7.0, default: 1.6.1)
xmlrpc (0.3.0)
yajl-ruby (1.4.2)
yaml (default: 0.1.0)
zlib (default: 1.1.0)

sdwerwed avatar Jun 01 '22 11:06 sdwerwed

I am also seeing a memory leak. Here is our Gemfile.lock and here is a 30-day graph showing the memory leak over several daemonset restarts. Each distinct color is one instance of fluentd servicing a single node:

[Graph: 30 days of memory usage across daemonset restarts]

danielhoherd avatar Jun 24 '22 17:06 danielhoherd

After upgrading fluentd from v1.7.4 to v1.14.6, memory usage spiked from 400~500Mi to about 4Gi, and it keeps growing slowly but steadily day by day, even though there were NO configuration or workload changes. The following graph covers several hours.

[Graph: memory usage over several hours after the upgrade]

Plugins in use:

  • output

    • forward
    • cloudwatch_logs (0.14.2)
    • elasticsearch (5.2.2)
  • input

    • tail
    • systemd (1.0.5)
  • v1.14.6: https://github.com/ViaQ/logging-fluentd/blob/5bb1c5abbc8f29519e840eb06464c4af3ee59316/fluentd/Gemfile.lock

  • v1.7.4: https://github.com/ViaQ/logging-fluentd/blob/e62e4cac12768a041df6511ca1ac631192136655/fluentd/Gemfile.lock

Are there any changes between these two versions that could affect memory usage?

bysnupy avatar Jul 07 '22 11:07 bysnupy

@Mosibi I have the same issue. I don't know if you have checked this or not, but it seems to be a Kubernetes reporting issue: it reports the wrong memory metrics. If you shell into the pod and run free, I think you will see that the free memory is higher than what the Kubernetes metrics report. Kubernetes metrics include cache in memory usage, so maybe that's the issue.

madanrishi avatar Sep 12 '22 17:09 madanrishi

@madanrishi do you have a link to the kubernetes issue that you think this might be?

danielhoherd avatar Sep 12 '22 18:09 danielhoherd

Any update on this issue, guys? I am facing a similar issue.

seyal84 avatar Sep 20 '22 01:09 seyal84

@madanrishi do you have a link to the kubernetes issue that you think this might be?

@danielhoherd My bad on this. On further investigation this is not the case; I still have the same issue. A statefulset restart fixes it, but that's not an ideal resolution.

madanrishi avatar Sep 21 '22 16:09 madanrishi

Hi,

We have a memory leak in the following situation:

  • log_level is set to default
  • the tail plugin is picking up a lot of lines that do not match the configured format (our format is set to json)

Changing the log_level so that non-matching pattern messages are not logged fixes the memory leak (e.g. setting log_level to error).
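For reference, the same effect can be scoped to a single plugin with @log_level instead of the global log_level (a sketch; the path, pos_file, and tag are placeholders):

    <source>
      @type tail
      @log_level error
      # placeholder path/pos_file/tag, adjust to your environment
      path /var/log/app/*.log
      pos_file /var/log/td-agent/app.log.pos
      tag app.*
      <parse>
        @type json
      </parse>
    </source>

With @log_level error, the per-line "pattern not matched" warnings are suppressed for this source only, without silencing warnings from the rest of the pipeline.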

Environment:

  • Kernel ranging from 3.10 to 5.16
  • CentOS Linux release 7.9.2009 (Core)
  • td-agent 4.4.2 fluentd 1.15.3 (e89092ce1132a933c12bb23fe8c9323c07ca81f5)

pbrzica avatar Jan 13 '23 09:01 pbrzica

Facing the same issue, observed with the cloudwatch_logs, log_analytics, logzio (0.0.22), and slack plugins: memory consumption of each process keeps rising from a few hundred megabytes to several gigabytes over a few days.

td-agent 4.4.1 fluentd 1.13.3 (https://github.com/fluent/fluentd/commit/c32842297ed2c306f1b841a8f6e55bdd0f1cb27f)

yangjiel avatar Apr 28 '23 01:04 yangjiel