fluentd icon indicating copy to clipboard operation
fluentd copied to clipboard

Ruby uplift 2.5 -> 3.3 makes Fluentd startup 60x slower

Open dinatamas opened this issue 1 year ago • 0 comments

Describe the bug

I am running fluentd in a docker container, and noticed that when I uplift ruby from version 2.5 (provided by the OS) to version 3.3 (built by me), then the startup time of fluentd increases significantly (from <1 second to multiple minutes).

The main bottleneck is the CPU:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 19 normali+  20   0  479604  20616   8088 R 99.45 0.031   0:17.68 ruby /usr/local/bin/fluentd -vv -c /fluentd/fluent.conf

and then later:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
156 normali+  20   0  487152  27128   8436 R 99.67 0.041   0:43.26 /usr/local/bin/ruby -Eascii-8bit:ascii-8bit /usr/local/bin/fluentd -vv -c /fluentd/fluent.conf --under-supervisor
154 normali+  20   0  487232  27504   8552 R 98.34 0.042   0:43.53 /usr/local/bin/ruby -Eascii-8bit:ascii-8bit /usr/local/bin/fluentd -vv -c /fluentd/fluent.conf --under-supervisor

The high CPU load only stops once the fluentd worker is now running worker messages appear.

I ran an strace, and found that calls like the following happen hundreds of times each second:

2024-06-27T15:09:47.844482218Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49017062}) = 0
2024-06-27T15:09:47.844783480Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49024759}) = 0
2024-06-27T15:09:47.845002115Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49170943}) = 0
2024-06-27T15:09:47.845227095Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49231414}) = 0

To Reproduce

I was unable to reproduce the issue outside my environment, even when I tried to build ruby the same way and install the same plugins. But for me it happens very reliably, each time I start the container. I have the full strace output, but it's 260'000 lines long, and 240'000 of it are just those clock_gettime() calls.

Your Environment

  • Fluentd version: 1.15.3
  • Operating system: SLES 15 SP5
  • Kernel version: 5.15.0-105-generic

Note: Uplifting fluentd to v1.17.0 does not help.

Gemfile:

gem 'fluentd', '1.15.3'
gem 'fluent-plugin-rewrite-tag-filter', '2.4.0'
gem 'fluent-plugin-grok-parser', '2.6.2'
gem 'fluent-plugin-kafka', '0.15.3'
gem 'fluent-plugin-record-modifier', '2.1.0'
gem 'fluent-plugin-prometheus', '2.0.2'
gem 'fluent-plugin-opensearch', '1.0.4'
gem 'fluent-plugin-flatten-hash', '0.5.1'
gem 'oj', '3.12.2'

Your Configuration

I am deliberately running a simple - basically empty - config, and even with this the startup takes a very long time. It's dumped in the following error log. It also takes ~1 minute to simply execute fluentd --version.

Find the fluentd trace logs of the startup below. There is almost an entire minute between the first line (the fluentd command being issued) and the first log message from fluentd:

2024-06-27T16:09:52.030112275Z + fluentd -vv -c /fluentd/fluent.conf
2024-06-27T16:10:52.076792835Z 2024-06-27 16:10:52 +0000 [info]: fluent/log.rb:330:info: init supervisor logger path=nil rotate_age=nil rotate_size=nil
2024-06-27T16:10:52.214539070Z 2024-06-27 16:10:52 +0000 [info]: fluent/log.rb:330:info: parsing config file is succeeded path="/fluentd/fluent.conf"
2024-06-27T16:10:57.101598357Z 2024-06-27 16:10:57 +0000 [trace]: fluent/log.rb:287:trace: registered input plugin 'monitor_agent'
2024-06-27T16:10:58.726205882Z 2024-06-27 16:10:58 +0000 [trace]: fluent/log.rb:287:trace: registered metrics plugin 'local'
2024-06-27T16:10:58.739464606Z 2024-06-27 16:10:58 +0000 [debug]: fluent/log.rb:309:debug: No fluent logger for internal event
2024-06-27T16:10:58.752734175Z 2024-06-27 16:10:58 +0000 [info]: fluent/log.rb:330:info: using configuration file: <ROOT>
2024-06-27T16:10:58.753023723Z   <system>
2024-06-27T16:10:58.753036875Z     workers 2
2024-06-27T16:10:58.753040772Z     ignore_repeated_log_interval 60s
2024-06-27T16:10:58.753128642Z     ignore_same_log_interval 60s
2024-06-27T16:10:58.753519418Z   </system>
2024-06-27T16:10:58.753529798Z   <source>
2024-06-27T16:10:58.753534386Z     @type monitor_agent
2024-06-27T16:10:58.753538031Z     bind "0.0.0.0"
2024-06-27T16:10:58.753542108Z     port 24220
2024-06-27T16:10:58.753644256Z   </source>
2024-06-27T16:10:58.755709184Z </ROOT>
2024-06-27T16:10:58.759459327Z 2024-06-27 16:10:58 +0000 [info]: fluent/log.rb:330:info: starting fluentd-1.15.3 pid=19 ruby="3.3.1"
2024-06-27T16:10:58.777075542Z 2024-06-27 16:10:58 +0000 [info]: fluent/log.rb:330:info: spawn command to main:  cmdline=["/usr/local/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/local/bin/fluentd", "-vv", "-c", "/fluentd/fluent.conf", "--under-supervisor"]
2024-06-27T16:10:58.778966486Z 2024-06-27 16:10:58 +0000 [info]: fluent/log.rb:330:info: init supervisor logger path=nil rotate_age=nil rotate_size=nil
2024-06-27T16:12:00.121158112Z 2024-06-27 16:12:00 +0000 [info]: #0 fluent/log.rb:330:info: init worker0 logger path=nil rotate_age=nil rotate_size=nil
2024-06-27T16:12:00.768508406Z 2024-06-27 16:12:00 +0000 [info]: fluent/log.rb:330:info: adding source type="monitor_agent"
2024-06-27T16:12:01.271596642Z 2024-06-27 16:12:01 +0000 [info]: #1 fluent/log.rb:330:info: init workers logger path=nil rotate_age=nil rotate_size=nil
2024-06-27T16:12:07.967666275Z 2024-06-27 16:12:07 +0000 [trace]: #0 fluent/log.rb:287:trace: registered input plugin 'monitor_agent'
2024-06-27T16:12:08.765010604Z 2024-06-27 16:12:08 +0000 [trace]: #1 fluent/log.rb:287:trace: registered input plugin 'monitor_agent'
2024-06-27T16:12:14.205541864Z 2024-06-27 16:12:14 +0000 [trace]: #0 fluent/log.rb:287:trace: registered metrics plugin 'local'
2024-06-27T16:12:14.246520817Z 2024-06-27 16:12:14 +0000 [debug]: #0 fluent/log.rb:309:debug: No fluent logger for internal event
2024-06-27T16:12:14.246938159Z 2024-06-27 16:12:14 +0000 [info]: #0 fluent/log.rb:330:info: starting fluentd worker pid=154 ppid=19 worker=0
2024-06-27T16:12:14.247165212Z 2024-06-27 16:12:14 +0000 [debug]: #0 fluent/log.rb:309:debug: listening monitoring http server on http://0.0.0.0:24220/api/plugins for worker0
2024-06-27T16:12:14.249364282Z 2024-06-27 16:12:14 +0000 [debug]: #0 fluent/log.rb:309:debug: Start webrick HTTP server listening
2024-06-27T16:12:14.261697759Z 2024-06-27 16:12:14 +0000 [info]: #0 fluent/log.rb:330:info: fluentd worker is now running worker=0
2024-06-27T16:12:15.227378404Z 2024-06-27 16:12:15 +0000 [trace]: #1 fluent/log.rb:287:trace: registered metrics plugin 'local'
2024-06-27T16:12:15.234665606Z 2024-06-27 16:12:15 +0000 [debug]: #1 fluent/log.rb:309:debug: No fluent logger for internal event
2024-06-27T16:12:15.234926686Z 2024-06-27 16:12:15 +0000 [info]: #1 fluent/log.rb:330:info: starting fluentd worker pid=156 ppid=19 worker=1
2024-06-27T16:12:15.235206925Z 2024-06-27 16:12:15 +0000 [debug]: #1 fluent/log.rb:309:debug: listening monitoring http server on http://0.0.0.0:24221/api/plugins for worker1
2024-06-27T16:12:15.241959806Z 2024-06-27 16:12:15 +0000 [debug]: #1 fluent/log.rb:309:debug: Start webrick HTTP server listening
2024-06-27T16:12:15.247309553Z 2024-06-27 16:12:15 +0000 [info]: #1 fluent/log.rb:330:info: fluentd worker is now running worker=1

Your Error Log

Usually I don't get an error log, because things eventually start working correctly, it just takes a very long time...

When I hit Ctrl+C during fluentd startup (when it's taking 1 minute to load), I usually get a stacktrace that's very similar to this:

<internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require': Interrupt
        from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
        from /usr/local/bundle/ruby/3.3.0/gems/fluentd-1.15.3/lib/fluent/supervisor.rb:18:in `<top (required)>'
        from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
        from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
        from /usr/local/bundle/ruby/3.3.0/gems/fluentd-1.15.3/lib/fluent/command/fluentd.rb:19:in `<top (required)>'
        from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
        from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
        from /usr/local/bundle/ruby/3.3.0/gems/fluentd-1.15.3/bin/fluentd:15:in `<top (required)>'
        from /usr/local/bin/fluentd:14:in `load'
        from /usr/local/bin/fluentd:14:in `<main>'

Additional context

Interestingly, the first log line from fluentd always appears almost exactly 1 minute after the command was issued, it might not be a coincidence.

But even after that, the load still takes very long and the CPU usage is ~100%. For example it takes 5-10 minutes to load my actual configuration, which has a lot of rules and uses multiple plugins like Prometheus, Kafka, OpenSearch. This took only 1-2 seconds before the ruby uplift.

dinatamas avatar Jun 27 '24 16:06 dinatamas