Ruby uplift 2.5 -> 3.3 makes Fluentd startup 60x slower
Describe the bug
I am running fluentd in a docker container, and noticed that when I uplift ruby from version 2.5 (provided by the OS) to version 3.3 (built by me), then the startup time of fluentd increases significantly (from <1 second to multiple minutes).
The main bottleneck is the CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19 normali+ 20 0 479604 20616 8088 R 99.45 0.031 0:17.68 ruby /usr/local/bin/fluentd -vv -c /fluentd/fluent.conf
and then later:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
156 normali+ 20 0 487152 27128 8436 R 99.67 0.041 0:43.26 /usr/local/bin/ruby -Eascii-8bit:ascii-8bit /usr/local/bin/fluentd -vv -c /fluentd/fluent.conf --under-supervisor
154 normali+ 20 0 487232 27504 8552 R 98.34 0.042 0:43.53 /usr/local/bin/ruby -Eascii-8bit:ascii-8bit /usr/local/bin/fluentd -vv -c /fluentd/fluent.conf --under-supervisor
The high CPU load only stops once the fluentd worker is now running worker messages appear.
I ran an strace, and found that calls like the following happen hundreds of times each second:
2024-06-27T15:09:47.844482218Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49017062}) = 0
2024-06-27T15:09:47.844783480Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49024759}) = 0
2024-06-27T15:09:47.845002115Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49170943}) = 0
2024-06-27T15:09:47.845227095Z clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49231414}) = 0
To Reproduce
I was unable to reproduce the issue outside my environment, even when I tried to build ruby the same way and install the same plugins. But for me it happens very reliably, each time I start the container. I have the full strace output, but it's 260'000 lines long, and 240'000 of it are just those clock_gettime() calls.
Your Environment
- Fluentd version: 1.15.3
- Operating system: SLES 15 SP5
- Kernel version: 5.15.0-105-generic
Note: Uplifting fluentd to v1.17.0 does not help.
Gemfile:
gem 'fluentd', '1.15.3'
gem 'fluent-plugin-rewrite-tag-filter', '2.4.0'
gem 'fluent-plugin-grok-parser', '2.6.2'
gem 'fluent-plugin-kafka', '0.15.3'
gem 'fluent-plugin-record-modifier', '2.1.0'
gem 'fluent-plugin-prometheus', '2.0.2'
gem 'fluent-plugin-opensearch', '1.0.4'
gem 'fluent-plugin-flatten-hash', '0.5.1'
gem 'oj', '3.12.2'
Your Configuration
I am deliberately running a simple - basically empty - config, and even with this the startup takes a very long time. It's dumped in the following error log. It also takes ~1 minute to simply execute fluentd --version.
Find the fluentd trace logs of the startup below. There is almost an entire minute between the first line (the fluentd command being issued) and the first log message from fluentd:
2024-06-27T16:09:52.030112275Z + fluentd -vv -c /fluentd/fluent.conf
2024-06-27T16:10:52.076792835Z 2024-06-27 16:10:52 +0000 [info]: fluent/log.rb:330:info: init supervisor logger path=nil rotate_age=nil rotate_size=nil
2024-06-27T16:10:52.214539070Z 2024-06-27 16:10:52 +0000 [info]: fluent/log.rb:330:info: parsing config file is succeeded path="/fluentd/fluent.conf"
2024-06-27T16:10:57.101598357Z 2024-06-27 16:10:57 +0000 [trace]: fluent/log.rb:287:trace: registered input plugin 'monitor_agent'
2024-06-27T16:10:58.726205882Z 2024-06-27 16:10:58 +0000 [trace]: fluent/log.rb:287:trace: registered metrics plugin 'local'
2024-06-27T16:10:58.739464606Z 2024-06-27 16:10:58 +0000 [debug]: fluent/log.rb:309:debug: No fluent logger for internal event
2024-06-27T16:10:58.752734175Z 2024-06-27 16:10:58 +0000 [info]: fluent/log.rb:330:info: using configuration file: <ROOT>
2024-06-27T16:10:58.753023723Z <system>
2024-06-27T16:10:58.753036875Z workers 2
2024-06-27T16:10:58.753040772Z ignore_repeated_log_interval 60s
2024-06-27T16:10:58.753128642Z ignore_same_log_interval 60s
2024-06-27T16:10:58.753519418Z </system>
2024-06-27T16:10:58.753529798Z <source>
2024-06-27T16:10:58.753534386Z @type monitor_agent
2024-06-27T16:10:58.753538031Z bind "0.0.0.0"
2024-06-27T16:10:58.753542108Z port 24220
2024-06-27T16:10:58.753644256Z </source>
2024-06-27T16:10:58.755709184Z </ROOT>
2024-06-27T16:10:58.759459327Z 2024-06-27 16:10:58 +0000 [info]: fluent/log.rb:330:info: starting fluentd-1.15.3 pid=19 ruby="3.3.1"
2024-06-27T16:10:58.777075542Z 2024-06-27 16:10:58 +0000 [info]: fluent/log.rb:330:info: spawn command to main: cmdline=["/usr/local/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/local/bin/fluentd", "-vv", "-c", "/fluentd/fluent.conf", "--under-supervisor"]
2024-06-27T16:10:58.778966486Z 2024-06-27 16:10:58 +0000 [info]: fluent/log.rb:330:info: init supervisor logger path=nil rotate_age=nil rotate_size=nil
2024-06-27T16:12:00.121158112Z 2024-06-27 16:12:00 +0000 [info]: #0 fluent/log.rb:330:info: init worker0 logger path=nil rotate_age=nil rotate_size=nil
2024-06-27T16:12:00.768508406Z 2024-06-27 16:12:00 +0000 [info]: fluent/log.rb:330:info: adding source type="monitor_agent"
2024-06-27T16:12:01.271596642Z 2024-06-27 16:12:01 +0000 [info]: #1 fluent/log.rb:330:info: init workers logger path=nil rotate_age=nil rotate_size=nil
2024-06-27T16:12:07.967666275Z 2024-06-27 16:12:07 +0000 [trace]: #0 fluent/log.rb:287:trace: registered input plugin 'monitor_agent'
2024-06-27T16:12:08.765010604Z 2024-06-27 16:12:08 +0000 [trace]: #1 fluent/log.rb:287:trace: registered input plugin 'monitor_agent'
2024-06-27T16:12:14.205541864Z 2024-06-27 16:12:14 +0000 [trace]: #0 fluent/log.rb:287:trace: registered metrics plugin 'local'
2024-06-27T16:12:14.246520817Z 2024-06-27 16:12:14 +0000 [debug]: #0 fluent/log.rb:309:debug: No fluent logger for internal event
2024-06-27T16:12:14.246938159Z 2024-06-27 16:12:14 +0000 [info]: #0 fluent/log.rb:330:info: starting fluentd worker pid=154 ppid=19 worker=0
2024-06-27T16:12:14.247165212Z 2024-06-27 16:12:14 +0000 [debug]: #0 fluent/log.rb:309:debug: listening monitoring http server on http://0.0.0.0:24220/api/plugins for worker0
2024-06-27T16:12:14.249364282Z 2024-06-27 16:12:14 +0000 [debug]: #0 fluent/log.rb:309:debug: Start webrick HTTP server listening
2024-06-27T16:12:14.261697759Z 2024-06-27 16:12:14 +0000 [info]: #0 fluent/log.rb:330:info: fluentd worker is now running worker=0
2024-06-27T16:12:15.227378404Z 2024-06-27 16:12:15 +0000 [trace]: #1 fluent/log.rb:287:trace: registered metrics plugin 'local'
2024-06-27T16:12:15.234665606Z 2024-06-27 16:12:15 +0000 [debug]: #1 fluent/log.rb:309:debug: No fluent logger for internal event
2024-06-27T16:12:15.234926686Z 2024-06-27 16:12:15 +0000 [info]: #1 fluent/log.rb:330:info: starting fluentd worker pid=156 ppid=19 worker=1
2024-06-27T16:12:15.235206925Z 2024-06-27 16:12:15 +0000 [debug]: #1 fluent/log.rb:309:debug: listening monitoring http server on http://0.0.0.0:24221/api/plugins for worker1
2024-06-27T16:12:15.241959806Z 2024-06-27 16:12:15 +0000 [debug]: #1 fluent/log.rb:309:debug: Start webrick HTTP server listening
2024-06-27T16:12:15.247309553Z 2024-06-27 16:12:15 +0000 [info]: #1 fluent/log.rb:330:info: fluentd worker is now running worker=1
Your Error Log
Usually I don't get an error log, because things eventually start working correctly, it just takes a very long time...
When I hit Ctrl+C during fluentd startup (when it's taking 1 minute to load), I usually get a stacktrace that's very similar to this:
<internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require': Interrupt
from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
from /usr/local/bundle/ruby/3.3.0/gems/fluentd-1.15.3/lib/fluent/supervisor.rb:18:in `<top (required)>'
from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
from /usr/local/bundle/ruby/3.3.0/gems/fluentd-1.15.3/lib/fluent/command/fluentd.rb:19:in `<top (required)>'
from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
from <internal:/usr/local/lib/ruby/3.3.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
from /usr/local/bundle/ruby/3.3.0/gems/fluentd-1.15.3/bin/fluentd:15:in `<top (required)>'
from /usr/local/bin/fluentd:14:in `load'
from /usr/local/bin/fluentd:14:in `<main>'
Additional context
Interestingly, the first log line from fluentd always appears almost exactly 1 minute after the command was issued, it might not be a coincidence.
But even after that, the load still takes very long and the CPU usage is ~100%. For example it takes 5-10 minutes to load my actual configuration, which has a lot of rules and uses multiple plugins like Prometheus, Kafka, OpenSearch. This took only 1-2 seconds before the ruby uplift.