fluentd icon indicating copy to clipboard operation
fluentd copied to clipboard

Race condition in signal handling

Open buzztroll opened this issue 11 months ago • 2 comments

Describe the bug

If the worker process receives a TERM signal between:

https://github.com/fluent/fluentd/blob/master/lib/fluent/supervisor.rb#L603

and

https://github.com/fluent/fluentd/blob/master/lib/fluent/supervisor.rb#L615

The signal will be lost and the process will not terminate. This comes up when using systemd if the fluentd is started and then stopped very soon after. In this case systemd will not send another signal to clean up for a configurable amount of time and the process will hang in a "deactivating" state.

To Reproduce

It is difficult to reproduce because it is a very tight race (however we see it in practice multiple times a month), but if you put a sleep statement here: https://github.com/fluent/fluentd/blob/master/lib/fluent/supervisor.rb#L604 and then send a TERM signal while it is in that sleep you will see it.

Expected behavior

I would expect the process to shutdown

Your Environment

- Fluentd version: v1.13.3 and v1.16.1
- TD Agent version:  N/A
- Operating system: Linux
- Kernel version: 5.15.0

Your Configuration

<system>

    log_level debug

  </system>

Your Error Log

023-07-06 20:32:45 +0000 [info]: starting fluentd-1.13.3 pid=45636 ruby="3.1.2"

2023-07-06 20:32:45 +0000 [info]: spawn command to main:  cmdline=["/opt/unified-monitoring-agent/embedded/bin/ruby", "-Eascii-8bit:ascii-8bit", "/opt/unified-monitoring-agent/embedded/bin/fluentd", "--log", "/var/log/unified-monitoring-agent/unified-monitoring-agent.log", "--log-rotate-size", "1048576", "--log-rotate-age", "10", "--under-supervisor"]

2023-07-06 20:32:46 +0000 [info]: #0 waiting to setup signals

2023-07-06 20:32:56 +0000 [info]: #0 setting up signals

2023-07-06 20:32:56 +0000 [info]: #0 setting up handlers

2023-07-06 20:32:56 +0000 [info]: #0 waiting after signal setup

2023-07-06 20:34:07 +0000 [info]: #0 done waiting

2023-07-06 20:34:07 +0000 [debug]: #0 No fluent logger for internal event

2023-07-06 20:34:07 +0000 [info]: #0 starting fluentd worker pid=45650 ppid=45636 worker=0

2023-07-06 20:34:07 +0000 [info]: #0 fluentd worker is now running worker=0

2023-07-06 20:39:11 +0000 [info]: fluent/log.rb:330:info: Received graceful stop

2023-07-06 20:39:11 +0000 [info]: fluent/log.rb:330:info: Received graceful stop

2023-07-06 20:39:12 +0000 [info]: #0 fluentd worker is now stopping worker=0

2023-07-06 20:39:12 +0000 [info]: #0 shutting down fluentd worker worker=0

2023-07-06 20:39:12 +0000 [info]: fluent/log.rb:330:info: Worker 0 finished with status 0

2023-07-06 20:39:14 +0000 [info]: parsing config file is succeeded path="/etc/unified-monitoring-agent/unified-monitoring-agent.conf"

2023-07-06 20:39:14 +0000 [info]: gem 'fluentd' version '1.13.3'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-cri-parser' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-nested-json-parser' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-netflowipfix' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-out-datamesh' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-common' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-common-public' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-grok-parser' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-in-agent-metrics' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-in-g1gc' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-in-openmetrics' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-in-openmetrics-k8s' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-metrics-formatter' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-out-logging' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-out-lumberjack' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-out-monitoring' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-out-objectstorage' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-oracle-telemetry-out-telemetry' version '1.0.20230222164221'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.4.0'

2023-07-06 20:39:14 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.3.0'

2023-07-06 20:39:14 +0000 [debug]: No fluent logger for internal event

2023-07-06 20:39:14 +0000 [info]: using configuration file: <ROOT>

  <system>

    log_level debug

  </system>

</ROOT>

2023-07-06 20:39:14 +0000 [info]: starting fluentd-1.13.3 pid=47009 ruby="3.1.2"

2023-07-06 20:39:14 +0000 [info]: spawn command to main:  cmdline=["/opt/unified-monitoring-agent/embedded/bin/ruby", "-Eascii-8bit:ascii-8bit", "/opt/unified-monitoring-agent/embedded/bin/fluentd", "--log", "/var/log/unified-monitoring-agent/unified-monitoring-agent.log", "--log-rotate-size", "1048576", "--log-rotate-age", "10", "--under-supervisor"]

2023-07-06 20:39:16 +0000 [info]: #0 waiting to setup signals

2023-07-06 20:39:26 +0000 [info]: #0 setting up signals

2023-07-06 20:39:26 +0000 [info]: #0 setting up handlers

2023-07-06 20:39:26 +0000 [info]: #0 waiting after signal setup

2023-07-06 20:39:32 +0000 [info]: fluent/log.rb:330:info: Received graceful stop

2023-07-06 20:39:32 +0000 [info]: fluent/log.rb:330:info: Received graceful stop

^C

[root@delete-me unified-monitoring-agent]# systemctl status unified-monitoring-agent | cat

● unified-monitoring-agent.service - unified-monitoring-agent: Fluentd based data collector for Oracle Cloud Infrastructure

   Loaded: loaded (/usr/lib/systemd/system/unified-monitoring-agent.service; enabled; vendor preset: disabled)

   Active: deactivating (stop-sigterm) since Thu 2023-07-06 20:39:32 GMT; 12s ago

     Docs: https://docs.cloud.oracle.com/

  Process: 47121 ExecStop=/bin/kill -TERM ${MAINPID} (code=exited, status=0/SUCCESS)

Main PID: 47009 (fluentd)

    Tasks: 4 (limit: 99872)

   Memory: 68.7M (limit: 5.0G)

   CGroup: /system.slice/unified-monitoring-agent.service

           ├─47009 /opt/unified-monitoring-agent/embedded/bin/ruby /opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.log --log-rotate-size 1048576 --log-rotate-age 10

           └─47016 /opt/unified-monitoring-agent/embedded/bin/ruby -Eascii-8bit:ascii-8bit /opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.log --log-rotate-size 1048576 --log-rotate-age 10 --under-supervisor

Additional context

No response

buzztroll avatar Jul 06 '23 20:07 buzztroll