Fluentd workers SIGSEGV for specific Windows events channels (security and powershell events)
Hi team,
I am experiencing regular crashes from some ruby workers collecting windows events using windows event forwarding and custom event channels.
I am seeing worker 2 (WEC_Security-OS-Security-Low-3 - Specific security event IDs) and 4 (WEC_AppAndServices-OS-PowerShell - Powershell events) and worker crashing in a loop:
2024-06-05 15:09:47 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:09:56 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:01 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:09 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:18 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:20 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:28 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:30 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:40 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:43 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:57 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:11:01 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
I suspect the windows_eventlog2 plugin to reach a limit somewhere but it's not correlated to the number of events, as neither Security low 3 nor Powershell have the biggest number of events accross all other channels.
Here is the fluentd configuration I am working with:
<system>
log_level info
workers 7
# restart_worker_interval 1s
<log>
rotate_size 1048576
</log>
</system>
<worker 0>
<source>
# @log_level trace
@type windows_eventlog2
@id windows_eventlog2 #id must be unique
channels WEC_AppAndServices-OS-AppLocker,WEC_AppAndServices-OS-DNSServer,WEC_AppAndServices-OS-RemoteAccess,WEC_AppAndServices-OS-ADFS,WEC_AppAndServices-OS-Application,WEC_Security-OS-Security-Medium
read_existing_events false
tag windows_event_logs
rate_limit 1000
read_interval 1s
render_as_xml true
<storage>
persistent false #in memory storage
</storage>
<parse>
@type none
</parse>
</source>
<filter windows_event_logs>
@type record_transformer
enable_ruby
<record>
message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
</record>
</filter>
<match windows_event_logs>
@type forward
send_timeout 60s
recover_wait 10s
hard_timeout 60s
keepalive true
keepalive_timeout 180
#compress gzip
<server>
name device_name # Set Name
host x.x.x.x of the provisioned for the client
port 10000
weight 60
</server>
<buffer>
@type memory
flush_mode immediate #no flush_interval if mode is immediate
chunk_limit_size 512m
flush_thread_count 16
total_limit_size 4GB #max total RAM allocation
</buffer>
</match>
</worker>
<worker 1>
<source>
# @log_level trace
@type windows_eventlog2
@id windows_eventlog21 #id must be unique
channels WEC_Security-OS-Security-Low-2
read_existing_events false
tag windows_event_logs
rate_limit 100
read_interval 1s
render_as_xml true
<storage>
persistent false
</storage>
<parse>
@type none
</parse>
</source>
<filter windows_event_logs>
@type record_transformer
enable_ruby
<record>
message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
</record>
</filter>
<match windows_event_logs>
@type forward
send_timeout 60s
recover_wait 10s
hard_timeout 60s
keepalive true
keepalive_timeout 180
#compress gzip
<server>
name device_name # Set Name
host x.x.x.x of the provisioned for the client
port 10000
weight 60
</server>
<buffer>
@type memory
flush_mode immediate #no flush_interval if mode is immediate
chunk_limit_size 512m
flush_thread_count 16
total_limit_size 4GB #max total RAM allocation
</buffer>
</match>
</worker>
<worker 2>
<source>
# @log_level trace
@type windows_eventlog2
@id windows_eventlog22 #id must be unique
channels WEC_Security-OS-Security-Low-3
read_existing_events false
tag windows_event_logs
rate_limit 50
read_interval 1s
render_as_xml true
<storage>
persistent false
</storage>
<parse>
@type none
</parse>
</source>
<filter windows_event_logs>
@type record_transformer
enable_ruby
<record>
message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
</record>
</filter>
<match windows_event_logs>
@type forward
send_timeout 60s
recover_wait 10s
hard_timeout 60s
keepalive true
keepalive_timeout 180
#compress gzip
<server>
name device_name # Set Name
host x.x.x.x of the provisioned for the client
port 10000
weight 60
</server>
<buffer>
@type memory
flush_mode immediate #no flush_interval if mode is immediate
chunk_limit_size 512m
flush_thread_count 16
total_limit_size 4GB #max total RAM allocation
</buffer>
</match>
</worker>
<worker 3>
<source>
@type windows_eventlog2
@id windows_eventlog23 #id must be unique
channels WEC_Security-OS-Security-High,WEC_System-OS-System,WEC_AppAndServices-OS-MicrosoftDefender
read_existing_events false
tag windows_event_logs
rate_limit 1000
read_interval 1s
render_as_xml true
<storage>
persistent false
</storage>
<parse>
@type none
</parse>
</source>
<filter windows_event_logs>
@type record_transformer
enable_ruby
<record>
message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
</record>
</filter>
<match windows_event_logs>
@type forward
send_timeout 60s
recover_wait 10s
hard_timeout 60s
keepalive true
keepalive_timeout 180
#compress gzip
<server>
name device_name # Set Name
host x.x.x.x of the provisioned for the client
port 10000
weight 60
</server>
<buffer>
@type memory
flush_mode immediate #no flush_interval if mode is immediate
chunk_limit_size 512m
flush_thread_count 16
total_limit_size 4GB #max total RAM allocation
</buffer>
</match>
</worker>
<worker 4>
<source>
@type windows_eventlog2
@id windows_eventlog24 #id must be unique
channels WEC_AppAndServices-OS-PowerShell
read_existing_events false
tag windows_event_logs
rate_limit 50
read_interval 1s
render_as_xml true
<storage>
persistent false
</storage>
<parse>
@type none
</parse>
</source>
<filter windows_event_logs>
@type record_transformer
enable_ruby
<record>
message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
</record>
</filter>
<match windows_event_logs>
@type forward
send_timeout 60s
recover_wait 10s
hard_timeout 60s
keepalive true
keepalive_timeout 180
#compress gzip
<server>
name device_name # Set Name
host x.x.x.x of the provisioned for the client
port 10000
weight 60
</server>
<buffer>
@type memory
flush_mode immediate #no flush_interval if mode is immediate
chunk_limit_size 512m
flush_thread_count 16
total_limit_size 4GB #max total RAM allocation
</buffer>
</match>
</worker>
<worker 5>
<source>
@type windows_eventlog2
@id windows_eventlog25 #id must be unique
channels WEC_Security-OS-Security-Low-1
read_existing_events false
tag windows_event_logs
rate_limit 50
read_interval 1s
render_as_xml true
<storage>
persistent false
</storage>
<parse>
@type none
</parse>
</source>
<filter windows_event_logs>
@type record_transformer
enable_ruby
<record>
message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
</record>
</filter>
<match windows_event_logs>
@type forward
send_timeout 60s
recover_wait 10s
hard_timeout 60s
keepalive true
keepalive_timeout 180
#compress gzip
<server>
name device_name
host x.x.x.x
port 10000
weight 60
</server>
<buffer>
@type memory
flush_mode immediate #no flush_interval if mode is immediate
chunk_limit_size 512m
flush_thread_count 16
total_limit_size 4GB #max total RAM allocation
</buffer>
</match>
</worker>
Could you help ?
Cheers,
Ivs
Hmm, I wonder what could be the cause... I know several cases that work without these problems, so maybe there is some unique cause. Could you please tell me the following info?
- The info about Fluentd (fluent-package? td-agent? Docker container?).
- The gem version of
fluent-plugin-windows-eventlogandwinevt_c. - Are there any resource problems such as memory exhaustion?
Hi daipom,
Thanks for the quick answer. Sure, here is the information:
- The info about Fluentd (fluent-package? td-agent? Docker container?). fluent-package-5.0.3-x64.msi
- The gem version of fluent-plugin-windows-eventlog and winevt_c. fluent-plugin-windows-eventlog (0.8.3) winevt_c (0.10.1)
- Are there any resource problems such as memory exhaustion? The Windows Server 2022 azure VM has 4 vCPU and 16 GB RAM, the average CPU usage is 20% and the average RAM usage is 4.6 GB. There is plenty of disk space for the system (84.4 GB free) and the other drive where fluentd is installed (245 GB free). There doesn't seem to be any IO bottlenecks on the disks.
Thanks! I don't see anything wrong with the environment...
I am experiencing regular crashes
So, there's some reproducibility. We need to narrow down the cause.
Is it caused by a specific event log? (Like, it contains characters that are not supposed...)
Is there a particular setting we can change to avoid this issue?
(Such as render_as_xml)
So I have set therender_as_xml to false for workers 2 and 4 but I see no improvements (I had to remove the record transformer section also) :
<worker 2>
<source>
# @log_level trace
@type windows_eventlog2
@id windows_eventlog22 #id must be unique
channels WEC_Security-OS-Security-Low-3
read_existing_events false
tag windows_event_logs
rate_limit 50
read_interval 1s
render_as_xml false
<storage>
persistent false
</storage>
<parse>
@type none
</parse>
</source>
# <filter windows_event_logs>
# @type record_transformer
# enable_ruby
# <record>
# message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
# </record>
# </filter>
<match windows_event_logs>
@type forward
send_timeout 60s
recover_wait 10s
hard_timeout 60s
keepalive true
keepalive_timeout 180
#compress gzip
<server>
name <hostname># Set FPE Name
host x.x.x.x
port 10000
weight 60
</server>
<buffer>
@type memory
flush_mode immediate #no flush_interval if mode is immediate
chunk_limit_size 512m
flush_thread_count 16
total_limit_size 4GB #max total RAM allocation
</buffer>
</match>
</worker>
I still see the workers crashing:
2024-06-10 14:23:36 +0000 [info]: spawn command to main: cmdline=["E:/opt/fluent/bin/ruby.exe", "-Eascii-8bit:ascii-8bit", "E:/opt/fluent/bin/fluentd", "-c", "E:\\opt\\fluent\\etc\\fluent\\fluentd.conf", "-o", "E:\\opt\\fluent\\fluentd.log", "-x", "fluentdwinsvc", "--under-supervisor"]
2024-06-10 14:24:03 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:03 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:15 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:22 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:24 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:49 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:51 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:02 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:02 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:13 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:22 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:23 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:32 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:44 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:57 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:07 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:09 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:21 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:35 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:48 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:58 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:27:00 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
Is it caused by a specific event log?
I couldn't pinpoint any specific windows event. Worker 2 and 4 are configured to read .evtx files containing specific subset of security and powershell events.
The WEC_Security-OS-Security-Low-3 evtx has a majority of 4627 "Group Membership" events. The WEC_AppAndServices-OS-PowerShell evtx file has majority of event id 400 and 600 events.
I see no particular event linked to the frequency of the worker crashes. I can try suppressing events from the subscriptions one by one but perhaps you have other ideas before I go this way :) (this will be tedious).
I think this issue might be related to https://github.com/fluent-plugins-nursery/winevt_c/pull/52 (However, I haven't been able to reproduce this problem, so I don't know if it can fix this issue.)
I think this issue might be related to fluent-plugins-nursery/winevt_c#52 (However, I haven't been able to reproduce this problem, so I don't know if it can fix this issue.)
fluent-package v5.0.8 or v6.0.0 will have this fix. They will be released this summer. Please try them. If this issue is not resolved, please tell us again! Thanks!