agent
agent copied to clipboard
Grafana Agent on Windows don't send part of logs
What's wrong?
Agent works on domain controllers and parse a lot of events in security logs.
For example:
Filter for logs
Result in event viewer
Result in loki
I don't see any error.
Steps to reproduce
Install agent to domain controllers. Enable various audit event. Run Agent.
System information
Windows server 2019
Software version
agent, version v0.39.2 (branch: HEAD, revision: deae3f86)
Configuration
########################################################
- job_name: ad
windows_events:
bookmark_path: "C:\\Program Files\\Grafana Agent\\bookmarks\\ad_bookmark.xml"
use_incoming_timestamp: true
eventlog_name: "Security"
# Filter for logs
xpath_query: '*'
poll_interval: 1s
relabel_configs:
- source_labels: ['channel']
target_label: 'log_type'
pipeline_stages:
- template:
source: computer
template: '{{ ToLower .Value }}'
- regex:
source: computer
expression: "^(?P<hostname>\\S+?)\\."
- template:
source: hostname
template: '{{ ToLower .Value }}'
- template:
source: log_type
template: '{{ ToLower .Value }}'
- json:
expressions:
event_id:
event_data:
- template:
source: filter
template: 'notfiltered'
- labels:
event_id:
hostname:
log_type:
- match:
selector: '{event_id = "4625"}'
stages:
- regex:
source: event_data
expression: "Name='SubjectUserName'>(?P<subjectusername>.+?)<.+Name='TargetUserName'>(?P<targetusername>.+?)<.+Name='FailureReason'>(?P<failurereason>.+?)<.+Name='LogonType'>(?P<logontype>.+?)<.+Name='WorkstationName'>(?P<workstationname>.+?)<.+Name='IpAddress'>(?P<ipaddress>.+?)<.+"
- regex:
source:
expression: "Failure Reason:....(?P<failurereasontext>.+?)\\."
- template:
source: filter
template: 'filtered'
- pack:
labels:
- subjectusername
- targetusername
- failurereason
- logontype
- workstationname
- ipaddress
- failurereasontext
- template:
source: service
template: 'ad'
- labels:
service:
- match:
selector: '{event_id = "4624"}'
stages:
- regex:
source: event_data
expression: "Name='SubjectUserName'>(?P<subjectusername>.+?)<.+Name='TargetUserName'>(?P<targetusername>.+?)<.+Name='LogonType'>(?P<logontype>.+?)<.+Name='WorkstationName'>(?P<workstationname>.+?)<.+Name='IpAddress'>(?P<ipaddress>.+?)<.+"
- template:
source: filter
template: 'filtered'
- labels:
service: ad
- pack:
labels:
- subjectusername
- targetusername
- logontype
- workstationname
- ipaddress
- template:
source: service
template: 'ad'
- labels:
service:
- match:
selector: '{event_id = "4740"}'
stages:
- regex:
source: event_data
expression: "Name='TargetUserName'>(?P<targetusername>.+?)<.+Name='TargetDomainName'>(?P<targetdomainname>.+?)<"
- template:
source: filter
template: 'filtered'
- pack:
labels:
- targetusername
- targetdomainname
- template:
source: service
template: 'ad'
- labels:
service:
#NPS
- match:
selector: '{event_id =~ "627(2|3)"}'
stages:
- regex:
source: event_data
expression: "Name='SubjectUserName'>(?P<subjectusername>.+?)<.+Name='CallingStationID'>(?P<callingstationid>.+?)<.+Name='ClientIPAddress'>(?P<clientipaddress>.+?)<.+Name='NetworkPolicyName'>(?P<networkpolicyname>.+?)<"
- regex:
source: event_data
expression: "Name='ClientName'>(?P<clientname>.+?)<.+"
- match:
selector: '{event_id ="6272"}'
stages:
- template:
source: status
template: 'Network Policy Server granted access to a user'
- match:
selector: '{event_id="6273"}'
stages:
- regex:
source: event_data
expression: "Name='Reason'>(?P<reason>.+?)<"
- template:
source: status
template: 'Network Policy Server denied access to a user'
- template:
source: filter
template: 'filtered'
- pack:
labels:
- event_id
- clientipaddress
- networkpolicyname
- subjectusername
- callingstationid
- clientname
- clientipaddress
- reason
- status
- template:
source: service
template: 'nps'
- labels:
service:
#kerberos
- match:
selector: '{event_id =~ "47(68|69|70|71|72|73)"}'
stages:
- template:
source: filter
template: 'filtered'
- pack:
labels:
- event_id
- template:
source: service
template: 'kerberos'
- labels:
service:
- labels:
filter:
#other logs to ad service
- match:
selector: '{filter = "notfiltered"}'
stages:
- template:
source: service
template: 'ad'
- labels:
service:
- labelallow:
- event_id
- service
- log_type
- hostname
########################################################
Logs
No response
Hello, thank you for submitting a bug report! Did the Agent run successfully with this configuration on the domain controller before? Or did it stop sending logs only recently?
In addition, are you sure that the pipeline_stages
are correct? It might be worth simplifying them a bit to see if any logs are getting through.
Finally, it might be worth taking a look at localhost:12345/metrics
, but I'm not sure if there are (useful) metrics for windows events.
Hello. Sometimes events appear in loki. I selected the query from the label selector. This configuration works great, I just recently found that some events are missing.
UPD: localhost:12345/metrics did't find anything that can help.
It seems like this issue, along with grafana/alloy#432 and grafana/alloy#433 might be caused by too many messages coming in from the event log. Would it be possible to limit the number of incoming messages by filtering some of them out via xpath_query
? Another thing to consider is running multiple agents, each consuming a different type of windows event log?
Hello. Actually usually i use filter and xpath_query. But I just discovered this problem using filters. Multiple agents are not a good idea. I used fluentd before and migrated from it precisely because I had to create several agents. How can this problem be diagnosed? Are there any known limits on the number of logs?
How can this problem be diagnosed?
If you are using the Flow mode, you could use loki.echo after the loki processing stage so that you can see the logs coming out. You could also add an extra processing stage so that you only echo logs from 4662. This would help in knowing whether the problem is due to the writing to Loki, due to reading from the Event Log, or due to processing.
Are there any known limits on the number of logs?
I am personally not aware of any.
I'm using static mode. Maybe this will can be used https://grafana.com/docs/loki/latest/send-data/promtail/stages/metrics/ ?
Yes, I think you could use the metrics
stage in this case. It would be useful to have a counter metric for the log lines which you believe are missing. If the counter is bigger than 0, then this would indicate that the Agent receives those log lines and processes them, but somehow isn't sending them.
By the way, the OTel Collector has a windows events receiver and a loki exporter. The events receiver is labeled as "alpha", so it must be still experimental. If you are interested, you could try it out and see if you get the same issues as in grafana/agent#6344, grafana/alloy#432 and grafana/alloy#433. If you find the OTel component to be more stable, we could add it as an otelcol
Flow component.
I switched to grafana agent flow. I converted config.
Performance leaves much to be desired.
There are a lot of events. But the agent seems unable to cope. I don't see any problems on the Loki side.
After various tests, unfortunately, I came to the conclusion that the problem is in the performance of virtualization and storage. Thanks, I closed the bug.