agent icon indicating copy to clipboard operation
agent copied to clipboard

Grafana Agent on Windows don't send part of logs

Open PsySuck opened this issue 1 year ago • 7 comments

What's wrong?

Agent works on domain controllers and parse a lot of events in security logs. For example: Filter for logs Screenshot 2024-02-09 at 15 11 26 Result in event viewer Screenshot 2024-02-09 at 15 10 37 Result in loki Screenshot 2024-02-09 at 15 15 41

I don't see any error.

Steps to reproduce

Install agent to domain controllers. Enable various audit event. Run Agent.

System information

Windows server 2019

Software version

agent, version v0.39.2 (branch: HEAD, revision: deae3f86)

Configuration

########################################################
      - job_name: ad
        windows_events:
          bookmark_path: "C:\\Program Files\\Grafana Agent\\bookmarks\\ad_bookmark.xml"
          use_incoming_timestamp: true
          eventlog_name: "Security"
          # Filter for logs
          xpath_query: '*'
          poll_interval: 1s
        
        relabel_configs:
          - source_labels: ['channel']
            target_label: 'log_type'
            
        pipeline_stages:
          - template:
              source: computer
              template: '{{ ToLower .Value }}'
              
          - regex:
              source: computer
              expression: "^(?P<hostname>\\S+?)\\."

          - template:
              source: hostname
              template: '{{ ToLower .Value }}'
          
          - template:
              source: log_type
              template: '{{ ToLower .Value }}'
      
          - json:
              expressions:
                event_id:
                event_data:
                
          - template:
              source: filter
              template: 'notfiltered'
              
          - labels:
              event_id:
              hostname:
              log_type:

          - match:
              selector: '{event_id = "4625"}'
              stages:
              - regex:
                  source: event_data
                  expression: "Name='SubjectUserName'>(?P<subjectusername>.+?)<.+Name='TargetUserName'>(?P<targetusername>.+?)<.+Name='FailureReason'>(?P<failurereason>.+?)<.+Name='LogonType'>(?P<logontype>.+?)<.+Name='WorkstationName'>(?P<workstationname>.+?)<.+Name='IpAddress'>(?P<ipaddress>.+?)<.+" 
              - regex:
                  source:
                  expression: "Failure Reason:....(?P<failurereasontext>.+?)\\."
              
              - template:
                  source: filter
                  template: 'filtered'
              

                  
              - pack:
                  labels:
                    - subjectusername
                    - targetusername
                    - failurereason
                    - logontype
                    - workstationname
                    - ipaddress
                    - failurereasontext
                    
              - template:
                  source: service
                  template: 'ad'
                    
              - labels:
                  service:     

          - match:
              selector: '{event_id = "4624"}'
              stages:
              - regex:
                  source: event_data
                  expression: "Name='SubjectUserName'>(?P<subjectusername>.+?)<.+Name='TargetUserName'>(?P<targetusername>.+?)<.+Name='LogonType'>(?P<logontype>.+?)<.+Name='WorkstationName'>(?P<workstationname>.+?)<.+Name='IpAddress'>(?P<ipaddress>.+?)<.+"
              
              - template:
                  source: filter
                  template: 'filtered'
                  
              - labels:
                  service: ad
                  
              - pack:
                  labels:
                    - subjectusername
                    - targetusername
                    - logontype
                    - workstationname
                    - ipaddress
                    
              - template:
                  source: service
                  template: 'ad'
                    
              - labels:
                  service:
            
          - match:
              selector: '{event_id = "4740"}'
              stages:
              - regex:
                  source: event_data
                  expression: "Name='TargetUserName'>(?P<targetusername>.+?)<.+Name='TargetDomainName'>(?P<targetdomainname>.+?)<"
              
              - template:
                  source: filter
                  template: 'filtered'
              
                  
              - pack:
                  labels:
                    - targetusername
                    - targetdomainname
                    
              - template:
                  source: service
                  template: 'ad'
                    
              - labels:
                  service:
          #NPS          
          - match:
              selector: '{event_id =~ "627(2|3)"}'
              stages:
              - regex:
                  source: event_data
                  expression: "Name='SubjectUserName'>(?P<subjectusername>.+?)<.+Name='CallingStationID'>(?P<callingstationid>.+?)<.+Name='ClientIPAddress'>(?P<clientipaddress>.+?)<.+Name='NetworkPolicyName'>(?P<networkpolicyname>.+?)<"
            
              - regex:
                  source: event_data
                  expression: "Name='ClientName'>(?P<clientname>.+?)<.+"
              - match:
                  selector: '{event_id ="6272"}'
                  stages:
                  - template:
                      source: status
                      template: 'Network Policy Server granted access to a user'
                
              - match:
                  selector: '{event_id="6273"}'
                  stages:
                  - regex:
                      source: event_data
                      expression: "Name='Reason'>(?P<reason>.+?)<"
              
                  - template:
                      source: status
                      template: 'Network Policy Server denied access to a user'
                  
              - template:
                  source: filter
                  template: 'filtered'
                  

                    
              - pack:
                  labels:
                    - event_id
                    - clientipaddress
                    - networkpolicyname
                    - subjectusername
                    - callingstationid
                    - clientname
                    - clientipaddress
                    - reason
                    - status
                    
              - template:
                  source: service
                  template: 'nps'
                    
              - labels:
                  service:
                    
          #kerberos
          - match:
              selector: '{event_id =~ "47(68|69|70|71|72|73)"}'
              stages:
              - template:
                  source: filter
                  template: 'filtered'
              
              - pack:
                  labels:
                    - event_id
                    
              - template:
                  source: service
                  template: 'kerberos'
                    
              - labels:
                  service:
           
          
          
          - labels:
              filter:
          #other logs to ad service    
          - match:
              selector: '{filter = "notfiltered"}'
              stages:
              - template:
                  source: service
                  template: 'ad'
                    
              - labels:
                  service:
           
          - labelallow:
              - event_id
              - service
              - log_type
              - hostname
########################################################

Logs

No response

PsySuck avatar Feb 09 '24 14:02 PsySuck

Hello, thank you for submitting a bug report! Did the Agent run successfully with this configuration on the domain controller before? Or did it stop sending logs only recently?

In addition, are you sure that the pipeline_stages are correct? It might be worth simplifying them a bit to see if any logs are getting through.

Finally, it might be worth taking a look at localhost:12345/metrics, but I'm not sure if there are (useful) metrics for windows events.

ptodev avatar Feb 09 '24 15:02 ptodev

Hello. Sometimes events appear in loki. I selected the query from the label selector. This configuration works great, I just recently found that some events are missing.

UPD: localhost:12345/metrics did't find anything that can help.

PsySuck avatar Feb 09 '24 16:02 PsySuck

It seems like this issue, along with grafana/alloy#432 and grafana/alloy#433 might be caused by too many messages coming in from the event log. Would it be possible to limit the number of incoming messages by filtering some of them out via xpath_query? Another thing to consider is running multiple agents, each consuming a different type of windows event log?

ptodev avatar Feb 12 '24 11:02 ptodev

Hello. Actually usually i use filter and xpath_query. But I just discovered this problem using filters. Multiple agents are not a good idea. I used fluentd before and migrated from it precisely because I had to create several agents. How can this problem be diagnosed? Are there any known limits on the number of logs?

PsySuck avatar Feb 12 '24 12:02 PsySuck

How can this problem be diagnosed?

If you are using the Flow mode, you could use loki.echo after the loki processing stage so that you can see the logs coming out. You could also add an extra processing stage so that you only echo logs from 4662. This would help in knowing whether the problem is due to the writing to Loki, due to reading from the Event Log, or due to processing.

Are there any known limits on the number of logs?

I am personally not aware of any.

ptodev avatar Feb 12 '24 13:02 ptodev

I'm using static mode. Maybe this will can be used https://grafana.com/docs/loki/latest/send-data/promtail/stages/metrics/ ?

PsySuck avatar Feb 12 '24 14:02 PsySuck

Yes, I think you could use the metrics stage in this case. It would be useful to have a counter metric for the log lines which you believe are missing. If the counter is bigger than 0, then this would indicate that the Agent receives those log lines and processes them, but somehow isn't sending them.

By the way, the OTel Collector has a windows events receiver and a loki exporter. The events receiver is labeled as "alpha", so it must be still experimental. If you are interested, you could try it out and see if you get the same issues as in grafana/agent#6344, grafana/alloy#432 and grafana/alloy#433. If you find the OTel component to be more stable, we could add it as an otelcol Flow component.

ptodev avatar Feb 12 '24 16:02 ptodev

I switched to grafana agent flow. I converted config. Performance leaves much to be desired. There are a lot of events. But the agent seems unable to cope. I don't see any problems on the Loki side. Screenshot 2024-03-13 at 13 09 25

PsySuck avatar Mar 13 '24 11:03 PsySuck

After various tests, unfortunately, I came to the conclusion that the problem is in the performance of virtualization and storage. Thanks, I closed the bug.

PsySuck avatar Mar 17 '24 17:03 PsySuck