go-dnscollector icon indicating copy to clipboard operation
go-dnscollector copied to clipboard

file logger: buffer still full despite "chan-buffer-size" changes

Open romainw opened this issue 11 months ago • 12 comments

Many thanks for your time and effort with go-dnscollector!

We are using go-dnscollector as part of the pDNSSOC project (https://github.com/CERN-CERT/pDNSSOC)

We process incoming dnstap files for a number of resolvers with little/no control over the files size or content. We feed go-dnscollectors with dnstap files in a directory "watched" by the ingestor. Each file is about 50MB in size.

Unfortunately we suffer from a buffer full, dropping packets error. We typically loose hundreds of thousands of packets, and our logs are flooded with "buffer full" error messages.

We tried increasing the value of "chan-buffer-size" to a larger value (e.g. 10485760) but it does not change the amount of errors or the amount of lost packets in each error.

I am not sure this is a bug or just a configuration issue, or just us doing it wrong, or what we can do about this. Any advice welcome.

Current configuration:

    - name: dns_server_1
      file-ingestor:
        enable: true
        watch-dir: /var/pdnssoc_input/dns_server_1
        watch-mode: dnstap
        delete-after: true
        chan-buffer-size: 10485760
      transforms:
        filtering:
          log-queries: true
          log-replies: true
          drop-queryip-file: /etc/dnscollector/dns_servers.txt

romainw avatar Mar 05 '24 19:03 romainw

Could you share more details ?

  • the complete config file
  • version used
  • and output logs with buffer full messages

dmachard avatar Mar 05 '24 19:03 dmachard

Sure! Thanks Denis.

We are using 0.41.0.

Config:

global:
  trace:
    verbose: true
  text-format: "timestamp-rfc3339ns identity operation rcode queryip queryport family protocol length qname qtype latency"
  # default text field delimiter
  text-format-delimiter: " "
  # default text field boundary
  text-format-boundary: "\""

multiplexer:
  collectors:
    - name: dnstap-dns-server-1
      file-ingestor:
        enable: true
        watch-dir: /var/pdnssoc_input/dns-server-1
        watch-mode: dnstap
        delete-after: true
        chan-buffer-size: 10485760
      transforms:
        filtering:
          log-queries: true
          log-replies: true
          drop-queryip-file: /root/dns_servers.txt
  loggers:
    - name: filelogdomains
      logfile:
        file-path: /var/dnscollector/matches/matches_domains.json
        mode: json
      transforms:
        filtering:
          keep-fqdn-file: '/var/dnscollector/misp_domains.txt'

    - name: filelogips
      logfile:
        file-path: /var/dnscollector/matches/matches_ips.json
        mode: json
      transforms:
        filtering:
          keep-rdata-file: '/var/dnscollector/misp_ips.txt'

    - name: fileall
      logfile:
        file-path: /var/dnscollector/queries.log
        mode: json
        max-size: 100
        max-files: 100
        postrotate-command: "/var/dnscollector/postrotate_query.sh"
        postrotate-delete-success: true

  routes:
    - from: [ dnstap-dns-server-1 ]
      to: [ filelogdomains, filelogips, fileall ]

Log files:

INFO: 2024/03/06 19:07:30.196568 collector - [dnstap-dns-server-1] fileingestor - processing of [dnstap-2024-03-02-18:56:53.fstrm] terminated
INFO: 2024/03/06 19:07:30.196586 collector - [dnstap-dns-server-1] fileingestor - delete file [dnstap-2024-03-02-18:56:53.fstrm]
INFO: 2024/03/06 19:07:30.439693 collector - [dnstap-dns-server-1] fileingestor - processing of [dnstap-2024-03-02-10:56:53.fstrm] terminated
INFO: 2024/03/06 19:07:30.439721 collector - [dnstap-dns-server-1] fileingestor - delete file [dnstap-2024-03-02-10:56:53.fstrm]
INFO: 2024/03/06 19:07:30.608676 collector - [dnstap-dns-server-1] fileingestor - processing of [dnstap-2024-03-02-02:56:53.fstrm] terminated
INFO: 2024/03/06 19:07:30.608694 collector - [dnstap-dns-server-1] fileingestor - delete file [dnstap-2024-03-02-02:56:53.fstrm]
ERROR: 2024/03/06 19:07:33.763154 processor - [dnstap-dns-server-1] dnstap - logger[fileall] buffer is full, 364989 packet(s) dropped

The error would repeat over and over:

ERROR: 2024/03/03 04:16:40.594323 processor - [dnstap-dns-server-1] dnstap - logger[fileall] buffer is full, 647689 packet(s) dropped
ERROR: 2024/03/03 04:16:50.602279 processor - [dnstap-dns-server-1] dnstap - logger[fileall] buffer is full, 741568 packet(s) dropped
ERROR: 2024/03/03 04:17:00.726706 processor - [dnstap-dns-server-1] dnstap - logger[fileall] buffer is full, 665186 packet(s) dropped
ERROR: 2024/03/03 04:17:10.824914 processor - [dnstap-dns-server-1] dnstap - logger[fileall] buffer is full, 318581 packet(s) dropped
ERROR: 2024/03/03 04:17:20.825447 processor - [dnstap-dns-server-1] dnstap - logger[fileall] buffer is full, 123588 packet(s) dropped
ERROR: 2024/03/03 04:17:30.895391 processor - [dnstap-dns-server-1] dnstap - logger[fileall] buffer is full, 84076 packet(s) dropped
ERROR: 2024/03/03 04:18:43.438253 processor - [dnstap-dns-server-1] dnstap - logger[fileall] buffer is full, 232958 packet(s) dropped
ERROR: 2024/03/03 04:18:53.484891 processor - [dnstap-dns-server-1] dnstap - logger[fileall] buffer is full, 646237 packet(s) dropped
ERROR: 2024/03/03 04:19:03.654005 processor - [dnstap-dns-server-1] dnstap - logger[fileall] buffer is full, 484388 packet(s) dropped
ERROR: 2024/03/03 04:19:13.786948 processor - [dnstap-dns-server-1] dnstap - logger[fileall] buffer is full, 403251 packet(s) dropped

romainw avatar Mar 06 '24 19:03 romainw

(Just to give a volume indication, feeding on 34M dnstap file results in " dnstap - logger[fileall] buffer is full, 2003 packet(s) dropped" with the configuration file from the previous message)

romainw avatar Mar 06 '24 20:03 romainw

In your case, the bottleneck is the fileall logger, you can see that with logger[fileall] buffer is full, 647689 packet(s) dropped.This message indicates that the buffer of the fileall logger is full.

It occured because this logger is too slow regarding the amount of incoming data. To avoid to block the collector processing, the overhead of packets are dropped.

Could you try some adjustments

  • If you remove the postrotate command, the drop of packet still occurs ?
  • Try to increase the buffer of the logger fileall ? Be carefull with this parameter (big impact on memory usage)
chan-buffer-size: 131070

dmachard avatar Mar 07 '24 08:03 dmachard

Good thinking, thank you!

After testing, it turns out that:

  • The postrotate script is causing the issue -- Disabling it solves the dropped packet issue.
  • The buffer size of the logger fileall does not play a role. In fact, the default 65535 is sufficient.

For the record, the postrotate script we use is:

#!/bin/bash

BACKUP_FOLDER=/var/dnscollector/queries/$(date +%Y-%m-%d)
mkdir -p $BACKUP_FOLDER

FILE_NAME=$(basename $1 .log)

jq -c '. | {timestamp: .dnstap."timestamp-rfc3339ns", query: .dns.qname, client: .network."query-ip", server: .network."response-ip", client_id: .dnstap.identity , answers: .dns."resource-records".an }' $1 > $BACKUP_FOLDER/$FILE_NAME.json && gzip -S .gz_minified $BACKUP_FOLDER/$FILE_NAME.json

I will continue to investigate what is causing the bottleneck there and update this ticket with any possible solution.

romainw avatar Mar 08 '24 05:03 romainw

@arvchristos @romainw The best way will be that the pdnssoc-cli uses natively the 'flat-json' model generated by DNScollector to avoid a new JSON generation from an external script. Also the file logger support gzip compression.

I think that the pDNSSOC and DNScollector integration can be greatly improved to support a higher load :)

dmachard avatar Mar 08 '24 10:03 dmachard

Hey @dmachard !

The main reason why we relied on an external script for JSON generation is to be able to store only the required for correlation fields and save on size. I am not sure whether there is any transformer to remove fields from the JSON output but it seems useful for space optimization (combined with gzip of course as you correctly pointed out).

We essentially need to store minified logs so that we can run correlation on past data with new indicators of compromise, apart from the live correlation we do with various other sources that go-dnscollector supports.

arvchristos avatar Mar 08 '24 16:03 arvchristos

I am not sure whether there is any transformer to remove fields from the JSON output

Great idea, can you open a ticket to track this?

dmachard avatar Mar 08 '24 17:03 dmachard

I am not sure whether there is any transformer to remove fields from the JSON output

Great idea, can you open a ticket to track this?

Sure, here it is:

#642

arvchristos avatar Mar 08 '24 18:03 arvchristos

Custom JSON relabeling has been implemented. It's will be implemented in next release v0.43.0. The documentation is here. Perhaps it can be used to improved performance.

dmachard avatar Mar 24 '24 10:03 dmachard