Logs missing during heavy log volume
Describe the bug
During heavy log volumes, e.g. >10k log entries per second, fluentd consistently drops logs. It may be related to log rotation (on Kubernetes). When I ran a load test, I saw the following entries in the fluentd logs:
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode: 101712295 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-mzxxh_default_logger-8bb9a8d2eb65d5c07af7e194aad99176a79941a69c06b6ae390a0d8b9dd06cf1.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 97581155) because an existing watcher for that filepath follows a different inode: 97581154 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-nrq45_default_logger-2bad2e8722fb2369996c134f02dcf4a2fff8068d43863d3f7173a56ff2a8bbd0.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 111149786) because an existing watcher for that filepath follows a different inode: 111149782 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-p4rcl_default_logger-88fb9eaab07505f6d59f03e48e2993069eba82902efe44a46098c0d7d44f24c4.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 77634742) because an existing watcher for that filepath follows a different inode: 77634741 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log"
When I added follow_inodes=true and rotate_wait=0 to the container configuration, the errors went away, but large chunks of logs were still missing and the following entries appeared in the fluentd logs.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-hw4ds_default_logger-aba43bbd009d1652e1961dbd30ed45f09e337bfb42d3fa247b12fde7af248909.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-jtxmz_default_logger-742ba4e5339168b7b5442745705bbfed1d93c832027ca0c680b193c9c62e796f.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-kmrlv_default_logger-7682a4b64550055203e19ff9387b686e316fe4e5e7884b720dede3692659c686.log failed. Continuing without tailing it.
I am running the latest version of the fluentd kubernetes daemonset for cloudwatch, fluent/fluentd-kubernetes-daemonset:v1.17.1-debian-cloudwatch-1.2.
During the test, both memory and CPU utilization for fluentd remained fairly low.
To Reproduce
Run multiple replicas of the following program:
import multiprocessing
import os
import time
import random
import sys
from datetime import datetime

def generate_log_entry():
    log_levels = ['INFO', 'WARNING', 'ERROR', 'DEBUG']
    messages = [
        'User logged in',
        'Database connection established',
        'File not found',
        'Memory usage high',
        'Network latency detected',
        'Cache cleared',
        'API request successful',
        'Configuration updated'
    ]
    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
    level = random.choice(log_levels)
    message = random.choice(messages)
    pod = os.getenv("POD_NAME", "unknown")
    return f"{timestamp} {pod} [{level}] {message}"

def worker(queue):
    while True:
        log_entry = generate_log_entry()
        queue.put(log_entry)
        time.sleep(0.01)  # Small delay to prevent overwhelming the system

def logger(queue, counter):
    while True:
        log_entry = queue.get()
        with counter.get_lock():
            counter.value += 1
            print(f"[{counter.value}] {log_entry}", flush=True)

if __name__ == '__main__':
    num_processes = multiprocessing.cpu_count()
    manager = multiprocessing.Manager()
    log_queue = manager.Queue()

    # Create a shared counter
    counter = multiprocessing.Value('i', 0)

    # Start worker processes
    workers = []
    for _ in range(num_processes - 1):  # Reserve one process for logging
        p = multiprocessing.Process(target=worker, args=(log_queue,))
        p.start()
        workers.append(p)

    # Start logger process
    logger_process = multiprocessing.Process(target=logger, args=(log_queue, counter))
    logger_process.start()

    try:
        # Keep the main process running
        while True:
            time.sleep(1)
            # Print the current count every second
            print(f"Total logs emitted: {counter.value}", file=sys.stderr, flush=True)
    except KeyboardInterrupt:
        print("\nStopping log generation...", file=sys.stderr)
        # Stop worker processes
        for p in workers:
            p.terminate()
            p.join()
        # Stop logger process
        logger_process.terminate()
        logger_process.join()
        print(f"Log generation stopped. Total logs emitted: {counter.value}", file=sys.stderr)
        sys.exit(0)
Here's the deployment for the test application:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logger-deployment
  labels:
    app: logger
spec:
  replicas: 1 # Adjust the number of replicas as needed
  selector:
    matchLabels:
      app: logger
  template:
    metadata:
      labels:
        app: logger
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - logger
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: logger
        image: jicowan/logger:v3.0
        resources:
          requests:
            cpu: 4
            memory: 128Mi
          limits:
            cpu: 4
            memory: 256Mi
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
Here's the container.conf file for fluentd:
<source>
@type tail
@id in_tail_container_core_logs
@label @raw.containers
@log_level debug
path /var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*aws-node*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*kube-state*.log
pos_file /var/log/fluentd-core-containers.log.pos
tag corecontainers.**
read_from_head true
follow_inodes true
rotate_wait 0
<parse>
@type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
time_format %Y-%m-%dT%H:%M:%S.%N%:z
</parse>
</source>
<source>
@type tail
@id in_tail_container_logs
@label @raw.containers
path /var/log/containers/*.log
exclude_path /var/log/containers/*aws-node*.log,/var/log/containers/*coredns*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*opa*.log,/var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*kube-state-metrics*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
pos_file /var/log/fluentd-containers.log.pos
tag container.**
read_from_head true
follow_inodes true
rotate_wait 0
<parse>
@type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
time_format %Y-%m-%dT%H:%M:%S.%N%:z
</parse>
</source>
<source>
@type tail
@id in_tail_daemonset_logs
@label @containers
path /var/log/containers/*opa*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
pos_file /var/log/daemonset.log.pos
tag daemonset.**
read_from_head true
follow_inodes true
rotate_wait 0
<parse>
@type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
time_format %Y-%m-%dT%H:%M:%S.%N%:z
</parse>
</source>
<label @raw.containers>
<match **>
@id raw.detect_exceptions
@type detect_exceptions
remove_tag_prefix raw
@label @containers
multiline_flush_interval 1s
max_bytes 500000
max_lines 1000
</match>
</label>
<label @containers>
<filter corecontainers.**>
@type prometheus
<metric>
name fluentd_input_status_num_corecontainer_records_total
type counter
desc The total number of incoming corecontainer records
</metric>
</filter>
<filter container.**>
@type prometheus
<metric>
name fluentd_input_status_num_container_records_total
type counter
desc The total number of incoming container records
</metric>
</filter>
<filter daemonset.**>
@type prometheus
<metric>
name fluentd_input_status_num_daemonset_records_total
type counter
desc The total number of incoming daemonset records
</metric>
</filter>
<filter **>
@type record_transformer
@id filter_containers_stream_transformer
<record>
seal_id "110628"
cluster_name "logging"
stream_name ${tag_parts[4]}
</record>
</filter>
<filter **>
@type kubernetes_metadata
@id filter_kube_metadata
@log_level error
</filter>
<match corecontainers.**>
@type copy
<store>
@type prometheus
<metric>
name fluentd_output_status_num_corecontainer_records_total
type counter
desc The total number of outgoing corecontainer records
</metric>
</store>
<store>
@type cloudwatch_logs
@id out_cloudwatch_logs_core_containers
region "us-west-2"
log_group_name "/aws/eks/logging/core-containers"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<inject>
time_key time_nanoseconds
time_type string
time_format %Y-%m-%dT%H:%M:%S.%N
</inject>
<buffer>
flush_interval 5s
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</store>
</match>
<match container.**>
@type copy
<store>
@type prometheus
<metric>
name fluentd_output_status_num_container_records_total
type counter
desc The total number of outgoing container records
</metric>
</store>
<store>
@type cloudwatch_logs
@id out_cloudwatch_logs_containers
region "us-west-2"
log_group_name "/aws/eks/logging/containers"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<inject>
time_key time_nanoseconds
time_type string
time_format %Y-%m-%dT%H:%M:%S.%N
</inject>
<buffer>
flush_interval 5s
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</store>
</match>
<match daemonset.**>
@type copy
<store>
@type prometheus
<metric>
name fluentd_output_status_num_daemonset_records_total
type counter
desc The total number of outgoing daemonset records
</metric>
</store>
<store>
@type cloudwatch_logs
@id out_cloudwatch_logs_daemonset
region "us-west-2"
log_group_name "/aws/eks/logging/daemonset"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<inject>
time_key time_nanoseconds
time_type string
time_format %Y-%m-%dT%H:%M:%S.%N
</inject>
<buffer>
flush_interval 5s
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</store>
</match>
</label>
Expected behavior
The test application assigns a sequence number to each log entry. I have a Python notebook that flattens the JSON log output, sorts the logs by sequence number, then finds gaps in the sequence. This is how I know that fluentd is dropping logs. If everything is working as it should, there should be no log loss.
I ran the same tests with Fluent Bit and experienced no log loss.
Your Environment
- Fluentd version: v1.17.1
- Package version:
- Operating system: Amazon Linux 2
- Kernel version: 5.10.225-213.878.amzn2.x86_64
Your Configuration
data:
containers.conf: |-
<source>
@type tail
@id in_tail_container_core_logs
@label @raw.containers
@log_level debug
path /var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*aws-node*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*kube-state*.log
pos_file /var/log/fluentd-core-containers.log.pos
tag corecontainers.**
read_from_head true
follow_inodes true
rotate_wait 0
<parse>
@type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
time_format %Y-%m-%dT%H:%M:%S.%N%:z
</parse>
</source>
<source>
@type tail
@id in_tail_container_logs
@label @raw.containers
path /var/log/containers/*.log
exclude_path /var/log/containers/*aws-node*.log,/var/log/containers/*coredns*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*opa*.log,/var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*kube-state-metrics*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
pos_file /var/log/fluentd-containers.log.pos
tag container.**
read_from_head true
follow_inodes true
rotate_wait 0
<parse>
@type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
time_format %Y-%m-%dT%H:%M:%S.%N%:z
</parse>
</source>
<source>
@type tail
@id in_tail_daemonset_logs
@label @containers
path /var/log/containers/*opa*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
pos_file /var/log/daemonset.log.pos
tag daemonset.**
read_from_head true
follow_inodes true
rotate_wait 0
<parse>
@type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
time_format %Y-%m-%dT%H:%M:%S.%N%:z
</parse>
</source>
<label @raw.containers>
<match **>
@id raw.detect_exceptions
@type detect_exceptions
remove_tag_prefix raw
@label @containers
multiline_flush_interval 1s
max_bytes 500000
max_lines 1000
</match>
</label>
<label @containers>
<filter corecontainers.**>
@type prometheus
<metric>
name fluentd_input_status_num_corecontainer_records_total
type counter
desc The total number of incoming corecontainer records
</metric>
</filter>
<filter container.**>
@type prometheus
<metric>
name fluentd_input_status_num_container_records_total
type counter
desc The total number of incoming container records
</metric>
</filter>
<filter daemonset.**>
@type prometheus
<metric>
name fluentd_input_status_num_daemonset_records_total
type counter
desc The total number of incoming daemonset records
</metric>
</filter>
<filter **>
@type record_transformer
@id filter_containers_stream_transformer
<record>
seal_id "110628"
cluster_name "logging"
stream_name ${tag_parts[4]}
</record>
</filter>
<filter **>
@type kubernetes_metadata
@id filter_kube_metadata
@log_level error
</filter>
<match corecontainers.**>
@type copy
<store>
@type prometheus
<metric>
name fluentd_output_status_num_corecontainer_records_total
type counter
desc The total number of outgoing corecontainer records
</metric>
</store>
<store>
@type cloudwatch_logs
@id out_cloudwatch_logs_core_containers
region "us-west-2"
log_group_name "/aws/eks/logging/core-containers"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<inject>
time_key time_nanoseconds
time_type string
time_format %Y-%m-%dT%H:%M:%S.%N
</inject>
<buffer>
flush_interval 5s
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</store>
</match>
<match container.**>
@type copy
<store>
@type prometheus
<metric>
name fluentd_output_status_num_container_records_total
type counter
desc The total number of outgoing container records
</metric>
</store>
<store>
@type cloudwatch_logs
@id out_cloudwatch_logs_containers
region "us-west-2"
log_group_name "/aws/eks/logging/containers"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<inject>
time_key time_nanoseconds
time_type string
time_format %Y-%m-%dT%H:%M:%S.%N
</inject>
<buffer>
flush_interval 5s
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</store>
</match>
<match daemonset.**>
@type copy
<store>
@type prometheus
<metric>
name fluentd_output_status_num_daemonset_records_total
type counter
desc The total number of outgoing daemonset records
</metric>
</store>
<store>
@type cloudwatch_logs
@id out_cloudwatch_logs_daemonset
region "us-west-2"
log_group_name "/aws/eks/logging/daemonset"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<inject>
time_key time_nanoseconds
time_type string
time_format %Y-%m-%dT%H:%M:%S.%N
</inject>
<buffer>
flush_interval 5s
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</store>
</match>
</label>
fluent.conf: |
@include containers.conf
@include systemd.conf
@include host.conf
<match fluent.**>
@type null
</match>
host.conf: |
<source>
@type tail
@id in_tail_dmesg
@label @hostlogs
path /var/log/dmesg
pos_file /var/log/dmesg.log.pos
tag host.dmesg
read_from_head true
<parse>
@type syslog
</parse>
</source>
<source>
@type tail
@id in_tail_secure
@label @hostlogs
path /var/log/secure
pos_file /var/log/secure.log.pos
tag host.secure
read_from_head true
<parse>
@type syslog
</parse>
</source>
<source>
@type tail
@id in_tail_messages
@label @hostlogs
path /var/log/messages
pos_file /var/log/messages.log.pos
tag host.messages
read_from_head true
<parse>
@type syslog
</parse>
</source>
<label @hostlogs>
<filter **>
@type kubernetes_metadata
@id filter_kube_metadata_host
watch false
</filter>
<filter **>
@type record_transformer
@id filter_containers_stream_transformer_host
<record>
stream_name ${tag}-${record["host"]}
</record>
</filter>
<match host.**>
@type cloudwatch_logs
@id out_cloudwatch_logs_host_logs
region "#{ENV.fetch('AWS_REGION')}"
log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/host"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<buffer>
flush_interval 5
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</match>
</label>
kubernetes.conf: |
kubernetes.conf
systemd.conf: |
<source>
@type systemd
@id in_systemd_kubelet
@label @systemd
filters [{ "_SYSTEMD_UNIT": "kubelet.service" }]
<entry>
field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "systemd_unit"}
field_map_strict true
</entry>
path /var/log/journal
<storage>
@type local
persistent true
path /var/log/fluentd-journald-kubelet-pos.json
</storage>
read_from_head true
tag kubelet.service
</source>
<source>
@type systemd
@id in_systemd_kubeproxy
@label @systemd
filters [{ "_SYSTEMD_UNIT": "kubeproxy.service" }]
<entry>
field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "systemd_unit"}
field_map_strict true
</entry>
path /var/log/journal
<storage>
@type local
persistent true
path /var/log/fluentd-journald-kubeproxy-pos.json
</storage>
read_from_head true
tag kubeproxy.service
</source>
<source>
@type systemd
@id in_systemd_docker
@label @systemd
filters [{ "_SYSTEMD_UNIT": "docker.service" }]
<entry>
field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "systemd_unit"}
field_map_strict true
</entry>
path /var/log/journal
<storage>
@type local
persistent true
path /var/log/fluentd-journald-docker-pos.json
</storage>
read_from_head true
tag docker.service
</source>
<label @systemd>
<filter **>
@type kubernetes_metadata
@id filter_kube_metadata_systemd
watch false
</filter>
<filter **>
@type record_transformer
@id filter_systemd_stream_transformer
<record>
stream_name ${tag}-${record["hostname"]}
</record>
</filter>
<match **>
@type cloudwatch_logs
@id out_cloudwatch_logs_systemd
region "#{ENV.fetch('AWS_REGION')}"
log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/dataplane"
log_stream_name_key stream_name
auto_create_stream true
remove_log_stream_name_key true
<buffer>
flush_interval 5
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</match>
</label>
Your Error Log
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode: 101712295 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-mzxxh_default_logger-8bb9a8d2eb65d5c07af7e194aad99176a79941a69c06b6ae390a0d8b9dd06cf1.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 97581155) because an existing watcher for that filepath follows a different inode: 97581154 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-nrq45_default_logger-2bad2e8722fb2369996c134f02dcf4a2fff8068d43863d3f7173a56ff2a8bbd0.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 111149786) because an existing watcher for that filepath follows a different inode: 111149782 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-p4rcl_default_logger-88fb9eaab07505f6d59f03e48e2993069eba82902efe44a46098c0d7d44f24c4.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 77634742) because an existing watcher for that filepath follows a different inode: 77634741 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode: 101712295 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-mzxxh_default_logger-8bb9a8d2eb65d5c07af7e194aad99176a79941a69c06b6ae390a0d8b9dd06cf1.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 97581155) because an existing watcher for that filepath follows a different inode: 97581154 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-nrq45_default_logger-2bad2e8722fb2369996c134f02dcf4a2fff8068d43863d3f7173a56ff2a8bbd0.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 111149786) because an existing watcher for that filepath follows a different inode: 111149782 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-p4rcl_default_logger-88fb9eaab07505f6d59f03e48e2993069eba82902efe44a46098c0d7d44f24c4.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 77634742) because an existing watcher for that filepath follows a different inode: 77634741 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log"
2024-11-02 14:15:49 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log" inode=77634746 inode_in_pos_file=77634747
***After setting rotate_wait=0 and follow_inodes=true***
2024-11-02 17:26:28 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log
2024-11-02 17:26:32 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log; waiting 0.0 seconds
2024-11-02 17:26:32 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log" inode=152064097 inode_in_pos_file=0
2024-11-02 17:26:32 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log; waiting 0.0 seconds
2024-11-02 17:26:32 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log" inode=152064099 inode_in_pos_file=0
2024-11-02 17:26:32 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-zwzxv_default_logger-af33706631b5c04250aa71c6956fde092559f09f8891e007dd8d454b12e89135.log; waiting 0.0 seconds
2024-11-02 17:26:32 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-zwzxv_default_logger-af33706631b5c04250aa71c6956fde092559f09f8891e007dd8d454b12e89135.log" inode=112237023 inode_in_pos_file=0
2024-11-02 17:27:48 +0000 [debug]: #0 [in_tail_container_core_logs] tailing paths: target = /var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_copy-fluentd-config-dc7b79cd11ccf90f5b8c512c1552ae13b28abfb2400b2ecd03c12d0ae7ceb564.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_fluentd-cloudwatch-bc8e8da1056c6759e099f6b5b983d44ae7940a4963e376940b3ccacb18a6ab26.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_update-log-driver-992ee8554687722124787066407ad9b21e97e3382b08a216205fda34259a0e03.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-eks-nodeagent-d43b788731adaea1b1e53e23b0cd6c6aa4c15b41afd3f61ccb4f0fe466ae8d30.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-node-688f632cd4bffd057003bcfa31b3546f4d64546e737645174cebc611f97e8e15.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-vpc-cni-init-f59e23252a414c9f2041222c095f86766775eb70d37dd3fd89690978f2f554d0.log,/var/log/containers/kube-proxy-6z8zd_kube-system_kube-proxy-a1aae65c089af12b388a0527ebf25f7418eed956da5b284dace2702d58f422df.log,/var/log/containers/coredns-787cb67946-6dfhl_kube-system_coredns-f8b53737ad2d4133a9d9ac69f9f56bfbc9e7afb54d3dc91e6f7489009365ea17.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-attacher-6530ac17c228aeca7e39958a1aa2f02da5878bf3b6b2fb643b5f43b53fcdf0b9.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-provisioner-d3d1c4db5b0837aabf2cb3676951e85bd63c8d432b47b07770ad3d226f3be522.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-resizer-ea911f783028d85009ebe185d03d602a8eb64fa2fe80da03082703caa69584d8.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_ebs-plugin-db350e781604de4725003c8f38a03f4ca2a1eec021c61005565a3caff3cd4733.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_liveness-probe-db10e53f8e6ecef8fab33ca7e68db83f3070dc406680fc4eb6858bffe431a37f.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_ebs-plugin-bb331132e02cb3ee93c1a2cf5225cd14b2b2d063846e5e1e578665d0679d23ec.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_liveness-probe-a5f50e5e9490b16833b6fed1d29caf9ccb352dbb8852ec4cf5c93781ad61afd2.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_node-driver-registrar-9d0b426f9ebb91798f1d9d444a6d728b09f926794c471229e6f5f4d54891a07a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-b93a02fa5321cba6f33ca5b809c948f9469ea8ffa2f320443960009196ba520a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-init-b02cdb94178b436faaaf7f9a1e97d131046b38716434e2db474b1d5026a66ff0.log | existing = 
/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_copy-fluentd-config-dc7b79cd11ccf90f5b8c512c1552ae13b28abfb2400b2ecd03c12d0ae7ceb564.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_fluentd-cloudwatch-bc8e8da1056c6759e099f6b5b983d44ae7940a4963e376940b3ccacb18a6ab26.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_update-log-driver-992ee8554687722124787066407ad9b21e97e3382b08a216205fda34259a0e03.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-eks-nodeagent-d43b788731adaea1b1e53e23b0cd6c6aa4c15b41afd3f61ccb4f0fe466ae8d30.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-node-688f632cd4bffd057003bcfa31b3546f4d64546e737645174cebc611f97e8e15.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-vpc-cni-init-f59e23252a414c9f2041222c095f86766775eb70d37dd3fd89690978f2f554d0.log,/var/log/containers/kube-proxy-6z8zd_kube-system_kube-proxy-a1aae65c089af12b388a0527ebf25f7418eed956da5b284dace2702d58f422df.log,/var/log/containers/coredns-787cb67946-6dfhl_kube-system_coredns-f8b53737ad2d4133a9d9ac69f9f56bfbc9e7afb54d3dc91e6f7489009365ea17.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-attacher-6530ac17c228aeca7e39958a1aa2f02da5878bf3b6b2fb643b5f43b53fcdf0b9.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-provisioner-d3d1c4db5b0837aabf2cb3676951e85bd63c8d432b47b07770ad3d226f3be522.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-resizer-ea911f783028d85009ebe185d03d602a8eb64fa2fe80da03082703caa69584d8.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_ebs-plugin-db350e781604de4725003c8f38a03f4ca2a1eec021c61005565a3caff3cd4733.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_liveness-probe-db10e53f8e6ecef8fab33ca7e68db83f3070dc406680fc4eb6858bffe431a37f.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_ebs-plugin-bb331132e02cb3ee93c1a2cf5225cd14b2b2d063846e5e1e578665d0679d23ec.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_liveness-probe-a5f50e5e9490b16833b6fed1d29caf9ccb352dbb8852ec4cf5c93781ad61afd2.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_node-driver-registrar-9d0b426f9ebb91798f1d9d444a6d728b09f926794c471229e6f5f4d54891a07a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-b93a02fa5321cba6f33ca5b809c948f9469ea8ffa2f320443960009196ba520a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-init-b02cdb94178b436faaaf7f9a1e97d131046b38716434e2db474b1d5026a66ff0.log
2024-11-02 17:27:49 +0000 [debug]: #0 [in_tail_container_core_logs] tailing paths: target = /var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_copy-fluentd-config-e1c4560f70a672f811586c42239cd8f823c2da7afe504f49af7965f019091f57.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_fluentd-cloudwatch-0e493d532c0a48ae46aed7b6500431b93b0403acd74dd6ff92049c571be9e402.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_update-log-driver-a7799851e03ac287f48cbc63552c5b31016106061ba40493ad644e8a10016e62.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-2a82275bdf85fdb8ac57a6d9e4c927919eb8472e10ffaf77a0290c291111d629.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-a410bd11314ce2fff148d5effd863b8502f0aadf4d492c94c5d841c388b927f4.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-node-0f0417f969145e80e9de2474148256bf009ac84094d26453c53fd5c1c1b0ad6d.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-vpc-cni-init-ffcd1ff811ff67d406fe64096ef05cd9db75666ed1c8efbfbd303f7d09e3c95e.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-32285f83bc32feb2f06700f235ff9db332b23c355b1b7c17b9deaab4a3bcf531.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-a3726048ebd5dceb76fe36e6fadeff5010c6e242aef6bc8f73f4e935a1f4f88c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-170f21c4cd43ac571eadd5d2f7992734ac46ef62cfca08ae3b4dd9b0bcb7657c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-cd01a35e8ddbb4255538b165a64aede38b23cc6926a02dc606f7a568edd3a54d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-attacher-d572d6f311a78a938f22648838d5b85c7c757c0b4cfba2d23f88721a4d969181.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-provisioner-8bb2b99746ddac4a5c72285e2a887bad3d733c5ad66e4f139326a5d8e3bca70e.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-resizer-8ea3c5ce40e31197c5f1f1b922a9b976a5f6bffe499c4a4c6b6db468bc2a421d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_ebs-plugin-dc900b9e6db16ea65db1bad89d640664140423a92868735f45e1389af16a4233.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_liveness-probe-ddb3d10390ebe8b9457ffddf7e375e4d5d42ae9b7c3d0f52f94baa459527f2fd.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-922bec251cadd0bc8c39edddceedaa48fc978968533bef0e47f4cfe1a9bc06b7.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-acb6c394d637726269f1fd5ea9818ecc1706596091338e60a4d3720d1e39deac.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-3ef28982a1e8ed79e8500e05a07f203af6f379f4cd10f31d0dcbe30649271b68.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-7fb635bdc56be11e79798b4e93150a933da72a0e5c17c13ab04e542ee474b651.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-3dbefb298de8507fced55cfa673fc5513c4b9aecfcefb864196de4885bc180b9.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-cf3ab228b12f1509984a0fc9ece0cb77672cd535936bf7aff366ffdce70cd4b6.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-27e3fe2cdbb873aef975b154c8007f769c5992b59226c8c3f059db1dc197ab4a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-6b685d7c878bed82856f3adb5a4cc0587f114cc3af38e378504540166215c69a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-
init-3f779997a0b284a999b0505f1424a4b30af12d143a2a243a74dde7e2c9bd0de9.log,/var/log/containers/prometheus-0_lens-metrics_chown-394770bcd616d0c3d8380fcdbd07ca09fc00738fe17e5f15e5315c9d17312e25.log,/var/log/containers/prometheus-0_lens-metrics_prometheus-e713ff6ca1cb5d4e3d09fb1c07d70f4778efe32f94a4a4f89c7d5e3086ed866b.log | existing = /var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_copy-fluentd-config-e1c4560f70a672f811586c42239cd8f823c2da7afe504f49af7965f019091f57.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_fluentd-cloudwatch-0e493d532c0a48ae46aed7b6500431b93b0403acd74dd6ff92049c571be9e402.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_update-log-driver-a7799851e03ac287f48cbc63552c5b31016106061ba40493ad644e8a10016e62.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-2a82275bdf85fdb8ac57a6d9e4c927919eb8472e10ffaf77a0290c291111d629.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-a410bd11314ce2fff148d5effd863b8502f0aadf4d492c94c5d841c388b927f4.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-node-0f0417f969145e80e9de2474148256bf009ac84094d26453c53fd5c1c1b0ad6d.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-vpc-cni-init-ffcd1ff811ff67d406fe64096ef05cd9db75666ed1c8efbfbd303f7d09e3c95e.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-32285f83bc32feb2f06700f235ff9db332b23c355b1b7c17b9deaab4a3bcf531.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-a3726048ebd5dceb76fe36e6fadeff5010c6e242aef6bc8f73f4e935a1f4f88c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-170f21c4cd43ac571eadd5d2f7992734ac46ef62cfca08ae3b4dd9b0bcb7657c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-cd01a35e8ddbb4255538b165a64aede38b23cc6926a02dc606f7a568edd3a54d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-attacher-d572d6f311a78a938f22648838d5b85c7c757c0b4cfba2d23f88721a4d969181.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-provisioner-8bb2b99746ddac4a5c72285e2a887bad3d733c5ad66e4f139326a5d8e3bca70e.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-resizer-8ea3c5ce40e31197c5f1f1b922a9b976a5f6bffe499c4a4c6b6db468bc2a421d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_ebs-plugin-dc900b9e6db16ea65db1bad89d640664140423a92868735f45e1389af16a4233.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_liveness-probe-ddb3d10390ebe8b9457ffddf7e375e4d5d42ae9b7c3d0f52f94baa459527f2fd.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-922bec251cadd0bc8c39edddceedaa48fc978968533bef0e47f4cfe1a9bc06b7.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-acb6c394d637726269f1fd5ea9818ecc1706596091338e60a4d3720d1e39deac.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-3ef28982a1e8ed79e8500e05a07f203af6f379f4cd10f31d0dcbe30649271b68.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-7fb635bdc56be11e79798b4e93150a933da72a0e5c17c13ab04e542ee474b651.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-3dbefb298de8507fced55cfa673fc5513c4b9aecfcefb864196de4885bc180b9.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-cf3ab228b12f1509984a0fc9ece0cb77672cd535936bf7aff366ffdce70cd4b6.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-27e3fe2cdbb873aef975b154c8007f769c5992b59226c8c3f059db1dc197ab4a
.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-6b685d7c878bed82856f3adb5a4cc0587f114cc3af38e378504540166215c69a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-init-3f779997a0b284a999b0505f1424a4b30af12d143a2a243a74dde7e2c9bd0de9.log,/var/log/containers/prometheus-0_lens-metrics_chown-394770bcd616d0c3d8380fcdbd07ca09fc00738fe17e5f15e5315c9d17312e25.log,/var/log/containers/prometheus-0_lens-metrics_prometheus-e713ff6ca1cb5d4e3d09fb1c07d70f4778efe32f94a4a4f89c7d5e3086ed866b.log
2024-11-02 17:27:54 +0000 [info]: #0 [filter_kube_metadata_host] stats - namespace_cache_size: 0, pod_cache_size: 0
2024-11-02 17:27:54 +0000 [info]: #0 [filter_kube_metadata_host] stats - namespace_cache_size: 0, pod_cache_size: 0
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-hw4ds_default_logger-aba43bbd009d1652e1961dbd30ed45f09e337bfb42d3fa247b12fde7af248909.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-jtxmz_default_logger-742ba4e5339168b7b5442745705bbfed1d93c832027ca0c680b193c9c62e796f.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-kmrlv_default_logger-7682a4b64550055203e19ff9387b686e316fe4e5e7884b720dede3692659c686.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-ptf4k_default_logger-88c30f214da39c81d5fc04466eacddf79278dcd9f99402e5c051243e26b7218f.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-rnm4s_default_logger-df9566f71c1fd7ab074850d94ee4771ea24d9b653599a61cce791f7e221224c2.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-vvrtx_default_logger-37eb38772106129b0925b5fdb8bc20f378c6156ef510d787ec35c57fd3bd68bc.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-z9cxt_default_logger-c49720681936856bf6d2df5df3f35561a56d62f4c6a7d65aea8c7e0d70c37ad8.log failed. Continuing without tailing it.
Additional context
No response
Consistently seeing the following errors in the logs (after changing rotate_wait to 60s):
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log; waiting 60.0 seconds
2024-11-04 15:27:48 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log" inode=100695028 inode_in_pos_file=0
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log; waiting 60.0 seconds
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log
Contents of containers.log.pos file:
/var/log/containers/karpenter-76785c6874-gjsjq_karpenter-system_controller-2008ed03f1b7010e3a10bd6249585a91ea4f52b7bb807abdbffee2012e3634e5.log 0000000000009870 00000000025000d3
/var/log/containers/aws-guardduty-agent-kct6q_amazon-guardduty_aws-guardduty-agent-ce5502b765a04b99c5bc04c9cb3d110d6be023626430780c03c0df7ac25360fb.log 0000000000000e1f 000000000070bb0f
/var/log/containers/aws-guardduty-agent-kct6q_amazon-guardduty_aws-guardduty-agent-7b219cd69b2abd4809d569dd8810052a2c1cc2c139f42589b879db518fb42c98.log 0000000000000e1f 000000000070c062
/var/log/containers/karpenter-76785c6874-gjsjq_karpenter-system_controller-b7642902d9be8cf37a8f2e0e05bf858cdaa6e226a89947538d3856bf25d669a4.log 0000000000009018 00000000025000c1
/var/log/containers/node-exporter-tvjg5_lens-metrics_node-exporter-429a1e98cabdf9227e3d222649c64cbd37200d42148f0aa3c461a6293d25c57f.log 0000000000001fc2 0000000002e00b5e
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000006007bf4
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000006007bf9
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000006007bfb
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000006007bfc
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000006007bfd
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000006007bfb
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log 0000000000b83ff5 0000000006007bfc
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log 0000000000000000 0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log 00000000003865bb 0000000006007bfd
Thanks for this report. We need to figure out the possible cause. I will investigate this weekend.
Thanks. I've tried different combinations of settings since opening this issue, e.g. using a file buffer, increasing the chunk size, increasing the mem/CPU allocated to the fluentd daemonset, etc. None of them seems to have an impact on Fluentd's ability to tail the logs. It's as if it's losing track of the files it's supposed to tail. I have the notebook I've been using to find gaps in the sequence. Let me know if you want me to post it here.
@daipom I just ran a test where I set the kubelet's containerLogMaxSize to 50Mi (the default is 10Mi). After doing that I saw zero log loss. I'm not totally sure why that would be. My only guess is that the files are being rotated less often and so there are fewer files for fluentd to keep track of.
@daipom Do you think increasing the number of workers and allocating them to the source blocks for @type tail would help with smaller log files?
I tried it briefly in my local environment, but I could not reproduce this. Do we need Kubernetes to reproduce it?
@jicowan Can you reproduce this without Kubernetes?
I only tried this on k8s. I ran multiple replicas of it (at least 10). When the logs grew to 10MB, they were rotated by the kubelet. That's where I saw the issue. Fluentd lost track of the inodes because the files were being rotated so quickly.
@jicowan We are trying to reproduce this issue. Could you please tell us how to reproduce this in detail? I can run the node with the test application, but I don't know how to collect the output. Do we need another Fluentd node to reproduce this? Or should we use a sidecar?
I think I need to have a file like /var/log/containers/... and collect it by in_tail, but I don't know how to do that.
If I set up a pod as in To Reproduce, the logs will be output to standard output.
Sorry I'm not familiar with K8s, but I need a detailed procedure to reproduce this.
First you need a Kubernetes cluster (try not to use KIND, MiniKube, or another single-node version of Kubernetes). Then you need to install the Fluentd DaemonSet. You can download the manifests from here. I used the version for Amazon CloudWatch, but you can use a different backend if you like. So long as it can absorb the volume of logs that you're sending to it, the choice of backend shouldn't affect the results of the tests. The default log file size is 10MB. At 10MB, the kubelet (the Kubernetes "agent") will rotate the log file.
You can use the Kubernetes Deployment I created to deploy the logging application:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logger-deployment
  labels:
    app: logger
spec:
  replicas: 1 # Adjust the number of replicas as needed
  selector:
    matchLabels:
      app: logger
  template:
    metadata:
      labels:
        app: logger
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - logger
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: logger
        image: jicowan/logger:v3.0
        resources:
          requests:
            cpu: 4
            memory: 128Mi
          limits:
            cpu: 4
            memory: 256Mi
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
The configuration for Fluentd is typically stored in a ConfigMap. If this isn't descriptive enough, I can walk you through the configuration during a web conference.
I can't verify this is happening yet, but it may be that the files are being rotated so fast that fluentd doesn't have enough time to read them before they are compressed. As the kubelet rotates the logs, it renames the 0.log file and eventually compresses the rotated copies.
I'm trying to reproduce this in a local environment using minikube.
However, I can't reproduce it yet.
I'd like to know what environment yours is running on.
Is your environment on AWS?
I'm going to try to reproduce it in that environment.
Yes, the environment was on AWS. You can use this eksctl configuration file to provision a similar environment. You can adjust the maximum size of the log file by changing the value of containerLogMaxSize. The default is 10Mi. The default containerLogMaxWorkers is 1. I also changed the storage type from gp3 to io1 because I was using a file buffer and wanted a disk with better IO characteristics. You can change it back to gp3 if you want.
# An advanced example of ClusterConfig object with customised nodegroups:
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: logging
  region: us-west-2
  version: "1.30"
nodeGroups:
  - name: ng3
    instanceType: m5.4xlarge
    desiredCapacity: 2
    privateNetworking: true
    ssh:
      enableSsm: true
    kubeletExtraConfig:
      containerLogMaxWorkers: 5
      containerLogMaxSize: "50Mi"
    ebsOptimized: true
    volumeType: io1
iam:
  withOIDC: true
accessConfig:
  authenticationMode: API_AND_CONFIG_MAP
vpc:
  nat:
    gateway: Single
If you send the logs to CloudWatch, you'll need to use IRSA or pod identities to assign an IAM role to the pod.
If the log files are rotated in a shorter time than specified in refresh_interval, they may not be handled properly.
The workaround would be to shorten the refresh_interval, or increase the size limit of the rotation file to extend the rotation time.
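As a rough illustration of the first workaround, a shortened refresh_interval on an in_tail source might look like the sketch below; the path, tag, and pos_file are placeholders rather than values from this issue.
<source>
  @type tail
  @id in_tail_app_logs
  # Placeholder path for illustration
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-app-containers.log.pos
  tag app.**
  read_from_head true
  # Re-scan the watched paths every second so newly rotated files are picked up sooner
  refresh_interval 1
  <parse>
    @type json
  </parse>
</source>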
If you increase the size limit of the rotation file, then because fluentd reads slower than the logs are written, at some point you will lose one of the rotation files.
The refresh interval is set to 1 @Watson1978. @slopezxrd I can't verify this yet, but if Fluentd is unable to read the logs fast enough, they will get compressed [by the Kubelet] before it has had time to read the whole file which will result in lost logs. If you look at the code for the Kubelet, it has already accounted for this once before, https://github.com/kubernetes/kubernetes/blob/f1b3fdf7e6d40714b1a43757221832aa1c4a49d1/pkg/kubelet/logs/container_log_manager.go#L451-L472.
Sorry for the late response. I have been investigating this issue for a while.
Now I recommend the following configuration for running on Kubernetes.
Recommended configuration
<source>
@type tail
follow_inodes true
rotate_wait 0
path /var/log/containers/...path to your app logs...
...
</source>
follow_inodes true
With follow_inodes false, if a log file rotation is detected, a new log file may not be read until the refresh_interval has elapsed.
I recommend setting follow_inodes true to avoid this behavior.
rotate_wait 0
With follow_inodes true, it will display many warning messages of Skip update_watcher because watcher has been already updated....
Setting rotate_wait 0 might suppress this message, and you can ignore the Skip update_watcher because watcher has been already updated... warning message.
There is no problem with Fluentd's behavior when that message is displayed.
path /var/log/containers/...path to your app logs...
There is a symbolic link to the application log under /var/log/containers/.
It would be sufficient to use that as the read target.
Warning messages
You can ignore the following warning messages. There is no problem with Fluentd's behavior when that message is displayed.
- Skip update_watcher because watcher has been already updated...
- Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode...
I will fix these warning messages or relax the warning log level.
@Watson1978 Thanks for investigating!
So, the problem is that the rotation occurs at very high speed.
In that case, it is certainly better to set follow_inodes false (default) and rotate_wait 0.
Warning messages
You can ignore the following warning messages. There is no problem with Fluentd's behavior when that message is displayed.
- Skip update_watcher because watcher has been already updated...
- Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode...
I will fix these warning messages or relax the warning log level.
Yes! There was a bug in older versions that could cause in_tail collection to stop without an error log. These warning logs were placed at that time as a precaution.
In this case, the fast rotation causes this warning, but there seems to be no problem with the collection. So, as @Watson1978 says, you can ignore these warnings. These logs should be fixed, considering the case of fast rotations.
During heavy log volumes, e.g. >10k log entries per second, fluentd consistently drops logs.
Hmm, does setting follow_inodes false and rotate_wait 0 cause log loss?
Looks like we need to investigate the log loss problem more.
I've tried setting follow_inodes to true and false. I see log loss in both instances. My refresh_interval is currently set to 1, my rotate_wait is set to 0. I think Fluentd is falling so far behind when tailing the logs that the log file is getting compressed before it can finish reading the file.
@jicowan Sorry for my late response. I have investigated this issue and I have found the cause.
As a conclusion, if there is a log file that receives a high volume of logs faster than in_tail can read, it makes the collection unstable.
For such files, please separate the in_tail setting into multiple <source> blocks, knowing that it will be unstable.
Do not mix settings with other normal-size file collections into one <source>.
In addition, if such files exist, the following settings will help stabilize the collection by limiting the amount of collection per unit of time, but please note that it will reduce total throughput.
If such files exist, it will be fundamentally challenging to prevent log loss, but I would be willing to consider possible improvements in future versions.
Here are the details.
Cause
- If a log file receives a high volume of logs faster than in_tail can read, it becomes busy and causes delays in other processes on that in_tail config.
- If log rotation detection is too slow and multiple rotations occur in the meantime, some files may be missed and not collected.
Workarounds
- Split the in_tail config into smaller ones:
  - Avoid specifying too many log targets in a single path.
  - This can improve performance since each in_tail config runs in its own thread.
- Use read_bytes_limit_per_second:
  - Prevents large log files from blocking other processing.
  - Note: May reduce overall throughput.
- Use the <group> directive (see the configuration sketch at the end of this comment):
  - Similar to read_bytes_limit_per_second, but more powerful.
  - Allows limiting logs by group, such as per pod.
  - Useful to prevent high-volume pods from affecting log collection from others.
Other Notes
- follow_inodes is usually not relevant:
  - It's generally not needed unless there's a specific reason.
  - Avoid using it if the log rotation interval is shorter than refresh_interval.
- rotate_wait also has limited effect, but should be set lower than refresh_interval.
- Frequent warning logs like the following may indicate that in_tail is too busy to collect all data stably:
  - Could not follow a file ...
  - stat() for ... failed. Continuing without tailing it.
  - Skip update_watcher because ...
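To make the read_bytes_limit_per_second and <group> items above concrete, here is a minimal, hedged configuration sketch. The paths, tags, pos_file locations, the group pattern, and the limit values are illustrative assumptions rather than settings taken from this issue, and the <group>/<rule> parameter details should be double-checked against the in_tail documentation for the Fluentd version in use.
# Sketch only: a dedicated source for the high-volume app so it cannot starve the other tails.
<source>
  @type tail
  @id in_tail_heavy_logger
  # Hypothetical path matching only the noisy workload
  path /var/log/containers/logger-deployment*.log
  pos_file /var/log/fluentd-heavy-logger.log.pos
  tag heavy.**
  read_from_head true
  # Cap how many bytes this source reads per second (example value)
  read_bytes_limit_per_second 1048576
  <parse>
    @type json
  </parse>
</source>
# Sketch only: everything else, optionally rate-limited per pod via <group>.
<source>
  @type tail
  @id in_tail_other_containers
  path /var/log/containers/*.log
  exclude_path /var/log/containers/logger-deployment*.log
  pos_file /var/log/fluentd-other-containers.log.pos
  tag container.**
  read_from_head true
  <group>
    # Named captures in the pattern define the grouping keys used by <rule> (pattern is an assumption)
    pattern /^\/var\/log\/containers\/(?<podname>.+)_(?<namespace>[^_]+)_(?<container>.+)-(?<docker_id>[a-f0-9]{64})\.log$/
    rate_period 30s
    <rule>
      # Example rule: cap pods in the default namespace at 5000 lines per rate_period
      match {
        "namespace": ["default"]
      }
      limit 5000
    </rule>
  </group>
  <parse>
    @type json
  </parse>
</source>
Splitting the noisy application files into their own <source> keeps their backlog from delaying rotation handling for the quieter files, at the cost of an extra thread and position file.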
Thanks for investigating this issue @daipom. I don't think we can use <group> here because the container runtime is programmed to write logs to a file, e.g. /var/log/containers/container_id/log_0.log, and the kubelet rotates that file (renames it to log_1.log when it reaches a particular size, and compresses it after 2 rotations). I could see using <group> if I were only interested in capturing the logs of a few containers, but for my use case I need to capture all logs from all containers. Can you make a single instance of in_tail multi-threaded so the logs are read faster? I don't see this issue with Fluent Bit. I assume it's because it's written in C++.
I see. Thanks. Then, currently, read_bytes_limit_per_second setting will help stabilize the collection.
Can you make a single instance of in_tail multi-threaded so the logs are read faster? I don't see this issue with Fluent Bit. I assume it's because it's written in C++.
That would be a much larger revision. And it must be made very carefully to avoid making new bugs.
Besides, I don't know how much multi-threading would increase the overall speed.
Since Ruby has GVL, if the reading cannot keep up with the current implementation, then multi-threading would not help increase speed very much.
It could be the same as setting read_bytes_limit_per_second.
I will do a little research to see if there are improvements that might be possible for v1.19, but I don't think we will be able to make a very big fix in time.
If you have too many logs for a particular file, you can use the workaround described in https://github.com/fluent/fluentd/issues/4693#issuecomment-2796483918.
If there are too many logs overall to be read, then we can use Fluentd's multi-worker feature, or run multiple Fluentd instances, or reduce the amount of logs generated.
Now this issue is to discuss improvements in the following points. (Maybe we should create another issue, but for now...)
As a conclusion, if there is a log file that receives a high volume of logs faster than in_tail can read, it makes the collection unstable. For such files, please separate the in_tail setting into multiple <source> blocks, knowing that it will be unstable. Do not mix settings with other normal-size file collections into one <source>.
In addition, if such files exist, the following settings will help stabilize the collection by limiting the amount of collection per unit of time, but please note that it will reduce total throughput.
If such files exist, it will be fundamentally challenging to prevent log loss, but I would be willing to consider possible improvements in future versions.
Cause
- If a log file receives a high volume of logs faster than in_tail can read, it becomes busy and causes delays in other processes on that in_tail config.
- If log rotation detection is too slow and multiple rotations occur in the meantime, some files may be missed and not collected.
Can you make a single instance of in_tail multi-threaded so the logs are read faster? I don't see this issue with Fluent Bit. I assume it's because it's written in C++.
That would be a much larger revision. And it must be made very carefully to avoid making new bugs.
Besides, I don't know how much multi-threading would increase the overall speed. Since Ruby has GVL, if the reading cannot keep up with the current implementation, then multi-threading would not help increase speed very much. It could be the same as setting read_bytes_limit_per_second.
I will do a little research to see if there are improvements that might be possible for v1.19, but I don't think we will be able to make a very big fix in time.
Ran into a similar issue.
I increased the number of workers and broke up the log file reading into separate workers explicitly.
Before I did this, fluentd was not using more than 1000m (or the equivalent of 1 core); after the change, I saw it start using more than 1000m.
example config:
<system>
workers 3
</system>
<worker 0>
<source>
@type tail
path /var/log/containers/*.log
exclude_path ["/var/log/containers/*heavyloggingcontainer*.log"]
pos_file /opt/fluentd-containers.pos
tag kubernetes.*
read_from_head true
refresh_interval 5
rotate_wait 0
follow_inodes true
<parse>
@type cri
</parse>
</source>
</worker>
<worker 1>
<source>
@type tail
path /var/log/containers/*heavyloggingcontainer*.log
pos_file /opt/fluentd-containers-1.pos
tag kubernetes.*
read_from_head true
refresh_interval 5
rotate_wait 0
follow_inodes true
<parse>
@type cri
</parse>
</source>
</worker>