Logs missing during heavy log volume
Describe the bug
During heavy log volumes, e.g. >10k log entries per second, fluentd consistently drops logs. It may be related to log rotation (on Kubernetes). When I ran a load test, I saw the following entries in the fluentd logs:
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode: 101712295 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-mzxxh_default_logger-8bb9a8d2eb65d5c07af7e194aad99176a79941a69c06b6ae390a0d8b9dd06cf1.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 97581155) because an existing watcher for that filepath follows a different inode: 97581154 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-nrq45_default_logger-2bad2e8722fb2369996c134f02dcf4a2fff8068d43863d3f7173a56ff2a8bbd0.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 111149786) because an existing watcher for that filepath follows a different inode: 111149782 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-p4rcl_default_logger-88fb9eaab07505f6d59f03e48e2993069eba82902efe44a46098c0d7d44f24c4.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 77634742) because an existing watcher for that filepath follows a different inode: 77634741 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log"
When I added follow_inodes=true and rotate_wait=0 to the container configuration, the errors went away, but large chunks of logs were still missing and the following entries appeared in the fluentd logs.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-hw4ds_default_logger-aba43bbd009d1652e1961dbd30ed45f09e337bfb42d3fa247b12fde7af248909.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-jtxmz_default_logger-742ba4e5339168b7b5442745705bbfed1d93c832027ca0c680b193c9c62e796f.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-kmrlv_default_logger-7682a4b64550055203e19ff9387b686e316fe4e5e7884b720dede3692659c686.log failed. Continuing without tailing it.
I am running the latest version of the fluentd kubernetes daemonset for cloudwatch, fluent/fluentd-kubernetes-daemonset:v1.17.1-debian-cloudwatch-1.2.
During the test, both memory and CPU utilization for fluentd remained fairly low.
To Reproduce
Run multiple replicas of the following program:
import multiprocessing
import os
import time
import random
import sys
from datetime import datetime

def generate_log_entry():
    log_levels = ['INFO', 'WARNING', 'ERROR', 'DEBUG']
    messages = [
        'User logged in',
        'Database connection established',
        'File not found',
        'Memory usage high',
        'Network latency detected',
        'Cache cleared',
        'API request successful',
        'Configuration updated'
    ]
    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
    level = random.choice(log_levels)
    message = random.choice(messages)
    pod = os.getenv("POD_NAME", "unknown")
    return f"{timestamp} {pod} [{level}] {message}"

def worker(queue):
    while True:
        log_entry = generate_log_entry()
        queue.put(log_entry)
        time.sleep(0.01)  # Small delay to prevent overwhelming the system

def logger(queue, counter):
    while True:
        log_entry = queue.get()
        with counter.get_lock():
            counter.value += 1
            print(f"[{counter.value}] {log_entry}", flush=True)

if __name__ == '__main__':
    num_processes = multiprocessing.cpu_count()
    manager = multiprocessing.Manager()
    log_queue = manager.Queue()

    # Create a shared counter
    counter = multiprocessing.Value('i', 0)

    # Start worker processes
    workers = []
    for _ in range(num_processes - 1):  # Reserve one process for logging
        p = multiprocessing.Process(target=worker, args=(log_queue,))
        p.start()
        workers.append(p)

    # Start logger process
    logger_process = multiprocessing.Process(target=logger, args=(log_queue, counter))
    logger_process.start()

    try:
        # Keep the main process running
        while True:
            time.sleep(1)
            # Print the current count every second
            print(f"Total logs emitted: {counter.value}", file=sys.stderr, flush=True)
    except KeyboardInterrupt:
        print("\nStopping log generation...", file=sys.stderr)
        # Stop worker processes
        for p in workers:
            p.terminate()
            p.join()
        # Stop logger process
        logger_process.terminate()
        logger_process.join()
        print(f"Log generation stopped. Total logs emitted: {counter.value}", file=sys.stderr)
        sys.exit(0)
Here's the deployment for the test application:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logger-deployment
  labels:
    app: logger
spec:
  replicas: 1 # Adjust the number of replicas as needed
  selector:
    matchLabels:
      app: logger
  template:
    metadata:
      labels:
        app: logger
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - logger
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: logger
        image: jicowan/logger:v3.0
        resources:
          requests:
            cpu: 4
            memory: 128Mi
          limits:
            cpu: 4
            memory: 256Mi
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
Here's the container.conf file for fluentd:
<source>
@type tail
@id in_tail_container_core_logs
@label @raw.containers
@log_level debug
path /var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*aws-node*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*kube-state*.log
pos_file /var/log/fluentd-core-containers.log.pos
tag corecontainers.**
read_from_head true
follow_inodes true
rotate_wait 0
<parse>
@type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
time_format %Y-%m-%dT%H:%M:%S.%N%:z
</parse>
</source>
<source>
@type tail
@id in_tail_container_logs
@label @raw.containers
path /var/log/containers/*.log
exclude_path /var/log/containers/*aws-node*.log,/var/log/containers/*coredns*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*opa*.log,/var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*kube-state-metrics*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
pos_file /var/log/fluentd-containers.log.pos
tag container.**
read_from_head true
follow_inodes true
rotate_wait 0
<parse>
@type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
time_format %Y-%m-%dT%H:%M:%S.%N%:z
</parse>
</source>
<source>
@type tail
@id in_tail_daemonset_logs
@label @containers
path /var/log/containers/*opa*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
pos_file /var/log/daemonset.log.pos
tag daemonset.**
read_from_head true
follow_inodes true
rotate_wait 0
<parse>
@type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
time_format %Y-%m-%dT%H:%M:%S.%N%:z
</parse>
</source>
<label @raw.containers>
<match **>
@id raw.detect_exceptions
@type detect_exceptions
remove_tag_prefix raw
@label @containers
multiline_flush_interval 1s
max_bytes 500000
max_lines 1000
</match>
</label>
<label @containers>
<filter corecontainers.**>
@type prometheus
<metric>
name fluentd_input_status_num_corecontainer_records_total
type counter
desc The total number of incoming corecontainer records
</metric>
</filter>
<filter container.**>
@type prometheus
<metric>
name fluentd_input_status_num_container_records_total
type counter
desc The total number of incoming container records
</metric>
</filter>
<filter daemonset.**>
@type prometheus
<metric>
name fluentd_input_status_num_daemonset_records_total
type counter
desc The total number of incoming daemonset records
</metric>
</filter>
<filter **>
@type record_transformer
@id filter_containers_stream_transformer
<record>
seal_id "110628"
cluster_name "logging"
stream_name ${tag_parts[4]}
</record>
</filter>
<filter **>
@type kubernetes_metadata
@id filter_kube_metadata
@log_level error
</filter>
<match corecontainers.**>
@type copy
<store>
@type prometheus
<metric>
name fluentd_output_status_num_corecontainer_records_total
type counter
desc The total number of outgoing corecontainer records
</metric>
</store>
<store>
@type cloudwatch_logs
@id out_cloudwatch_logs_core_containers
region "us-west-2"
log_group_name "/aws/eks/logging/core-containers"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<inject>
time_key time_nanoseconds
time_type string
time_format %Y-%m-%dT%H:%M:%S.%N
</inject>
<buffer>
flush_interval 5s
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</store>
</match>
<match container.**>
@type copy
<store>
@type prometheus
<metric>
name fluentd_output_status_num_container_records_total
type counter
desc The total number of outgoing container records
</metric>
</store>
<store>
@type cloudwatch_logs
@id out_cloudwatch_logs_containers
region "us-west-2"
log_group_name "/aws/eks/logging/containers"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<inject>
time_key time_nanoseconds
time_type string
time_format %Y-%m-%dT%H:%M:%S.%N
</inject>
<buffer>
flush_interval 5s
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</store>
</match>
<match daemonset.**>
@type copy
<store>
@type prometheus
<metric>
name fluentd_output_status_num_daemonset_records_total
type counter
desc The total number of outgoing daemonset records
</metric>
</store>
<store>
@type cloudwatch_logs
@id out_cloudwatch_logs_daemonset
region "us-west-2"
log_group_name "/aws/eks/logging/daemonset"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<inject>
time_key time_nanoseconds
time_type string
time_format %Y-%m-%dT%H:%M:%S.%N
</inject>
<buffer>
flush_interval 5s
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</store>
</match>
</label>
Expected behavior
The test application assigns a sequence number to each log entry. I have a Python notebook that flattens the JSON log output, sorts the logs by sequence number, then finds gaps in the sequence. This is how I know that fluentd is dropping logs. If everything is working as it should, there should be no log loss.
I ran the same tests with Fluent Bit and experienced no log loss.
Your Environment
- Fluentd version: v1.17.1
- Package version:
- Operating system: Amazon Linux 2
- Kernel version: 5.10.225-213.878.amzn2.x86_64
Your Configuration
data:
containers.conf: |-
<source>
@type tail
@id in_tail_container_core_logs
@label @raw.containers
@log_level debug
path /var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*aws-node*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*kube-state*.log
pos_file /var/log/fluentd-core-containers.log.pos
tag corecontainers.**
read_from_head true
follow_inodes true
rotate_wait 0
<parse>
@type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
time_format %Y-%m-%dT%H:%M:%S.%N%:z
</parse>
</source>
<source>
@type tail
@id in_tail_container_logs
@label @raw.containers
path /var/log/containers/*.log
exclude_path /var/log/containers/*aws-node*.log,/var/log/containers/*coredns*.log,/var/log/containers/*kube-proxy*.log,/var/log/containers/*kube-system*.log,/var/log/containers/cloudwatch-agent*.log,/var/log/containers/policy-manager*.log,/var/log/containers/*private-ca*.log,/var/log/containers/metrics-server*.log,/var/log/containers/rbac-controller*.log,/var/log/containers/cluster-autoscaler*.log,/var/log/containers/cwagent*.log,/var/log/containers/*prometheus*.log,/var/log/containers/*nginx*.log,/var/log/containers/*opa*.log,/var/log/containers/*fluentd-cloudwatch*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*kube-state-metrics*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
pos_file /var/log/fluentd-containers.log.pos
tag container.**
read_from_head true
follow_inodes true
rotate_wait 0
<parse>
@type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
time_format %Y-%m-%dT%H:%M:%S.%N%:z
</parse>
</source>
<source>
@type tail
@id in_tail_daemonset_logs
@label @containers
path /var/log/containers/*opa*.log,/var/log/containers/*datadog-agent*.log,/var/log/containers/*ebs-csi-node*.log,/var/log/containers/*ebs-csi-controller*.log,/var/log/containers/*fsx-csi-node*.log,/var/log/containers/*calico-node*.log
pos_file /var/log/daemonset.log.pos
tag daemonset.**
read_from_head true
follow_inodes true
rotate_wait 0
<parse>
@type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
time_format %Y-%m-%dT%H:%M:%S.%N%:z
</parse>
</source>
<label @raw.containers>
<match **>
@id raw.detect_exceptions
@type detect_exceptions
remove_tag_prefix raw
@label @containers
multiline_flush_interval 1s
max_bytes 500000
max_lines 1000
</match>
</label>
<label @containers>
<filter corecontainers.**>
@type prometheus
<metric>
name fluentd_input_status_num_corecontainer_records_total
type counter
desc The total number of incoming corecontainer records
</metric>
</filter>
<filter container.**>
@type prometheus
<metric>
name fluentd_input_status_num_container_records_total
type counter
desc The total number of incoming container records
</metric>
</filter>
<filter daemonset.**>
@type prometheus
<metric>
name fluentd_input_status_num_daemonset_records_total
type counter
desc The total number of incoming daemonset records
</metric>
</filter>
<filter **>
@type record_transformer
@id filter_containers_stream_transformer
<record>
seal_id "110628"
cluster_name "logging"
stream_name ${tag_parts[4]}
</record>
</filter>
<filter **>
@type kubernetes_metadata
@id filter_kube_metadata
@log_level error
</filter>
<match corecontainers.**>
@type copy
<store>
@type prometheus
<metric>
name fluentd_output_status_num_corecontainer_records_total
type counter
desc The total number of outgoing corecontainer records
</metric>
</store>
<store>
@type cloudwatch_logs
@id out_cloudwatch_logs_core_containers
region "us-west-2"
log_group_name "/aws/eks/logging/core-containers"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<inject>
time_key time_nanoseconds
time_type string
time_format %Y-%m-%dT%H:%M:%S.%N
</inject>
<buffer>
flush_interval 5s
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</store>
</match>
<match container.**>
@type copy
<store>
@type prometheus
<metric>
name fluentd_output_status_num_container_records_total
type counter
desc The total number of outgoing container records
</metric>
</store>
<store>
@type cloudwatch_logs
@id out_cloudwatch_logs_containers
region "us-west-2"
log_group_name "/aws/eks/logging/containers"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<inject>
time_key time_nanoseconds
time_type string
time_format %Y-%m-%dT%H:%M:%S.%N
</inject>
<buffer>
flush_interval 5s
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</store>
</match>
<match daemonset.**>
@type copy
<store>
@type prometheus
<metric>
name fluentd_output_status_num_daemonset_records_total
type counter
desc The total number of outgoing daemonset records
</metric>
</store>
<store>
@type cloudwatch_logs
@id out_cloudwatch_logs_daemonset
region "us-west-2"
log_group_name "/aws/eks/logging/daemonset"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<inject>
time_key time_nanoseconds
time_type string
time_format %Y-%m-%dT%H:%M:%S.%N
</inject>
<buffer>
flush_interval 5s
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</store>
</match>
</label>
fluent.conf: |
@include containers.conf
@include systemd.conf
@include host.conf
<match fluent.**>
@type null
</match>
host.conf: |
<source>
@type tail
@id in_tail_dmesg
@label @hostlogs
path /var/log/dmesg
pos_file /var/log/dmesg.log.pos
tag host.dmesg
read_from_head true
<parse>
@type syslog
</parse>
</source>
<source>
@type tail
@id in_tail_secure
@label @hostlogs
path /var/log/secure
pos_file /var/log/secure.log.pos
tag host.secure
read_from_head true
<parse>
@type syslog
</parse>
</source>
<source>
@type tail
@id in_tail_messages
@label @hostlogs
path /var/log/messages
pos_file /var/log/messages.log.pos
tag host.messages
read_from_head true
<parse>
@type syslog
</parse>
</source>
<label @hostlogs>
<filter **>
@type kubernetes_metadata
@id filter_kube_metadata_host
watch false
</filter>
<filter **>
@type record_transformer
@id filter_containers_stream_transformer_host
<record>
stream_name ${tag}-${record["host"]}
</record>
</filter>
<match host.**>
@type cloudwatch_logs
@id out_cloudwatch_logs_host_logs
region "#{ENV.fetch('AWS_REGION')}"
log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/host"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<buffer>
flush_interval 5
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</match>
</label>
kubernetes.conf: |
kubernetes.conf
systemd.conf: |
<source>
@type systemd
@id in_systemd_kubelet
@label @systemd
filters [{ "_SYSTEMD_UNIT": "kubelet.service" }]
<entry>
field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "systemd_unit"}
field_map_strict true
</entry>
path /var/log/journal
<storage>
@type local
persistent true
path /var/log/fluentd-journald-kubelet-pos.json
</storage>
read_from_head true
tag kubelet.service
</source>
<source>
@type systemd
@id in_systemd_kubeproxy
@label @systemd
filters [{ "_SYSTEMD_UNIT": "kubeproxy.service" }]
<entry>
field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "systemd_unit"}
field_map_strict true
</entry>
path /var/log/journal
<storage>
@type local
persistent true
path /var/log/fluentd-journald-kubeproxy-pos.json
</storage>
read_from_head true
tag kubeproxy.service
</source>
<source>
@type systemd
@id in_systemd_docker
@label @systemd
filters [{ "_SYSTEMD_UNIT": "docker.service" }]
<entry>
field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "systemd_unit"}
field_map_strict true
</entry>
path /var/log/journal
<storage>
@type local
persistent true
path /var/log/fluentd-journald-docker-pos.json
</storage>
read_from_head true
tag docker.service
</source>
<label @systemd>
<filter **>
@type kubernetes_metadata
@id filter_kube_metadata_systemd
watch false
</filter>
<filter **>
@type record_transformer
@id filter_systemd_stream_transformer
<record>
stream_name ${tag}-${record["hostname"]}
</record>
</filter>
<match **>
@type cloudwatch_logs
@id out_cloudwatch_logs_systemd
region "#{ENV.fetch('AWS_REGION')}"
log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/dataplane"
log_stream_name_key stream_name
auto_create_stream true
remove_log_stream_name_key true
<buffer>
flush_interval 5
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</match>
</label>
Your Error Log
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode: 101712295 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-mzxxh_default_logger-8bb9a8d2eb65d5c07af7e194aad99176a79941a69c06b6ae390a0d8b9dd06cf1.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 97581155) because an existing watcher for that filepath follows a different inode: 97581154 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-nrq45_default_logger-2bad2e8722fb2369996c134f02dcf4a2fff8068d43863d3f7173a56ff2a8bbd0.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 111149786) because an existing watcher for that filepath follows a different inode: 111149782 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-p4rcl_default_logger-88fb9eaab07505f6d59f03e48e2993069eba82902efe44a46098c0d7d44f24c4.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 77634742) because an existing watcher for that filepath follows a different inode: 77634741 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode: 101712295 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-mzxxh_default_logger-8bb9a8d2eb65d5c07af7e194aad99176a79941a69c06b6ae390a0d8b9dd06cf1.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 97581155) because an existing watcher for that filepath follows a different inode: 97581154 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-nrq45_default_logger-2bad2e8722fb2369996c134f02dcf4a2fff8068d43863d3f7173a56ff2a8bbd0.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 111149786) because an existing watcher for that filepath follows a different inode: 111149782 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-p4rcl_default_logger-88fb9eaab07505f6d59f03e48e2993069eba82902efe44a46098c0d7d44f24c4.log"
2024-11-02 14:06:36 +0000 [warn]: #0 [in_tail_container_logs] Could not follow a file (inode: 77634742) because an existing watcher for that filepath follows a different inode: 77634741 (e.g. keeps watching a already rotated file). If you keep getting this message, please restart Fluentd. filepath="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log"
2024-11-02 14:15:49 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-ps45w_default_logger-90f54592392569f72662a2dacfdca239a907c1da4c1729f7a75bb50f56bc9663.log" inode=77634746 inode_in_pos_file=77634747
***After setting rotate_wait=0 and follow_inodes=true***
2024-11-02 17:26:28 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log
2024-11-02 17:26:32 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log; waiting 0.0 seconds
2024-11-02 17:26:32 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log" inode=152064097 inode_in_pos_file=0
2024-11-02 17:26:32 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log; waiting 0.0 seconds
2024-11-02 17:26:32 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-ckckh_default_logger-1c52a92c2e1ef377d9b6c95dc693b86645d93fbfcf13832ee5337cc9ab201b0b.log" inode=152064099 inode_in_pos_file=0
2024-11-02 17:26:32 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-zwzxv_default_logger-af33706631b5c04250aa71c6956fde092559f09f8891e007dd8d454b12e89135.log; waiting 0.0 seconds
2024-11-02 17:26:32 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-zwzxv_default_logger-af33706631b5c04250aa71c6956fde092559f09f8891e007dd8d454b12e89135.log" inode=112237023 inode_in_pos_file=0
2024-11-02 17:27:48 +0000 [debug]: #0 [in_tail_container_core_logs] tailing paths: target = /var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_copy-fluentd-config-dc7b79cd11ccf90f5b8c512c1552ae13b28abfb2400b2ecd03c12d0ae7ceb564.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_fluentd-cloudwatch-bc8e8da1056c6759e099f6b5b983d44ae7940a4963e376940b3ccacb18a6ab26.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_update-log-driver-992ee8554687722124787066407ad9b21e97e3382b08a216205fda34259a0e03.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-eks-nodeagent-d43b788731adaea1b1e53e23b0cd6c6aa4c15b41afd3f61ccb4f0fe466ae8d30.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-node-688f632cd4bffd057003bcfa31b3546f4d64546e737645174cebc611f97e8e15.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-vpc-cni-init-f59e23252a414c9f2041222c095f86766775eb70d37dd3fd89690978f2f554d0.log,/var/log/containers/kube-proxy-6z8zd_kube-system_kube-proxy-a1aae65c089af12b388a0527ebf25f7418eed956da5b284dace2702d58f422df.log,/var/log/containers/coredns-787cb67946-6dfhl_kube-system_coredns-f8b53737ad2d4133a9d9ac69f9f56bfbc9e7afb54d3dc91e6f7489009365ea17.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-attacher-6530ac17c228aeca7e39958a1aa2f02da5878bf3b6b2fb643b5f43b53fcdf0b9.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-provisioner-d3d1c4db5b0837aabf2cb3676951e85bd63c8d432b47b07770ad3d226f3be522.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-resizer-ea911f783028d85009ebe185d03d602a8eb64fa2fe80da03082703caa69584d8.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_ebs-plugin-db350e781604de4725003c8f38a03f4ca2a1eec021c61005565a3caff3cd4733.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_liveness-probe-db10e53f8e6ecef8fab33ca7e68db83f3070dc406680fc4eb6858bffe431a37f.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_ebs-plugin-bb331132e02cb3ee93c1a2cf5225cd14b2b2d063846e5e1e578665d0679d23ec.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_liveness-probe-a5f50e5e9490b16833b6fed1d29caf9ccb352dbb8852ec4cf5c93781ad61afd2.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_node-driver-registrar-9d0b426f9ebb91798f1d9d444a6d728b09f926794c471229e6f5f4d54891a07a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-b93a02fa5321cba6f33ca5b809c948f9469ea8ffa2f320443960009196ba520a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-init-b02cdb94178b436faaaf7f9a1e97d131046b38716434e2db474b1d5026a66ff0.log | existing = 
/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_copy-fluentd-config-dc7b79cd11ccf90f5b8c512c1552ae13b28abfb2400b2ecd03c12d0ae7ceb564.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_fluentd-cloudwatch-bc8e8da1056c6759e099f6b5b983d44ae7940a4963e376940b3ccacb18a6ab26.log,/var/log/containers/fluentd-cloudwatch-ztwmc_amazon-cloudwatch_update-log-driver-992ee8554687722124787066407ad9b21e97e3382b08a216205fda34259a0e03.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-eks-nodeagent-d43b788731adaea1b1e53e23b0cd6c6aa4c15b41afd3f61ccb4f0fe466ae8d30.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-node-688f632cd4bffd057003bcfa31b3546f4d64546e737645174cebc611f97e8e15.log,/var/log/containers/aws-node-vgl9d_kube-system_aws-vpc-cni-init-f59e23252a414c9f2041222c095f86766775eb70d37dd3fd89690978f2f554d0.log,/var/log/containers/kube-proxy-6z8zd_kube-system_kube-proxy-a1aae65c089af12b388a0527ebf25f7418eed956da5b284dace2702d58f422df.log,/var/log/containers/coredns-787cb67946-6dfhl_kube-system_coredns-f8b53737ad2d4133a9d9ac69f9f56bfbc9e7afb54d3dc91e6f7489009365ea17.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-attacher-6530ac17c228aeca7e39958a1aa2f02da5878bf3b6b2fb643b5f43b53fcdf0b9.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-provisioner-d3d1c4db5b0837aabf2cb3676951e85bd63c8d432b47b07770ad3d226f3be522.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_csi-resizer-ea911f783028d85009ebe185d03d602a8eb64fa2fe80da03082703caa69584d8.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_ebs-plugin-db350e781604de4725003c8f38a03f4ca2a1eec021c61005565a3caff3cd4733.log,/var/log/containers/ebs-csi-controller-5ddc98b494-n2c22_kube-system_liveness-probe-db10e53f8e6ecef8fab33ca7e68db83f3070dc406680fc4eb6858bffe431a37f.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_ebs-plugin-bb331132e02cb3ee93c1a2cf5225cd14b2b2d063846e5e1e578665d0679d23ec.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_liveness-probe-a5f50e5e9490b16833b6fed1d29caf9ccb352dbb8852ec4cf5c93781ad61afd2.log,/var/log/containers/ebs-csi-node-5w6n2_kube-system_node-driver-registrar-9d0b426f9ebb91798f1d9d444a6d728b09f926794c471229e6f5f4d54891a07a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-b93a02fa5321cba6f33ca5b809c948f9469ea8ffa2f320443960009196ba520a.log,/var/log/containers/eks-pod-identity-agent-p9szr_kube-system_eks-pod-identity-agent-init-b02cdb94178b436faaaf7f9a1e97d131046b38716434e2db474b1d5026a66ff0.log
2024-11-02 17:27:49 +0000 [debug]: #0 [in_tail_container_core_logs] tailing paths: target = /var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_copy-fluentd-config-e1c4560f70a672f811586c42239cd8f823c2da7afe504f49af7965f019091f57.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_fluentd-cloudwatch-0e493d532c0a48ae46aed7b6500431b93b0403acd74dd6ff92049c571be9e402.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_update-log-driver-a7799851e03ac287f48cbc63552c5b31016106061ba40493ad644e8a10016e62.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-2a82275bdf85fdb8ac57a6d9e4c927919eb8472e10ffaf77a0290c291111d629.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-a410bd11314ce2fff148d5effd863b8502f0aadf4d492c94c5d841c388b927f4.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-node-0f0417f969145e80e9de2474148256bf009ac84094d26453c53fd5c1c1b0ad6d.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-vpc-cni-init-ffcd1ff811ff67d406fe64096ef05cd9db75666ed1c8efbfbd303f7d09e3c95e.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-32285f83bc32feb2f06700f235ff9db332b23c355b1b7c17b9deaab4a3bcf531.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-a3726048ebd5dceb76fe36e6fadeff5010c6e242aef6bc8f73f4e935a1f4f88c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-170f21c4cd43ac571eadd5d2f7992734ac46ef62cfca08ae3b4dd9b0bcb7657c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-cd01a35e8ddbb4255538b165a64aede38b23cc6926a02dc606f7a568edd3a54d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-attacher-d572d6f311a78a938f22648838d5b85c7c757c0b4cfba2d23f88721a4d969181.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-provisioner-8bb2b99746ddac4a5c72285e2a887bad3d733c5ad66e4f139326a5d8e3bca70e.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-resizer-8ea3c5ce40e31197c5f1f1b922a9b976a5f6bffe499c4a4c6b6db468bc2a421d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_ebs-plugin-dc900b9e6db16ea65db1bad89d640664140423a92868735f45e1389af16a4233.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_liveness-probe-ddb3d10390ebe8b9457ffddf7e375e4d5d42ae9b7c3d0f52f94baa459527f2fd.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-922bec251cadd0bc8c39edddceedaa48fc978968533bef0e47f4cfe1a9bc06b7.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-acb6c394d637726269f1fd5ea9818ecc1706596091338e60a4d3720d1e39deac.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-3ef28982a1e8ed79e8500e05a07f203af6f379f4cd10f31d0dcbe30649271b68.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-7fb635bdc56be11e79798b4e93150a933da72a0e5c17c13ab04e542ee474b651.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-3dbefb298de8507fced55cfa673fc5513c4b9aecfcefb864196de4885bc180b9.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-cf3ab228b12f1509984a0fc9ece0cb77672cd535936bf7aff366ffdce70cd4b6.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-27e3fe2cdbb873aef975b154c8007f769c5992b59226c8c3f059db1dc197ab4a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-6b685d7c878bed82856f3adb5a4cc0587f114cc3af38e378504540166215c69a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-
init-3f779997a0b284a999b0505f1424a4b30af12d143a2a243a74dde7e2c9bd0de9.log,/var/log/containers/prometheus-0_lens-metrics_chown-394770bcd616d0c3d8380fcdbd07ca09fc00738fe17e5f15e5315c9d17312e25.log,/var/log/containers/prometheus-0_lens-metrics_prometheus-e713ff6ca1cb5d4e3d09fb1c07d70f4778efe32f94a4a4f89c7d5e3086ed866b.log | existing = /var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_copy-fluentd-config-e1c4560f70a672f811586c42239cd8f823c2da7afe504f49af7965f019091f57.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_fluentd-cloudwatch-0e493d532c0a48ae46aed7b6500431b93b0403acd74dd6ff92049c571be9e402.log,/var/log/containers/fluentd-cloudwatch-bwdpf_amazon-cloudwatch_update-log-driver-a7799851e03ac287f48cbc63552c5b31016106061ba40493ad644e8a10016e62.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-2a82275bdf85fdb8ac57a6d9e4c927919eb8472e10ffaf77a0290c291111d629.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-eks-nodeagent-a410bd11314ce2fff148d5effd863b8502f0aadf4d492c94c5d841c388b927f4.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-node-0f0417f969145e80e9de2474148256bf009ac84094d26453c53fd5c1c1b0ad6d.log,/var/log/containers/aws-node-9b2rk_kube-system_aws-vpc-cni-init-ffcd1ff811ff67d406fe64096ef05cd9db75666ed1c8efbfbd303f7d09e3c95e.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-32285f83bc32feb2f06700f235ff9db332b23c355b1b7c17b9deaab4a3bcf531.log,/var/log/containers/kube-proxy-4xl5d_kube-system_kube-proxy-a3726048ebd5dceb76fe36e6fadeff5010c6e242aef6bc8f73f4e935a1f4f88c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-170f21c4cd43ac571eadd5d2f7992734ac46ef62cfca08ae3b4dd9b0bcb7657c.log,/var/log/containers/coredns-787cb67946-c7jg2_kube-system_coredns-cd01a35e8ddbb4255538b165a64aede38b23cc6926a02dc606f7a568edd3a54d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-attacher-d572d6f311a78a938f22648838d5b85c7c757c0b4cfba2d23f88721a4d969181.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-provisioner-8bb2b99746ddac4a5c72285e2a887bad3d733c5ad66e4f139326a5d8e3bca70e.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_csi-resizer-8ea3c5ce40e31197c5f1f1b922a9b976a5f6bffe499c4a4c6b6db468bc2a421d.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_ebs-plugin-dc900b9e6db16ea65db1bad89d640664140423a92868735f45e1389af16a4233.log,/var/log/containers/ebs-csi-controller-5ddc98b494-zksgf_kube-system_liveness-probe-ddb3d10390ebe8b9457ffddf7e375e4d5d42ae9b7c3d0f52f94baa459527f2fd.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-922bec251cadd0bc8c39edddceedaa48fc978968533bef0e47f4cfe1a9bc06b7.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_ebs-plugin-acb6c394d637726269f1fd5ea9818ecc1706596091338e60a4d3720d1e39deac.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-3ef28982a1e8ed79e8500e05a07f203af6f379f4cd10f31d0dcbe30649271b68.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_liveness-probe-7fb635bdc56be11e79798b4e93150a933da72a0e5c17c13ab04e542ee474b651.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-3dbefb298de8507fced55cfa673fc5513c4b9aecfcefb864196de4885bc180b9.log,/var/log/containers/ebs-csi-node-8w97r_kube-system_node-driver-registrar-cf3ab228b12f1509984a0fc9ece0cb77672cd535936bf7aff366ffdce70cd4b6.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-27e3fe2cdbb873aef975b154c8007f769c5992b59226c8c3f059db1dc197ab4a
.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-6b685d7c878bed82856f3adb5a4cc0587f114cc3af38e378504540166215c69a.log,/var/log/containers/eks-pod-identity-agent-lkbzw_kube-system_eks-pod-identity-agent-init-3f779997a0b284a999b0505f1424a4b30af12d143a2a243a74dde7e2c9bd0de9.log,/var/log/containers/prometheus-0_lens-metrics_chown-394770bcd616d0c3d8380fcdbd07ca09fc00738fe17e5f15e5315c9d17312e25.log,/var/log/containers/prometheus-0_lens-metrics_prometheus-e713ff6ca1cb5d4e3d09fb1c07d70f4778efe32f94a4a4f89c7d5e3086ed866b.log
2024-11-02 17:27:54 +0000 [info]: #0 [filter_kube_metadata_host] stats - namespace_cache_size: 0, pod_cache_size: 0
2024-11-02 17:27:54 +0000 [info]: #0 [filter_kube_metadata_host] stats - namespace_cache_size: 0, pod_cache_size: 0
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-hw4ds_default_logger-aba43bbd009d1652e1961dbd30ed45f09e337bfb42d3fa247b12fde7af248909.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-jtxmz_default_logger-742ba4e5339168b7b5442745705bbfed1d93c832027ca0c680b193c9c62e796f.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-kmrlv_default_logger-7682a4b64550055203e19ff9387b686e316fe4e5e7884b720dede3692659c686.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-ptf4k_default_logger-88c30f214da39c81d5fc04466eacddf79278dcd9f99402e5c051243e26b7218f.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-rnm4s_default_logger-df9566f71c1fd7ab074850d94ee4771ea24d9b653599a61cce791f7e221224c2.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-vvrtx_default_logger-37eb38772106129b0925b5fdb8bc20f378c6156ef510d787ec35c57fd3bd68bc.log failed. Continuing without tailing it.
2024-11-02 17:27:59 +0000 [warn]: #0 stat() for /var/log/containers/logger-deployment-57cc6745c7-z9cxt_default_logger-c49720681936856bf6d2df5df3f35561a56d62f4c6a7d65aea8c7e0d70c37ad8.log failed. Continuing without tailing it.
Additional context
No response
Consistently seeing the following errors in the logs (after changing rotate_wait to 60s):
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log; waiting 60.0 seconds
2024-11-04 15:27:48 +0000 [warn]: #0 [in_tail_container_logs] Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log" inode=100695028 inode_in_pos_file=0
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log; waiting 60.0 seconds
2024-11-04 15:27:48 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log
Contents of containers.log.pos file:
/var/log/containers/karpenter-76785c6874-gjsjq_karpenter-system_controller-2008ed03f1b7010e3a10bd6249585a91ea4f52b7bb807abdbffee2012e3634e5.log 0000000000009870 00000000025000d3
/var/log/containers/aws-guardduty-agent-kct6q_amazon-guardduty_aws-guardduty-agent-ce5502b765a04b99c5bc04c9cb3d110d6be023626430780c03c0df7ac25360fb.log 0000000000000e1f 000000000070bb0f
/var/log/containers/aws-guardduty-agent-kct6q_amazon-guardduty_aws-guardduty-agent-7b219cd69b2abd4809d569dd8810052a2c1cc2c139f42589b879db518fb42c98.log 0000000000000e1f 000000000070c062
/var/log/containers/karpenter-76785c6874-gjsjq_karpenter-system_controller-b7642902d9be8cf37a8f2e0e05bf858cdaa6e226a89947538d3856bf25d669a4.log 0000000000009018 00000000025000c1
/var/log/containers/node-exporter-tvjg5_lens-metrics_node-exporter-429a1e98cabdf9227e3d222649c64cbd37200d42148f0aa3c461a6293d25c57f.log 0000000000001fc2 0000000002e00b5e
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000006007bf4
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000006007bf9
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000006007bfb
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000006007bfc
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000006007bfd
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000006007bfb
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log ffffffffffffffff 0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log 0000000000b83ff5 0000000006007bfc
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log 0000000000000000 0000000000000000
/var/log/containers/logger-deployment-57cc6745c7-4c4fb_default_logger-6712c2913db370d75ab57ea84fadb27351e7fc6841ee0005f313ca2df38e44a2.log 00000000003865bb 0000000006007bfd
Thanks for this report. We need to figure out the possible cause. I will investigate this weekend.
Thanks. I've tried different combinations of settings since opening this issue, e.g. using a file buffer, increasing the chunk size, increasing the mem/CPU allocated to the fluentd daemonset, etc. None of them seems to have an impact on Fluentd's ability to tail the logs. It's as if it's losing track of the files it's supposed to tail. I have the notebook I've been using to find gaps in the sequence. Let me know if you want me to post it here.
@daipom I just ran a test where I set the kubelet's containerLogMaxSize to 50Mi (the default is 10Mi). After doing that I saw zero log loss. I'm not totally sure why that would be. My only guess is that the files are being rotated less often and so there are fewer files for fluentd to keep track of.
@daipom Do you think increasing the number of workers and allocating them to the source blocks for @type tail would help with smaller log files?
I tried it briefly in my local environment, but I could not reproduce this. Do we need Kubernetes to reproduce it?
@jicowan Can you reproduce this without Kubernetes?
I only tried this on k8s. I ran multiple replicas of it (at least 10). When the logs grew to 10MB, they were rotated by the kubelet. That's where I saw the issue. Fluentd lost track of the inodes because the files were being rotated so quickly.
@jicowan We are trying to reproduce this issue. Could you please tell us how to reproduce this in detail? I can run the node with the test application, but I don't know how to collect the output. Do we need another Fluentd node to reproduce this? Or should we use a sidecar?
I think I need to have a file like /var/log/containers/... and collect it by in_tail, but I don't know how to do that.
If I set up a pod as in To Reproduce, the logs will be output to standard output.
Sorry I'm not familiar with K8s, but I need a detailed procedure to reproduce this.
First you need a Kubernetes cluster (try not to use KIND, MiniKube, or another single-node version of Kubernetes). Then you need to install the Fluentd DaemonSet. You can download the manifests from here. I used the version for Amazon CloudWatch, but you can use a different backend if you like. So long as it can absorb the volume of logs that you're sending to it, the choice of backend shouldn't affect the results of the tests. The default log file size is 10MB. At 10MB, the kubelet (the Kubernetes "agent") will rotate the log file.
You can use the Kubernetes Deployment I created to deploy the logging application:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logger-deployment
  labels:
    app: logger
spec:
  replicas: 1 # Adjust the number of replicas as needed
  selector:
    matchLabels:
      app: logger
  template:
    metadata:
      labels:
        app: logger
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - logger
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: logger
        image: jicowan/logger:v3.0
        resources:
          requests:
            cpu: 4
            memory: 128Mi
          limits:
            cpu: 4
            memory: 256Mi
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
The configuration for Fluentd is typically stored in a ConfigMap. If this isn't descriptive enough, I can walk you through the configuration during a web conference.
I can't verify this is happening yet, but it may be that the files are being rotated so fast that fluentd doesn't have enough time to read them before they are compressed. As the kubelet rotates the logs, it renames the 0.log file and eventually compresses the rotated copies.
I'm trying to reproduce this in a local environment using minikube.
However, I can't reproduce it yet.
I'd like to know what environment yours is running on.
Is your environment on AWS?
I'm going to try to reproduce it in that environment.
Yes, the environment was on AWS. You can use this eksctl configuration file to provision a similar environment. You can adjust the maximum size of the log file by changing the value of containerLogMaxSize. The default is 10Mi. The default containerLogMaxWorkers is 1. I also changed the storage type from gp3 to io1 because I was using a file buffer and wanted a disk with better IO characteristics. You can change it back to gp3 if you want.
# An advanced example of ClusterConfig object with customised nodegroups:
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: logging
  region: us-west-2
  version: "1.30"
nodeGroups:
  - name: ng3
    instanceType: m5.4xlarge
    desiredCapacity: 2
    privateNetworking: true
    ssh:
      enableSsm: true
    kubeletExtraConfig:
      containerLogMaxWorkers: 5
      containerLogMaxSize: "50Mi"
    ebsOptimized: true
    volumeType: io1
iam:
  withOIDC: true
accessConfig:
  authenticationMode: API_AND_CONFIG_MAP
vpc:
  nat:
    gateway: Single
If you send the logs to CloudWatch, you'll need to use IRSA or pod identities to assign an IAM role to the pod.
If the log files are rotated in a shorter time than specified in refresh_interval, they may not be handled properly.
The workaround would be to shorten the refresh_interval, or increase the size limit of the rotation file to extend the rotation time.
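As a rough illustration of the first workaround, a shortened refresh_interval on an in_tail source might look like the sketch below; the path, tag, and pos_file are placeholders rather than values from this issue.
<source>
  @type tail
  @id in_tail_app_logs
  # Placeholder path for illustration
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-app-containers.log.pos
  tag app.**
  read_from_head true
  # Re-scan the watched paths every second so newly rotated files are picked up sooner
  refresh_interval 1
  <parse>
    @type json
  </parse>
</source>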
If you increase the size limit of the rotation file, then because fluentd reads slower than the logs are written, at some point you will lose one of the rotation files.
The refresh interval is set to 1 @Watson1978. @slopezxrd I can't verify this yet, but if Fluentd is unable to read the logs fast enough, they will get compressed [by the Kubelet] before it has had time to read the whole file which will result in lost logs. If you look at the code for the Kubelet, it has already accounted for this once before, https://github.com/kubernetes/kubernetes/blob/f1b3fdf7e6d40714b1a43757221832aa1c4a49d1/pkg/kubelet/logs/container_log_manager.go#L451-L472.
Sorry for the late response. I have been investigating this issue for a while.
Now I recommend the following configuration for running on Kubernetes.
Recommended configuration
<source>
@type tail
follow_inodes true
rotate_wait 0
path /var/log/containers/...path to your app logs...
...
</source>
follow_inodes true
With follow_inodes false, if a log file rotation is detected, a new log file may not be read until the refresh_interval has elapsed.
I recommend setting follow_inodes true to avoid this behavior.
rotate_wait 0
With follow_inodes true, it will display many warning messages of Skip update_watcher because watcher has been already updated....
Setting rotate_wait 0 might suppress this message, and you can ignore the Skip update_watcher because watcher has been already updated... warning message.
There is no problem with Fluentd's behavior when that message is displayed.
path /var/log/containers/...path to your app logs...
There is a symbolic link to the application log under /var/log/containers/.
It would be sufficient to use that as the read target.
Warning messages
You can ignore the following warning messages. There is no problem with Fluentd's behavior when that message is displayed.
- Skip update_watcher because watcher has been already updated...
- Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode...
I will fix these warning messages or relax the warning log level.
@Watson1978 Thanks for investigating!
So, the problem is that the rotation occurs at very high speed.
In that case, it is certainly better to set follow_inodes false (default) and rotate_wait 0.
Warning messages
You can ignore the following warning messages. There is no problem with Fluentd's behavior when that message is displayed.
- Skip update_watcher because watcher has been already updated...
- Could not follow a file (inode: 101712298) because an existing watcher for that filepath follows a different inode...
I will fix these warning messages or relax the warning log level.
Yes! There was a bug in older versions that could cause in_tail collection to stop without an error log. These warning logs were placed at that time as a precaution.
In this case, the fast rotation causes this warning, but there seems to be no problem with the collection. So, as @Watson1978 says, you can ignore these warnings. These logs should be fixed, considering the case of fast rotations.
During heavy log volumes, e.g. >10k log entries per second, fluentd consistently drops logs.
Hmm, does setting follow_inodes false and rotate_wait 0 cause log loss?
Looks like we need to investigate the log loss problem more.
I've tried setting follow_inodes to true and false. I see log loss in both instances. My refresh_interval is currently set to 1, my rotate_wait is set to 0. I think Fluentd is falling so far behind when tailing the logs that the log file is getting compressed before it can finish reading the file.
@jicowan Sorry for my late response. I have investigated this issue and I have found the cause.
As a conclusion, if there is a log file that receives a high volume of logs faster than in_tail can read, it makes the collection unstable.
For such files, please separate the in_tail setting into multiple <source> blocks, knowing that it will be unstable.
Do not mix settings with other normal-size file collections into one <source>.
In addition, if such files exist, the following settings will help stabilize the collection by limiting the amount of collection per unit of time, but please note that it will reduce total throughput.
If such files exist, it will be fundamentally challenging to prevent log loss, but I would be willing to consider possible improvements in future versions.
Here are the details.
Cause
- If a log file receives a high volume of logs faster than in_tail can read, it becomes busy and causes delays in other processes on that in_tail config.
- If log rotation detection is too slow and multiple rotations occur in the meantime, some files may be missed and not collected.
Workarounds
- Split the in_tail config into smaller ones:
  - Avoid specifying too many log targets in a single path.
  - This can improve performance since each in_tail config runs in its own thread.
- Use read_bytes_limit_per_second:
  - Prevents large log files from blocking other processing.
  - Note: May reduce overall throughput.
- Use the <group> directive (see the configuration sketch at the end of this comment):
  - Similar to read_bytes_limit_per_second, but more powerful.
  - Allows limiting logs by group, such as per pod.
  - Useful to prevent high-volume pods from affecting log collection from others.
Other Notes
- follow_inodes is usually not relevant:
  - It's generally not needed unless there's a specific reason.
  - Avoid using it if the log rotation interval is shorter than refresh_interval.
- rotate_wait also has limited effect, but should be set lower than refresh_interval.
- Frequent warning logs like the following may indicate that in_tail is too busy to collect all data stably:
  - Could not follow a file ...
  - stat() for ... failed. Continuing without tailing it.
  - Skip update_watcher because ...
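To make the read_bytes_limit_per_second and <group> items above concrete, here is a minimal, hedged configuration sketch. The paths, tags, pos_file locations, the group pattern, and the limit values are illustrative assumptions rather than settings taken from this issue, and the <group>/<rule> parameter details should be double-checked against the in_tail documentation for the Fluentd version in use.
# Sketch only: a dedicated source for the high-volume app so it cannot starve the other tails.
<source>
  @type tail
  @id in_tail_heavy_logger
  # Hypothetical path matching only the noisy workload
  path /var/log/containers/logger-deployment*.log
  pos_file /var/log/fluentd-heavy-logger.log.pos
  tag heavy.**
  read_from_head true
  # Cap how many bytes this source reads per second (example value)
  read_bytes_limit_per_second 1048576
  <parse>
    @type json
  </parse>
</source>
# Sketch only: everything else, optionally rate-limited per pod via <group>.
<source>
  @type tail
  @id in_tail_other_containers
  path /var/log/containers/*.log
  exclude_path /var/log/containers/logger-deployment*.log
  pos_file /var/log/fluentd-other-containers.log.pos
  tag container.**
  read_from_head true
  <group>
    # Named captures in the pattern define the grouping keys used by <rule> (pattern is an assumption)
    pattern /^\/var\/log\/containers\/(?<podname>.+)_(?<namespace>[^_]+)_(?<container>.+)-(?<docker_id>[a-f0-9]{64})\.log$/
    rate_period 30s
    <rule>
      # Example rule: cap pods in the default namespace at 5000 lines per rate_period
      match {
        "namespace": ["default"]
      }
      limit 5000
    </rule>
  </group>
  <parse>
    @type json
  </parse>
</source>
Splitting the noisy application files into their own <source> keeps their backlog from delaying rotation handling for the quieter files, at the cost of an extra thread and position file.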
Thanks for investigating this issue @daipom. I don't think we can use <group> here because the container runtime is programmed to write logs to a file, e.g. /var/log/containers/container_id/log_0.log, and the kubelet rotates that file (renames it to log_1.log when it reaches a particular size, and compresses it after 2 rotations). I could see using <group> if I were only interested in capturing the logs of a few containers, but for my use case I need to capture all logs from all containers. Can you make a single instance of in_tail multi-threaded so the logs are read faster? I don't see this issue with Fluent Bit. I assume it's because it's written in C++.
I see. Thanks. Then, currently, read_bytes_limit_per_second setting will help stabilize the collection.
Can you make a single instance of in_tail multi-threaded so the logs are read faster? I don't see this issue with Fluent Bit. I assume it's because it's written in C++.
That would be a much larger revision. And it must be made very carefully to avoid making new bugs.
Besides, I don't know how much multi-threading would increase the overall speed.
Since Ruby has GVL, if the reading cannot keep up with the current implementation, then multi-threading would not help increase speed very much.
It could be the same as setting read_bytes_limit_per_second.
I will do a little research to see if there are improvements that might be possible for v1.19, but I don't think we will be able to make a very big fix in time.
If you have too many logs for a particular file, you can use the workaround described in https://github.com/fluent/fluentd/issues/4693#issuecomment-2796483918.
If there are too many logs overall to be read, then we can use Fluentd's multi-worker feature, or run multiple Fluentd instances, or reduce the amount of logs generated.
Now this issue is to discuss improvements in the following points. (Maybe we should create another issue, but for now...)
As a conclusion, if there is a log file that receives a high volume of logs faster than in_tail can read, it makes the collection unstable. For such files, please separate the in_tail setting into multiple <source> blocks, knowing that it will be unstable. Do not mix settings with other normal-size file collections into one <source>.
In addition, if such files exist, the following settings will help stabilize the collection by limiting the amount of collection per unit of time, but please note that it will reduce total throughput.
If such files exist, it will be fundamentally challenging to prevent log loss, but I would be willing to consider possible improvements in future versions.
Cause
- If a log file receives a high volume of logs faster than in_tail can read, it becomes busy and causes delays in other processes on that in_tail config.
- If log rotation detection is too slow and multiple rotations occur in the meantime, some files may be missed and not collected.
Can you make a single instance of in_tail multi-threaded so the logs are read faster? I don't see this issue with Fluent Bit. I assume it's because it's written in C++.
That would be a much larger revision. And it must be made very carefully to avoid making new bugs.
Besides, I don't know how much multi-threading would increase the overall speed. Since Ruby has GVL, if the reading cannot keep up with the current implementation, then multi-threading would not help increase speed very much. It could be the same as setting read_bytes_limit_per_second.
I will do a little research to see if there are improvements that might be possible for v1.19, but I don't think we will be able to make a very big fix in time.
Ran into a similar issue.
I increased the number of workers and broke up the log file reading into separate workers explicitly.
Before I did this, fluentd was not using more than 1000m (or the equivalent of 1 core); after the change, I saw it start using more than 1000m.
example config:
<system>
workers 3
</system>
<worker 0>
<source>
@type tail
path /var/log/containers/*.log
exclude_path ["/var/log/containers/*heavyloggingcontainer*.log"]
pos_file /opt/fluentd-containers.pos
tag kubernetes.*
read_from_head true
refresh_interval 5
rotate_wait 0
follow_inodes true
<parse>
@type cri
</parse>
</source>
</worker>
<worker 1>
<source>
@type tail
path /var/log/containers/*heavyloggingcontainer*.log
pos_file /opt/fluentd-containers-1.pos
tag kubernetes.*
read_from_head true
refresh_interval 5
rotate_wait 0
follow_inodes true
<parse>
@type cri
</parse>
</source>
</worker>