
Tail input plugin not fetching all logs when files are rotated quickly under heavy load

Open thepythonista opened this issue 4 years ago • 20 comments

Bug Report

Describe the bug The tail input plugin is not able to tail files when file rotation happens. I used a container that generates 1,000,000 lines and logs them to stdout, and configured fluent-bit to tail the log files and print them to standard output. The tail plugin only fetches ~200,000 lines (only from the first file); it is not able to fetch lines after rotation happens under heavy load.

To Reproduce

  • Example log message if applicable:
{"log":"132.224.216.5 - roob8562 [17/Apr/2020:19:26:55 +0000] \"GET /architect HTTP/1.1\" 403 3271\n","stream":"stdout","time":"2020-04-17T19:27:02.669131074Z"}
  • Steps to reproduce the problem:

Expected behavior Generate 1,000,000 log lines, write them to standard output, and expect fluent-bit to tail and fetch all 1,000,000 of them.
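
For reference, a minimal generator pod along these lines reproduces the write pattern; the image, names and the emitted line are illustrative placeholders, not the exact workload used in this report (the pod name is prefixed with plaintext- only so its container log matches the tail Path pattern below):

apiVersion: v1
kind: Pod
metadata:
  name: plaintext-loadgen
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: loadgen
    image: busybox:1.36
    command: ["/bin/sh", "-c"]
    args:
      - |
        # write 1,000,000 lines to stdout as fast as possible
        i=0
        while [ "$i" -lt 1000000 ]; do
          echo "192.0.2.10 - user$i [17/Apr/2020:19:26:55 +0000] \"GET /architect HTTP/1.1\" 403 3271"
          i=$((i+1))
        done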

Your Environment

  • Version used: 1.4.2
  • Configuration:
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.15.10
  • Server type and version:
  • Operating System and version: Linux
  • Filters and plugins: Tail and Standard Output

Additional context

fluent-bit configuration

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: default
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config-ds
  namespace: default
  labels:
    k8s-app: fluent-bit
data:
  # Configuration files: server, input, filters and output
  # ======================================================
  # 
  
  fluent-bit.conf: |
    [SERVICE]
        Flush                     1
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               On
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        storage.path              /var/log/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 500M

    @INCLUDE input-kubernetes.conf
    @INCLUDE output-stdout.conf

  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/plaintext-*.log
        DB                /var/log/flb_kube.db
        Parser            docker
        Mem_Buf_Limit     500MB
        storage.type      filesystem
        Skip_Long_Lines   On
        Refresh_Interval  1
        Rotate_Wait       10

  output-stdout.conf: |
    [OUTPUT]
        Name                          stdout
        Match                         kube.*
        Format                        json_lines
        json_date_format              iso8601
        
  parsers.conf: |
    [PARSER]
        Name          stdout
        Format        json
        Time_Key      time
        Time_Format   %Y-%m-%dT%H:%M:%S.%L
        Time_Keep     On
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fl-exp
  namespace: default
  labels:
    k8s-app: fluent-bit-logging
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit-logging
  template:
    metadata:
      labels:
        k8s-app: fluent-bit-logging
        version: v1
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:1.4.2
        imagePullPolicy: IfNotPresent
        ports:
          - containerPort: 2020
        resources:
          requests:
            cpu: 1000m
            memory: 1024Mi
          limits:
            cpu: 1000m
            memory: 1024Mi
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
      terminationGracePeriodSeconds: 10
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config-ds
      serviceAccountName: fluent-bit
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"

Metrics from one of the runs

# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="storage_backlog.1"} 0 1587153190798
fluentbit_input_bytes_total{name="tail.0"} 109649569 1587153190798
# TYPE fluentbit_input_files_closed_total counter
fluentbit_input_files_closed_total{name="tail.0"} 1 1587153190798
# TYPE fluentbit_input_files_opened_total counter
fluentbit_input_files_opened_total{name="tail.0"} 2 1587153190798
# TYPE fluentbit_input_files_rotated_total counter
fluentbit_input_files_rotated_total{name="tail.0"} 0 1587153190798
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="storage_backlog.1"} 0 1587153190798
fluentbit_input_records_total{name="tail.0"} 567607 1587153190798
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="stdout.0"} 0 1587153190798
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="stdout.0"} 109649569 1587153190798
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="stdout.0"} 567607 1587153190798
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="stdout.0"} 0 1587153190798
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="stdout.0"} 0 1587153190798
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1587151599

The plugin is not able to keep up if the fluentbit_input_bytes_total rate goes beyond ~100 KB per second. I used the query rate(fluentbit_input_bytes_total{name="tail.0"}[5m]).
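
For reference, the related queries against the metrics shown above are plain PromQL (the 5m window is just an example):

# bytes/s and records/s ingested by the tail input
rate(fluentbit_input_bytes_total{name="tail.0"}[5m])
rate(fluentbit_input_records_total{name="tail.0"}[5m])

# file opens/closes and rotations seen by the tail input over the same window
increase(fluentbit_input_files_opened_total{name="tail.0"}[5m])
increase(fluentbit_input_files_closed_total{name="tail.0"}[5m])
increase(fluentbit_input_files_rotated_total{name="tail.0"}[5m])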

thepythonista avatar Apr 17 '20 19:04 thepythonista

We encounter the same behaviour. If logs are written to STDOUT under high load, fluent-bit stops working at a docker log rotation and resumes at the next rotation. So it works for one rotation interval, not for the next, and so on.

We also ran some small tests, which suggest the problem occurs somewhere between 1000 and 1250 log lines per second per node. The log size does not seem to matter; we tested with 132 and 446 characters per line and saw the same behaviour.

Environment:

  • Fluent-bit: 1.3.11
  • Docker 19.03.6 (the rotation is configured to 32MB)
  • Kubernetes: 1.16.6
  • Operating System and version: Ubuntu 18.04
  • Plugin config:
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Parser            docker
    DB                /var/log/flb_apps.db
    Mem_Buf_Limit     64MB
    Skip_Long_Lines   On
    Refresh_Interval  10

Tests

  • 1 pod with ~1000msg/s (132 and 446 characters) -> works fine
  • 1 pod with ~1250msg/s (132 characters) -> the described problems
  • 2 pods with ~1000msg/s each (132 characters) -> the described problems
  • 8 pods with ~250msg/s each (132 characters) -> the described problems

It seems that the ~1000msg/s limit applies per host, regardless of how many pods produce the logs. We also played around with the tail plugin options in all combinations, with no effect.

sneben avatar Apr 23 '20 17:04 sneben

How often is the rotation happening? What are the chances the file gets rotated within a Refresh_Interval?

edsiper avatar Apr 24 '20 04:04 edsiper

I did 3 tests. The first 2 with 32MB rotation size:

  • 1 pod / 446 chars/line / ~1000msg/s : 1m 8s until first rotation -> WORKS
  • 1 pod / 136 chars/line / ~2000msg/s : 1m 34s until first rotation -> DOESN'T WORK

And one with 128MB rotation size:

  • 1 pod / 136 chars/line / ~2000msg/s : 6m 7s until first rotation -> DOESN'T WORK

I also set Refresh_Interval to 1s for some tests, which had no effect.

sneben avatar Apr 24 '20 09:04 sneben

Please note this happens only when the tail plugin is configured to tail files from /var/log/containers, which are symlinks.

In one of my other test runs in minikube, docker's daemon.json was configured to rotate every 50 MB and keep the latest 5 files. 1,000,000 lines were generated, which took more than 10 seconds, and my refresh interval was 1 second.
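
For reference, that rotation policy corresponds roughly to the following json-file driver options in daemon.json (the actual file may of course contain other settings too):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}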

ls -lh --time-style=full-iso 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b/
-rw-r----- 1 root root  25M 2020-04-24 19:46:31.559009075 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log
-rw-r----- 1 root root  48M 2020-04-24 19:46:16.625763749 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.1
-rw-r----- 1 root root  48M 2020-04-24 19:45:49.753334243 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.2
-rw-r----- 1 root root  48M 2020-04-24 19:45:19.867398737 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.3

There is a reasonable gap of time between when the files are rotated. Only the contents of 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.3 were accessible. The rest of the files were not tailed.

Input Configuration:

[INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/plaintext-*.log
        DB                /var/log/flb_kube.db
        Parser            docker
        Mem_Buf_Limit     500MB
        Skip_Long_Lines   On
        Refresh_Interval  1
        Rotate_Wait       10

thepythonista avatar Apr 24 '20 19:04 thepythonista

Hey @thepythonista, @sneben ,

I am facing a similar kind of issue. I run Fluent Bit to print these logs to the stdout of a k8s pod and have noticed that if a file is complete (i.e. the application is done writing logs to it), Fluent Bit parses the complete file and prints it to stdout. But if the application is still writing logs to the file, Fluent Bit reads only a few lines in one go when it scans the path, and after that it never prints the logs that are newly added to the file.

Is this related to the above explanation, where the pod is writing more than 1000 msg/s?

Any suggestion ?

v-kay avatar Jun 26 '20 11:06 v-kay

I am seeing the same issue under high scale. Any workarounds/updates?

rashmichandrashekar avatar Dec 12 '20 03:12 rashmichandrashekar

We are experiencing the same issue.

I send in 200k log messages. We are using docker in Kubernetes; the docker log rotation rotates every 50 MB with 5 files max in total. I have increased Rotate_Wait to 240 and Refresh_Interval is 10.

Some file rotations are detected, others not. Later the refresh finds rotated files, which also feels wrong, because all of them were rotated.

I think the problem could be that docker rotates by renaming all the other files: when a file is rotated it gets index 1, and all the other files are renamed with their index increased by one. Fluent Bit should also not try to pick up these renamed files as new files, since they were already rotated.
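
For illustration, the json-file renaming scheme described above looks roughly like this (container ID shortened to abc123; a max-file of 4 assumed):

abc123-json.log    -> abc123-json.log.1   (the file fluent-bit may still be behind on)
abc123-json.log.1  -> abc123-json.log.2
abc123-json.log.2  -> abc123-json.log.3   (anything beyond max-file is deleted)
abc123-json.log       (re-created empty and actively written again)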

Not sure how to reproduce yet in a simple manner.

fkoehler avatar May 12 '21 19:05 fkoehler

I have a similar issue (not sure if it is related or a different issue).

I have a PoC with low volumes (200 events), but Fluent Bit's tail input only fetches the first few log entries in each log file and then hangs. Then, when I restart Fluent Bit, it fetches the rest of the logs from the files. See this stack overflow entry for more details.

dewaldmienie avatar May 26 '21 05:05 dewaldmienie

/watch

ganga1980 avatar Jul 01 '21 00:07 ganga1980

I am also running into a similar issue in which log files that are appended to are not captured by tail and sent to Elasticsearch. Decreasing Refresh_Interval, if anything, makes things worse, as td-agent-bit sees the file before anything is added to it and then promptly ignores any future updates.

Has anyone in this thread found a workaround?

pboennig avatar Aug 05 '21 21:08 pboennig

For my specific issue, I found it was actually a case of #3111, as my logs were stored in AWS EFS. Moving the logs outside of EFS solved the issue.

pboennig avatar Aug 06 '21 20:08 pboennig

I've got the same problem as described by @thepythonista. We only want fluentbit to forward our application-specific logs, so we have a tail path configured like this:

/var/log/containers/*_our-app-*.log

The /var/log/containers dir contains a symlink for each container, e.g. xyz-blue-6bc9c664f4-sg8b4_our-app-blue_xyz-blue-5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb.log

...which points to another symlink, e.g. /var/log/pods/our-app-blue_xyz-blue-6bc9c664f4-sg8b4_9b5687c0-d62c-4e82-af28-c032069eb84c/xyz-blue/0.log

...which in turn points to the container log file currently being written to, e.g. /var/lib/docker/containers/5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb/5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb-json.log.

When docker (the json-file driver) rotates the log file it suffixes the current log file name with .1 and renames the previous .1 file to .2, and so on. I'm wondering if the chain of symlinks is preventing fluentbit from accessing, or being aware of, the *.log.1 file. Under high load we can see that the logs at the end of the *.log.1, *.log.2 etc. files have not been forwarded. Our assumption is that fluentbit has fallen behind reading the current log file, so when that file is renamed, fluentbit can no longer access it or is unaware of it, because the symlinks only point it at the current log file.
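
One way to check what the chain resolves to at a given moment (generic placeholder paths, not the exact ones above):

# first hop: the /var/log/containers entry points at the kubelet pod-log symlink
readlink /var/log/containers/<pod>_<namespace>_<container>-<container-id>.log
# -> /var/log/pods/<namespace>_<pod>_<uid>/<container>/0.log

# fully resolved: always ends at the json file docker is currently writing
readlink -f /var/log/containers/<pod>_<namespace>_<container>-<container-id>.log
# -> /var/lib/docker/containers/<container-id>/<container-id>-json.log

Note that after a rotation the renamed *-json.log.1 file is not referenced by any symlink at all; only the freshly created *-json.log is, which fits the assumption above.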

For what it's worth, we are using the fluent-bit 1.7.4 image; here is my service and input config:

    [SERVICE]
        Flush 1
        Daemon Off
        Log_Level debug
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        #
        # HTTP server for providing metrics endpoint
        #
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port {{ .Values.service.port }}
        storage.path /var/log/flb-storage/
        storage.sync normal
        storage.checksum off
        storage.backlog.mem.limit 20M
     [INPUT]
        Name             tail
        Path             "/var/log/containers/*_our-app-*.log"
        Exclude_Path     "fluent*.log"
        Parser           docker
        Tag_Regex        (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
        Tag              ourcompany.k8s.<namespace_name>.<container_name>.<pod_name>.<docker_id>-
        Refresh_Interval 5
        Mem_Buf_Limit    500MB
        Skip_Long_Lines  On
        DB               /tmp/tail-containers-state.db
        DB.Sync          Normal
        Docker_Mode      On
        storage.type     filesystem
        Rotate_Wait      15

wlawton avatar Nov 11 '21 15:11 wlawton

Hi, is there any specific solution for this issue? I am also facing a similar issue where fluentbit is trying to scrape logs from a mounted NFS dir. If I restart fluentbit, it will tail the logs again but stops after a few minutes.

srideok2 avatar Feb 21 '22 17:02 srideok2

Hello

We have the same issue on 1.8.13

AlessioCasco avatar Mar 11 '22 13:03 AlessioCasco

@AlessioCasco, for me this was resolved after setting the inotify watcher to false. Check this link for more details.

        [INPUT]
            Name              tail
            Inotify_Watcher   false

srideok2 avatar Mar 12 '22 01:03 srideok2

I have almost the same problem. What should the fluent-bit settings be for high loads?

Himenon avatar Mar 12 '22 15:03 Himenon

Yes, this happens even for moderate loads (around 400 logs per second). Could anybody help here?

maheskrishnan avatar Mar 16 '22 21:03 maheskrishnan

We are also affected by this issue for pods as soon as they start logging over 1k messages per second. As soon as the load calms down, log messages are sent as expected. We use 10MB docker rotation and the following input conf:

[INPUT]
    Name                tail
    Tag                 kube.<namespace_name>.<pod_name>_<container_name>-<docker_id>
    Tag_Regex           ^/var/log/containers/(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
    Path                /var/log/containers/*.log
    Parser              docker
    DB                  /var/log/flb_kube.db
    Mem_Buf_Limit       10MB
    Skip_Long_Lines     On
    Refresh_Interval    10
    Docker_Mode         On

miksercz avatar May 10 '22 10:05 miksercz

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Aug 09 '22 02:08 github-actions[bot]

Not stale

aslafy-z avatar Aug 09 '22 06:08 aslafy-z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Nov 08 '22 02:11 github-actions[bot]

Not stale

aslafy-z avatar Nov 08 '22 06:11 aslafy-z

I wonder if it may be related to https://github.com/fluent/fluent-bit/pull/6552

patrick-stephens avatar Jan 26 '23 09:01 patrick-stephens

I wonder if it may be related to #6552

Any chance you could take a look at it @lecaros or give your perspective here? I'll also take a deeper look later. Any chance you can test again @thepythonista to see if it still happens with the latest version?

pwhelan avatar Feb 03 '23 02:02 pwhelan

@pwhelan I don't think this is in a release yet but @thepythonista you can test with the latest version of master container built overnight: ghcr.io/fluent/fluent-bit/master:x86_64 (don't use in production obviously).

patrick-stephens avatar Feb 03 '23 10:02 patrick-stephens

Agree with @patrick-stephens. Please test this with the latest image (this was initially reported in 1.4.2) and share the repro in case you still see the issue.

lecaros avatar Feb 03 '23 13:02 lecaros

We are hitting the same issue using fluent-bit 2.0.9 on Kubernetes 1.24.

Originally our configuration was:

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Tag kube.*
        multiline.parser cri
        Mem_Buf_Limit 200MB
        Skip_Long_Lines Off
        DB /var/log/flb-storage/offset.db
        Buffer_Max_Size 1M

but we ended up with this, which helps a little to reduce the issue:

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Tag kube.*
        multiline.parser cri
        Mem_Buf_Limit 200MB
        Skip_Long_Lines Off
        DB /var/log/flb-storage/offset.db
        Buffer_Max_Size 1M
        Progress_Check_Interval 1ms
        Inotify_Watcher false
        Rotate_Wait 15
        Refresh_Interval 1
        Buffer_Chunk_Size 2k

We made a small test case which might helps pin down the issue:

At first we put our cluster node under a little stress to generate some amount of log traffic on it. Then we deployed our control container, which emits a sample log line every 10ms with a unique incrementing ID, so at the end we could tell exactly which logs are missing. After the test run finished, we simply checked the contents of both the '0.log' (current) and '0.log.<timestamp>' (last rotated) files on the filesystem for our control container. The '0.log.<timestamp>' file ends with log number "76812" and '0.log' starts with log number "76813". Up to this point everything seems fine; the CRI runtime rotates the logs correctly. But if we compare it with Elastic, the records between 76803 and 76825 are missing. It appears that every time the log file is rotated, fluent-bit fails to process some amount of records (at the end and at the beginning of the rotated file). In our test, for the control container this resulted in only about 74,000 of the 80,000 total logs being delivered (6,000 lost).
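
A rough sketch of such a control workload (our own illustration; the image, exact rate and line format in the test above may differ, and sleep 0.01 assumes an implementation that accepts fractional seconds):

#!/bin/sh
# emit ~80000 uniquely numbered lines, one every 10 ms
i=0
while [ "$i" -lt 80000 ]; do
  echo "control-log id=$i"
  i=$((i+1))
  sleep 0.01
done
# afterwards, any id in 0..79999 that never reached the backend is a lost record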

romanfurst avatar Mar 18 '23 00:03 romanfurst

2.0.10 came out last week as well, with some further improvements in it. Threaded input is now possible too, I believe: threaded on.
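
For anyone who wants to try it, threading is enabled per input; a sketch only, to be merged with the existing tail options rather than used as a complete config:

[INPUT]
    Name      tail
    Path      /var/log/containers/*.log
    Tag       kube.*
    # run this tail instance in its own thread (recent 2.x releases)
    Threaded  On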

patrick-stephens avatar Mar 20 '23 10:03 patrick-stephens

With threaded on we are hitting #7071. By the way, right now we are running on v2.1.2, and so far it looks like there is no improvement; the issue with quick file rotation still seems to be relevant.

romanfurst avatar May 25 '23 17:05 romanfurst

Logs may be lost under high load. Fluent Bit relies on inotify to passively receive file change notifications, and for each notification it only reads a few lines. Under high load the data source keeps writing to the file, but Fluent Bit only processes a few lines per notification, so it obviously cannot keep up with the write speed.

liguangcheng avatar Jul 21 '23 14:07 liguangcheng