fluent-bit icon indicating copy to clipboard operation
fluent-bit copied to clipboard

log missing using tail input plugin

Open wangyuan0916 opened this issue 3 years ago • 6 comments

Bug Report

Describe the bug I use tail input plugin to gather and foward container logs in kubernetes cluster with this config file: [INPUT] Name tail Tag kube.* Path /var/log/containers/*.log DB /var/log/flb_kube.db Mem_Buf_Limit 5MB Skip_Long_Lines On Refresh_Interval 10 multiline.parser docker, cri Read_from_Head true

when a new file is created, there's a log in fluent-bit pod like: 2022-02-22T07:52:06.428965848Z stderr F [2022/02/22 07:52:06] [debug] [input:tail:tail.0] inode=262387 with offset=1244 appended as /var/log/containers/log1-ghsmu-syslog-log-sinklog-emitter--1-mjmzl_ns_log-emitter-af21dc3aa54aeacd19b3ba295bdc1d260e27f80e63e931a9e52275dfaa83e2d0.log

It is strange that this is a new file, but the offset is not 0, which will actually lead fluentbit to read from offset=1244 and miss logs ahead of this offset. I found fluentbit used inode to check from db to get this offset. Maybe inode=262387 is used before when other file was created but reuse this number when 'log1-ghsmu-syslog-log-sinklog-emitter' is created. I don't think it is by design when Read_from_Head=true.

FB version : 1.8.10

wangyuan0916 avatar Feb 22 '22 03:02 wangyuan0916

Using the same configuration as above, I found if the log emitter job print just a few lines then complete quickly, fluent-bit cannot observe the existence of this new file, especially a lot of new files creating at the same time. Some files are successfully observed by fluent-bit, because fluent-bit prints logs like: [input:tail:tail.0] inotify_fs_add(): inode=262450 watch_fd=7 name=/var/log/containers/log1-xjbhf-syslog-log-sinklog-emitter--1-jvgrj_ns_log-emitter-cd1985fee31c955e43276682038c9c60509996d387dc91ebcd52ab013671b341.log While for other files not observed by fluent-bit, there are no logs like this.

wangyuan0916 avatar Feb 23 '22 01:02 wangyuan0916

With td-agent-bit version 1.7.8. I am seeing same issue (meaning offset is not 0 for newly created files which is causing log lines getting missed) on both scenarios 1) fluent-bit process started after log files being created (Read_from_Head =true) in config 2) log files created newly after fluent-bit process running for while. Here is the fluent-bit config file [SERVICE] #Default service flush interval is 15 seconds ${SERVICE_FLUSH_INTERVAL} HTTP_Server Off Daemon Off storage.path /var/opt/microsoft/docker-cimprov/state/flbstore/ storage.sync normal storage.checksum off storage.backlog.mem_limit 10M Log_Level info Parsers_File /etc/opt/microsoft/docker-cimprov/azm-containers-parser.conf Log_File /var/opt/microsoft/docker-cimprov/log/fluent-bit.log

[INPUT] Name tail Tag oms.container.log.la.* Path ${AZMON_LOG_TAIL_PATH} Read_from_Head true DB /var/log/omsagent-fblogs.db DB.Sync Off Parser docker ${TAIL_MEM_BUF_LIMIT} ${TAIL_BUFFER_CHUNK_SIZE} ${TAIL_BUFFER_MAX_SIZE} Rotate_Wait 20 Refresh_Interval 30 Path_Key filepath Skip_Long_Lines On Ignore_Older 5m Exclude_Path ${AZMON_CLUSTER_LOG_TAIL_EXCLUDE_PATH}

Repro - Deploy this job in 10 to 20 test namespaces and observe fluent-bit gets the logs from few containers which has offset=0 in the fluent-bit log file and logs missed from the containers which has non-zero offset. The issue, fluent-bit doesnt start with 0 offset for newly created files.

apiVersion: batch/v1 kind: Job metadata: name: pi-with-ttl spec: template: spec: containers: - name: command-demo-container image: debian command: ["printenv"] args: ["HOSTNAME"] nodeSelector: kubernetes.io/os: linux restartPolicy: Never

ganga1980 avatar Apr 25 '22 19:04 ganga1980

This issue happens with latest version i.e. 1.9.3 of the Fluent bit as well. I have upgraded fluent bit version from 1.7.8 to 1.9.3 and issue persists and issue here is the collision/reuse of inode value. If the inode value of newly created file matches with existing entry in the fluent-bit DB for deleted different file (or potentially this can be existed file as well) then the fluent-bit tail plugin gets wrong offset from DB which causes the logs getting missed.

image

root@omsagent-k567l:~# dpkg -l | grep td-agent-bit ii td-agent-bit 1.9.3 amd64 Fast data collector for Linux

@edsiper , Seems like this issue still exists in 1.9.3 even with fix https://github.com/fluent/fluent-bit/commit/f3293454f98db75c5805d6484db52d8c128cbc1b. I didnt see this in the release notes of 1.9.3 so wondering whether this fix included in 1.9.3 https://fluentbit.io/announcements/v1.9.3/ release or not? Why the fluent-bit (with version 1.9.3 after upgrade) not purging deleted file entries from the DB? is this expected behavior for the upgrade scenario? or is this issue fixed only in 1.9.3 (meaning, not upgrade scenario) ?

ganga1980 avatar May 02 '22 23:05 ganga1980

@edsiper, appreciate your inputs on this and let me know if you need any additional details.

ganga1980 avatar May 04 '22 03:05 ganga1980

/track @ganga1980

ganga1980 avatar May 09 '22 03:05 ganga1980

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Aug 08 '22 02:08 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Aug 14 '22 02:08 github-actions[bot]

/reopen

JeffLuoo avatar Apr 26 '23 19:04 JeffLuoo

@ganga1980 I have the same problem. Have you solved this problem? If so, can you give me some hints? Thank you so much!

gorexlv avatar May 29 '23 08:05 gorexlv

Hi, All,

Yes, we got the similar problem. Previously, it is easier to reproduce this issue in our lab. after apply below code, no log truncated any more. @ganga1980 and @edsiper would you review it?

/* Delete file entry from the database */
int flb_tail_db_file_delete_by_id(uint64_t id,
                            const char *file_name,
                            struct flb_tail_config *ctx)
{
    int ret;

    /* Bind parameters */
    sqlite3_bind_int64(ctx->stmt_delete_file, 1, id);
    ret = sqlite3_step(ctx->stmt_delete_file);

    sqlite3_clear_bindings(ctx->stmt_delete_file);
    sqlite3_reset(ctx->stmt_delete_file);

    if (ret != SQLITE_DONE) {
        flb_plg_error(ctx->ins, "db: error deleting entry from database: %s",
                      file_name);
        return -1;
    }

    flb_plg_debug(ctx->ins, "db: file deleted from database: %s", file_name);
    return 0;
}

static int db_file_exists(struct flb_tail_file *file,
                          struct flb_tail_config *ctx,
                          uint64_t *id, uint64_t *inode, off_t *offset)
{
    int ret;
    int exists = FLB_FALSE;

    /* Bind parameters */
    sqlite3_bind_int64(ctx->stmt_get_file, 1, file->inode);
    ret = sqlite3_step(ctx->stmt_get_file);

    if (ret == SQLITE_ROW) {
        exists = FLB_TRUE;

        /* id: column 0 */
        *id = sqlite3_column_int64(ctx->stmt_get_file, 0);

        /* name: column 1 */
        const char *existed_file_name = sqlite3_column_text(ctx->stmt_get_file, 1);

        /* offset: column 2 */
        *offset = sqlite3_column_int64(ctx->stmt_get_file, 2);

        /* inode: column 3 */
        *inode = sqlite3_column_int64(ctx->stmt_get_file, 3);

        /* if file name is different from the sqllite staled filename, which means this is a a new file
            shared the same inodes. It is necessary to remove the staled file record in sqllite. 
            It is a new file, just read from head, without offset */
        if (strcmp(existed_file_name, file->name) != 0) {
            flb_tail_db_file_delete_by_id(*id, existed_file_name, ctx);
            exists = FLB_FALSE;
        }
    }
    else if (ret == SQLITE_DONE) {
        /* all good */
    }
    else {
        exists = -1;
    }

    sqlite3_clear_bindings(ctx->stmt_get_file);
    sqlite3_reset(ctx->stmt_get_file);

    return exists;

}

hsingli20 avatar Jun 09 '23 03:06 hsingli20

@hsingli20 You can follow https://docs.fluentbit.io/manual/development/developer-guide and make a pull request directly. It's better than a comment review : )

JeffLuoo avatar Jun 12 '23 18:06 JeffLuoo