
Multiline parsing - how to properly set the value of flush_timeout

emmacz opened this issue 1 year ago · 3 comments

Bug Report

Describe the bug: We are unable to configure the multiline parser to reliably capture all related lines into one record. The commonly used flush_timeout of 1000 ms does not work; 50 000 ms looks better, but we still see some lines captured as separate records; 500 000 ms works best, but we do not know whether such a high value is safe to use.

To Reproduce

  • Rubular link if applicable:
  1. start_state regex: https://rubular.com/r/Q1Htb8QfXZF4kx
  2. cont regex: https://rubular.com/r/WTK4kyxXiX660i
  • Example log message if applicable: records 1, 2, 3, 4, 5, 8, 10, 12, 13, 15 below are produced correctly (the multiline parser works); records 6, 7, 9, 11, 14, 16 show wrong parsing (they do not begin with a *** timestamp header):
1 ls_asm_trace: [1706645085.194947000, {"fbpfx_original_log":"*** 2024-01-23T08:02:35.013371+00:00\n*** SESSION ID:(1179.51634) 2024-01-23T08:02:35.0>

2 ls_asm_trace: [1706645085.195222000, {"fbpfx_original_log":"*** 2024-01-23T08:02:35.864047+00:00\n2024-01-23 08:02:35.864 : [GIPCXCPT][GIPCXCPT] gi>

3 ls_asm_trace: [1706645085.444304000, {"fbpfx_original_log":"*** 2024-01-23T08:02:36.949221+00:00\n2024-01-23 08:02:36.949 : kxdam_process_req_msg: >

4 ls_asm_trace: [1706645085.444327000, {"fbpfx_original_log":"*** 2024-01-23T08:14:44.687535+00:00\n2024-01-23 08:14:44.687 : kxdam_alloc_cell: Cell >

5 ls_asm_trace: [1706645086.446246000, {"fbpfx_original_log":"*** 2024-01-23T08:14:46.741290+00:00\n2024-01-23 08:14:46.741 : kxdam_open_cell: cell o>

6 ls_asm_trace: [1706645086.446246000, {"fbpfx_original_log":"2024-01-23 08:14:46.742 : kxdam_dscv_fetch: Discovered disk: DATAC1_CD_11_phx107315exdc>

7 ls_asm_trace: [1706645086.446246000, {"fbpfx_original_log":"2024-01-23 08:14:46.743 : kxdam_slave_wreq_create_one: Queued Slave request 0xfeefc120 >

8 ls_asm_trace: [1706645087.944565000, {"fbpfx_original_log":"*** 2024-01-23T08:14:48.744854+00:00\n2024-01-23 08:14:48.744 : kxdam_open_cell: cell o>

9 ls_asm_trace: [1706645087.944565000, {"fbpfx_original_log":"  Cell Name   : o/100.106.0.24;100.106.0.25/ (cellp=0xffd542d8)\n  ASM Name    : RECOC1>

10 ls_asm_trace: [1706645089.197526000, {"fbpfx_original_log":"*** 2024-01-23T08:14:50.749564+00:00\n2024-01-23 08:14:50.749 : kxdam_open_cell: cell o>

11 ls_asm_trace: [1706645089.197526000, {"fbpfx_original_log":"  ASM FG Name : PHX107315EXDCL13\n2024-01-23 08:14:50.750 : kxdam_dscv_fetch: Discovere>

12 ls_asm_trace: [1706645090.455490000, {"fbpfx_original_log":"*** 2024-01-23T08:14:52.753403+00:00\n2024-01-23 08:14:52.753 : kxdam_open_cell: cell o>

13 ls_asm_trace: [1706645091.695399000, {"fbpfx_original_log":"*** 2024-01-23T08:14:54.757874+00:00\n2024-01-23 08:14:54.757 : kxdam_open_cell: cell o>

14 ls_asm_trace: [1706645091.695399000, {"fbpfx_original_log":"2024-01-23 08:14:54.758 : kxdam_dscv_fetch: Discovered disk: DATAC1_CD_11_phx107315exdc>

15 ls_asm_trace: [1706645092.694059000, {"fbpfx_original_log":"*** 2024-01-23T08:14:56.762787+00:00\n2024-01-23 08:14:56.762 : kxdam_open_cell: cell o>

16 ls_asm_trace: [1706645092.694059000, {"fbpfx_original_log":"  ASM DG Name : DATAC1\n  ASM FG Name : PHX107218EXDCL01\n2024-01-23 08:14:56.763 : kxd>
  • Steps to reproduce the problem:

  • Download and unzip attached file asm.test.zip (contains ASM3_xdmg_61808.trc)

    • file will be provided later
  • Create config files: fb.conf and parser.conf, see below

  • Open 3 terminal sessions

  • In 1st session run FB

    • fluent-bit -c fb.conf -R parser.conf
  • In 2nd session run

    • cat /dev/null >ASM_new.trc; cat /dev/null >ls_asm_trace
    • head -650000 ASM3_xdmg_61808.trc | while IFS= read data; do sleep 0.000001 && echo "$data" >> ASM_new.trc; done
  • In 3rd session run

    • less -S +F -N ls_asm_trace
  • The file parser.conf contains lines with flush_timeout. Test each value 3 times and compare results (a consolidated sketch of this sweep appears after this list).

  • The best results are returned when 500 000 ms is used

  • The worst results are returned when 1000 ms is used

  • No value of flush_timeout has been found so far that reliably produces proper multiline records
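
A minimal consolidated sketch of the sweep described above (assumes bash, GNU sed, and that fb.conf, parser.conf and ASM3_xdmg_61808.trc are in the current directory; the loop body mirrors the three sessions from the steps):

#!/usr/bin/env bash
set -eu

for timeout in 1000 50000 500000; do
    # Point the parser at the flush_timeout value under test
    sed -i "s/^\( *flush_timeout \).*/\1${timeout}/" parser.conf

    cat /dev/null > ASM_new.trc
    cat /dev/null > ls_asm_trace

    fluent-bit -c fb.conf -R parser.conf &
    fb_pid=$!

    # Replay the trace line by line, as in the 2nd session
    head -650000 ASM3_xdmg_61808.trc | while IFS= read -r data; do
        sleep 0.000001 && echo "$data" >> ASM_new.trc
    done

    sleep 10            # rough grace period for the pending record to flush
    kill "$fb_pid"

    echo "flush_timeout=${timeout}: $(wc -l < ls_asm_trace) records emitted"
done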

Expected behavior: Here is an example of the log content (see attached file: ASM3_xdmg_61808.trc) that we want to parse with the multiline parser:

*** 2024-01-23T08:02:35.864047+00:00
2024-01-23 08:02:35.864 : [GIPCXCPT][GIPCXCPT] gipcInternalConnectSync: failed sync request, addr 0x1858b3d0 [0000000000000051] { gipcAddress : name
 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, ret gipcretConnectionRefused (29)
2024-01-23 08:02:35.864 : [GIPCXCPT][GIPCXCPT] gipcConnectSyncF [clscrsconGipcConnect : clscrscon.c : 700]: EXCEPTION[ ret gipcretConnectionRefused
(29) ]  failed sync connect endp 0x1858c760 [000000000000004e] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-0000
0000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', numPend 0, numReady 0, numDone 0, numDead 0
, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x185847e0, sendp 0x185831f0 status 13flags 0xa108071a, flags-2 0x0, usrFlag
s 0x0 }, addr 0x1858b3d0 [0000000000000051] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))',
objFlags 0x0, addrFlags 0x4 }, flags 0x0
2024-01-23 08:02:35.867 : [GIPCXCPT][GIPCXCPT] gipcInternalConnectSync: failed sync request, addr 0x18578f60 [000000000000007f] { gipcAddress : name
 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, ret gipcretConnectionRefused (29)
2024-01-23 08:02:35.867 : [GIPCXCPT][GIPCXCPT] gipcConnectSyncF [clscrsconGipcConnect : clscrscon.c : 700]: EXCEPTION[ ret gipcretConnectionRefused
(29) ]  failed sync connect endp 0x1857a2d0 [000000000000007c] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-0000
0000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', numPend 0, numReady 0, numDone 0, numDead 0
, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x1859a070, sendp 0x185798b0 status 13flags 0xa108071a, flags-2 0x0, usrFlag
s 0x0 }, addr 0x18578f60 [000000000000007f] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))',
objFlags 0x0, addrFlags 0x4 }, flags 0x0
2024-01-23 08:02:35.966 : kxdam_online_all: Operation ID: 0:   in diskgroup  Failed.
  SQL    : /* Exadata Auto Mgmt: Select mounted diskgroups. */
select name from v$asm_diskgroup_stat
  where
    state in ('MOUNTED', 'RESTRICTED')

  Cause  : OCI call to execute sql failed.
  Action : Contact Oracle support.

*** 2024-01-23T08:02:36.949221+00:00
2024-01-23 08:02:36.949 : kxdam_process_req_msg: Queued DG DATAC1 mount msg for processing
2024-01-23 08:02:36.949 : kxdam_process_req_msg: Queued DG RECOC1 mount msg for processing

*** 2024-01-23T08:14:44.687535+00:00
2024-01-23 08:14:44.687 : kxdam_alloc_cell: Cell o/100.100.100.100;100.100.100.100/ (cellp=0xffd54188) alloc'd.
2024-01-23 08:14:44.687 : kxdam_alloc_cell: Cell o/100.100.100.100;100.100.100.100/ (cellp=0xffd54230) alloc'd.
2024-01-23 08:14:44.687 : kxdam_alloc_cell: Cell o/100.100.100.100;100.100.100.100/ (cellp=0xffd542d8) alloc'd.

All lines between two 'timestamp' lines (example: *** 2024-01-23T08...) must be captured as one multiline record, and in the JSON produced by Fluent Bit we should see them as a single record starting with the timestamp, see: {"fbpfx_original_log":"*** 2024-01-23T08:16:29.744111+00:00\nOSSIPC:SKGXP:[18509170.1]{0} .....

There should be an easy way for a user to find the proper value of the flush_timeout parameter for the multiline parser. If more parameters need to be tuned, the documentation should list them along with the steps to follow.

Alternatively, some automation (machine learning, a pre-analysis script that inspects the structure of the file, etc.) could help suggest the right value. So far there is only a short note about flush_timeout in the documentation: https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/multiline-parsing
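
In the absence of such guidance, a simple objective check per run (a sketch based on the record format shown above, using the tag-named output file written by the file output): the number of emitted records should equal the number of *** header lines in the input, and every record's captured content should begin with such a header.

# One multiline record is expected per "*** <ISO timestamp>" header line
expected=$(grep -cE '^\*\*\* [0-9]{4}-[0-9]{2}-[0-9]{2}T' ASM3_xdmg_61808.trc)

# Emitted records whose captured content begins with such a header
correct=$(grep -c 'fbpfx_original_log":"\*\*\* ' ls_asm_trace)

# Total records written by the file output (one line per record)
total=$(wc -l < ls_asm_trace)

echo "expected=${expected} correct=${correct} total=${total}"
# On a clean run all three numbers match; correct < total indicates records
# that were split off mid-block.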

Screenshots: (see attached image)

Your Environment

  • Version used: 2.2.2
  • Configuration: see configs: fb.conf and parser.conf
  • Environment name and version (e.g. Kubernetes? What version?): Oracle ASM trace files
  • Server type and version: n/a
  • Operating System and version: Linux
  • Filters and plugins: see configs below

Additional context

Fluent Bit configuration files used:

#fb.conf
[SERVICE]
    Log_Level    debug
    HTTP_PORT    2020
    HTTP_Server  On
 
[INPUT]
    Name                tail
    Refresh_Interval    5
    Path                ./ASM*.trc
    Tag                 ls_asm_trace
    Multiline.parser    multiline-ls_asm_trace
    Skip_Empty_Lines    On
 
[FILTER]
    name modify
    match ls_asm_trace
    rename log fbpfx_original_log
 
[OUTPUT]
    Name stdout
    Match *
[OUTPUT]
    Name file
    Match *
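    # With no File/Path configured, out_file writes to a file named after the
    # tag (ls_asm_trace) in the current directory; this is the file followed
    # with less in the 3rd session above.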
 
 
#parser.conf
[MULTILINE_PARSER]
    name          multiline-ls_asm_trace
    type          regex
    #    flush_timeout 500000
    flush_timeout 50000
    #    flush_timeout 1000
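    # rule syntax: rule <state_name> <regex> <next_state>
    # A line matching "start_state" opens a new record; each following line
    # that is not a "***" header matches "cont" and is appended. flush_timeout
    # (in milliseconds) is how long the engine waits for a further matching
    # line before flushing the pending record as complete.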
    rule          "start_state" "/^\*\*\*\s*\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6}\s*[\-\+]\d{2}:\d{2}\s*/"                  "cont"
    rule          "cont"        "/^(?!^\*\*\*\s*\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6}\s*[\-\+]\d{2}:\d{2}\s*)(.*)/"         "cont"

Requirement

Community: please share your experience with multiline parsing of huge files containing jstack output, dumps, etc.

Developers: as shown above, the parsing result depends heavily on the flush_timeout value. Could we get detailed feedback from you, and can the current unpredictable behavior be changed?

emmacz · Jan 30 '24 21:01

Hello @emmacz. Thanks for reporting this behavior; I was able to reproduce the issue using your configuration by following your instructions.

From your log sample:

  • There are at least 500 lines in each multiline record.
  • The size of these 500 lines is about 56 KB.
  • When repeating the test with the same config and the same set of logs, the multiline records are split at different lines on each run.
  • The multiline records split more often when a low flush_timeout value is set.
  • When extending flush_timeout beyond 50,000-100,000 ms the problem still occurs, but less frequently than with a shorter timeout such as 20,000 or 30,000 ms.
  • The timestamp added by Fluent Bit to both parts of a broken multiline record is the same.
  • I confirmed that the regex parser works fine; it matches both the start and the continuation of a multiline record.

@edsiper is there any restriction or internal buffer limit for a multiline record?
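
While waiting for an answer, one variable worth ruling out (an untested hypothesis, not a confirmed cause) is the tail input's read buffers: Buffer_Chunk_Size and Buffer_Max_Size default to 32k, which is below the ~56 KB record size observed above. Raising them is a cheap experiment:

[INPUT]
    Name                tail
    Refresh_Interval    5
    Path                ./ASM*.trc
    Tag                 ls_asm_trace
    Multiline.parser    multiline-ls_asm_trace
    Skip_Empty_Lines    On
    # Untested guess: both buffers default to 32k, below the ~56k record size
    Buffer_Chunk_Size   256k
    Buffer_Max_Size     1M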

RicardoAAD · Feb 02 '24 14:02

@RicardoAAD, @edsiper: any update on this issue, please?

emmacz · Feb 20 '24 13:02

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] · May 21 '24 01:05

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] · Jun 08 '24 01:06