fluent-bit
Multiline parsing - How to properly set value for flush_timeout
Bug Report
Describe the bug We are not able to configure the multiline parser to reliably capture all related lines into one record. The commonly used value of 1000 ms for flush_timeout does not work; 50,000 ms looks better, but we still see some lines captured as separate records; 500,000 ms works best, but we do not know whether it is safe to use such a high value.
To Reproduce
- Rubular link if applicable:
- start_state regex: https://rubular.com/r/Q1Htb8QfXZF4kx
- cont regex: https://rubular.com/r/WTK4kyxXiX660i
- Example log message if applicable: Correctly parsed lines (multiline parser works): 1, 2, 3, 4, 5, 8, 10, 12, 13, 15. Wrongly parsed lines: 6, 7, 9, 11, 14, 16.
1 ls_asm_trace: [1706645085.194947000, {"fbpfx_original_log":"*** 2024-01-23T08:02:35.013371+00:00\n*** SESSION ID:(1179.51634) 2024-01-23T08:02:35.0>
2 ls_asm_trace: [1706645085.195222000, {"fbpfx_original_log":"*** 2024-01-23T08:02:35.864047+00:00\n2024-01-23 08:02:35.864 : [GIPCXCPT][GIPCXCPT] gi>
3 ls_asm_trace: [1706645085.444304000, {"fbpfx_original_log":"*** 2024-01-23T08:02:36.949221+00:00\n2024-01-23 08:02:36.949 : kxdam_process_req_msg: >
4 ls_asm_trace: [1706645085.444327000, {"fbpfx_original_log":"*** 2024-01-23T08:14:44.687535+00:00\n2024-01-23 08:14:44.687 : kxdam_alloc_cell: Cell >
5 ls_asm_trace: [1706645086.446246000, {"fbpfx_original_log":"*** 2024-01-23T08:14:46.741290+00:00\n2024-01-23 08:14:46.741 : kxdam_open_cell: cell o>
6 ls_asm_trace: [1706645086.446246000, {"fbpfx_original_log":"2024-01-23 08:14:46.742 : kxdam_dscv_fetch: Discovered disk: DATAC1_CD_11_phx107315exdc>
7 ls_asm_trace: [1706645086.446246000, {"fbpfx_original_log":"2024-01-23 08:14:46.743 : kxdam_slave_wreq_create_one: Queued Slave request 0xfeefc120 >
8 ls_asm_trace: [1706645087.944565000, {"fbpfx_original_log":"*** 2024-01-23T08:14:48.744854+00:00\n2024-01-23 08:14:48.744 : kxdam_open_cell: cell o>
9 ls_asm_trace: [1706645087.944565000, {"fbpfx_original_log":" Cell Name : o/100.106.0.24;100.106.0.25/ (cellp=0xffd542d8)\n ASM Name : RECOC1>
10 ls_asm_trace: [1706645089.197526000, {"fbpfx_original_log":"*** 2024-01-23T08:14:50.749564+00:00\n2024-01-23 08:14:50.749 : kxdam_open_cell: cell o>
11 ls_asm_trace: [1706645089.197526000, {"fbpfx_original_log":" ASM FG Name : PHX107315EXDCL13\n2024-01-23 08:14:50.750 : kxdam_dscv_fetch: Discovere>
12 ls_asm_trace: [1706645090.455490000, {"fbpfx_original_log":"*** 2024-01-23T08:14:52.753403+00:00\n2024-01-23 08:14:52.753 : kxdam_open_cell: cell o>
13 ls_asm_trace: [1706645091.695399000, {"fbpfx_original_log":"*** 2024-01-23T08:14:54.757874+00:00\n2024-01-23 08:14:54.757 : kxdam_open_cell: cell o>
14 ls_asm_trace: [1706645091.695399000, {"fbpfx_original_log":"2024-01-23 08:14:54.758 : kxdam_dscv_fetch: Discovered disk: DATAC1_CD_11_phx107315exdc>
15 ls_asm_trace: [1706645092.694059000, {"fbpfx_original_log":"*** 2024-01-23T08:14:56.762787+00:00\n2024-01-23 08:14:56.762 : kxdam_open_cell: cell o>
16 ls_asm_trace: [1706645092.694059000, {"fbpfx_original_log":" ASM DG Name : DATAC1\n ASM FG Name : PHX107218EXDCL01\n2024-01-23 08:14:56.763 : kxd>
Steps to reproduce the problem:
- Download and unzip the attached file asm.test.zip (contains ASM3_xdmg_61808.trc) - file will be provided later
- Create the config files fb.conf and parser.conf, see below
- Open 3 terminal sessions
- In the 1st session, run Fluent Bit:
  fluent-bit -c fb.conf -R parser.conf
- In the 2nd session, run:
  cat /dev/null > ASM_new.trc; cat /dev/null > ls_asm_trace
  head -650000 ASM3_xdmg_61808.trc | while IFS= read data; do sleep 0.000001 && echo "$data" >> ASM_new.trc; done
- In the 3rd session, run:
  less -S +F -N ls_asm_trace
- The file parser.conf contains lines with different flush_timeout values. Test each value 3 times and compare the results.
- The best results are returned when 500,000 ms is used.
- The worst results are returned when 1000 ms is used.
- No value of flush_timeout found so far produces proper multiline records.
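For reference, the "replay the trace one line at a time" feed step above can be sketched as a small self-contained script (a Python equivalent of the shell loop, not part of the original reproduction; `sample.trc` below is a tiny synthetic stand-in for ASM3_xdmg_61808.trc, which is not attached here):

```python
import time

# Synthetic stand-in for ASM3_xdmg_61808.trc (the real trace is attached
# to the issue, not reproduced here).
with open("sample.trc", "w") as f:
    f.write("*** 2024-01-23T08:02:35.013371+00:00\n")
    f.write("2024-01-23 08:02:35.864 : [GIPCXCPT][GIPCXCPT] gipcInternalConnectSync: ...\n")
    f.write("2024-01-23 08:02:35.966 : kxdam_online_all: Operation ID: 0: in diskgroup Failed.\n")

# Reset the file that Fluent Bit tails (equivalent to: cat /dev/null > ASM_new.trc).
open("ASM_new.trc", "w").close()

# Replay line by line with a tiny delay, so the tail input sees lines
# arrive over time rather than all at once.
with open("sample.trc") as src, open("ASM_new.trc", "a") as dst:
    for line in src:
        time.sleep(0.000001)
        dst.write(line)
        dst.flush()

replayed = open("ASM_new.trc").read().count("\n")
print("lines replayed:", replayed)
```

The per-line delay matters for reproducing the bug: feeding the whole file at once makes the flush_timeout race much less likely to show up.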
Expected behavior Here is an example of the log content (see attached file: ASM3_xdmg_61808.trc) we want to parse using the multiline parser:
*** 2024-01-23T08:02:35.864047+00:00
2024-01-23 08:02:35.864 : [GIPCXCPT][GIPCXCPT] gipcInternalConnectSync: failed sync request, addr 0x1858b3d0 [0000000000000051] { gipcAddress : name
'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, ret gipcretConnectionRefused (29)
2024-01-23 08:02:35.864 : [GIPCXCPT][GIPCXCPT] gipcConnectSyncF [clscrsconGipcConnect : clscrscon.c : 700]: EXCEPTION[ ret gipcretConnectionRefused
(29) ] failed sync connect endp 0x1858c760 [000000000000004e] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-0000
0000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', numPend 0, numReady 0, numDone 0, numDead 0
, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x185847e0, sendp 0x185831f0 status 13flags 0xa108071a, flags-2 0x0, usrFlag
s 0x0 }, addr 0x1858b3d0 [0000000000000051] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))',
objFlags 0x0, addrFlags 0x4 }, flags 0x0
2024-01-23 08:02:35.867 : [GIPCXCPT][GIPCXCPT] gipcInternalConnectSync: failed sync request, addr 0x18578f60 [000000000000007f] { gipcAddress : name
'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, ret gipcretConnectionRefused (29)
2024-01-23 08:02:35.867 : [GIPCXCPT][GIPCXCPT] gipcConnectSyncF [clscrsconGipcConnect : clscrscon.c : 700]: EXCEPTION[ ret gipcretConnectionRefused
(29) ] failed sync connect endp 0x1857a2d0 [000000000000007c] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-0000
0000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', numPend 0, numReady 0, numDone 0, numDead 0
, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x1859a070, sendp 0x185798b0 status 13flags 0xa108071a, flags-2 0x0, usrFlag
s 0x0 }, addr 0x18578f60 [000000000000007f] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))',
objFlags 0x0, addrFlags 0x4 }, flags 0x0
2024-01-23 08:02:35.966 : kxdam_online_all: Operation ID: 0: in diskgroup Failed.
SQL : /* Exadata Auto Mgmt: Select mounted diskgroups. */
select name from v$asm_diskgroup_stat
where
state in ('MOUNTED', 'RESTRICTED')
Cause : OCI call to execute sql failed.
Action : Contact Oracle support.
*** 2024-01-23T08:02:36.949221+00:00
2024-01-23 08:02:36.949 : kxdam_process_req_msg: Queued DG DATAC1 mount msg for processing
2024-01-23 08:02:36.949 : kxdam_process_req_msg: Queued DG RECOC1 mount msg for processing
*** 2024-01-23T08:14:44.687535+00:00
2024-01-23 08:14:44.687 : kxdam_alloc_cell: Cell o/100.100.100.100;100.100.100.100/ (cellp=0xffd54188) alloc'd.
2024-01-23 08:14:44.687 : kxdam_alloc_cell: Cell o/100.100.100.100;100.100.100.100/ (cellp=0xffd54230) alloc'd.
2024-01-23 08:14:44.687 : kxdam_alloc_cell: Cell o/100.100.100.100;100.100.100.100/ (cellp=0xffd542d8) alloc'd.
All lines between two 'timestamp' lines (example: *** 2024-01-23T08...) must be captured as one multiline record, and in the JSON produced by Fluent Bit we should see them as one line starting with the timestamp, see:
{"fbpfx_original_log":"*** 2024-01-23T08:16:29.744111+00:00\nOSSIPC:SKGXP:[18509170.1]{0} .....
There must be an easy way for a user to find the proper value of the flush_timeout parameter for the multiline parser. If there are more parameters to tune, the list of them and the steps to follow need to be documented.
Or some automation (AI or machine learning, a pre-analysis script executed to understand the structure of the file, etc.) should help advise the right value. So far there is only a short note about flush_timeout in the docs: https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/multiline-parsing
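A pre-analysis script of the kind suggested above could be as simple as the following sketch (a hypothetical helper, not an existing Fluent Bit tool): it scans the trace with the start_state regex and reports lines and bytes per record, which gives a rough idea of how large the records are that flush_timeout has to keep open.

```python
import re

# start_state regex from parser.conf: a "*** <ISO timestamp>" line opens a record.
START = re.compile(
    r"^\*\*\*\s*\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6}\s*[\-\+]\d{2}:\d{2}"
)

def record_stats(lines):
    """Return (line_count, byte_size) per multiline record. Leading lines
    before the first start line are folded into the first record, which
    is good enough for a rough pre-analysis."""
    stats, count, size = [], 0, 0
    for line in lines:
        if START.match(line) and count:
            stats.append((count, size))
            count, size = 0, 0
        count += 1
        size += len(line.encode()) + 1  # +1 for the stripped newline
    if count:
        stats.append((count, size))
    return stats

sample = [
    "*** 2024-01-23T08:02:35.013371+00:00",
    "*** SESSION ID:(1179.51634) 2024-01-23T08:02:35.0",  # '***' but no timestamp: continuation
    "*** 2024-01-23T08:02:36.949221+00:00",
    "2024-01-23 08:02:36.949 : kxdam_process_req_msg: Queued DG DATAC1 mount msg for processing",
]

stats = record_stats(sample)
print("per-record (lines, bytes):", stats)
print("max lines in one record:", max(c for c, _ in stats))
```

Running this over the real trace would show figures like the "at least 500 lines / about 56 KB per record" that the maintainer measured below, and a user could size flush_timeout (and any buffer limits) from the worst-case record instead of guessing.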
Screenshots
Your Environment
- Version used: 2.2.2
- Configuration: see configs: fb.conf and parser.conf
- Environment name and version (e.g. Kubernetes? What version?): Oracle ASM trace files
- Server type and version: n/a
- Operating System and version: Linux
- Filters and plugins: see configs below
Additional context
Fluent Bit configuration files used:
#fb.conf
[SERVICE]
Log_Level debug
HTTP_PORT 2020
HTTP_Server On
[INPUT]
Name tail
Refresh_Interval 5
Path ./ASM*.trc
Tag ls_asm_trace
Multiline.parser multiline-ls_asm_trace
Skip_Empty_Lines On
[FILTER]
name modify
match ls_asm_trace
rename log fbpfx_original_log
[OUTPUT]
Name stdout
Match *
[OUTPUT]
Name file
Match *
#parser.conf
[MULTILINE_PARSER]
name multiline-ls_asm_trace
type regex
# flush_timeout 500000
flush_timeout 50000
# flush_timeout 1000
rule "start_state" "/^\*\*\*\s*\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6}\s*[\-\+]\d{2}:\d{2}\s*/" "cont"
rule "cont" "/^(?!^\*\*\*\s*\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6}\s*[\-\+]\d{2}:\d{2}\s*)(.*)/" "cont"
Requirement Community: Please share your experience with multiline parsing of huge files containing jstack output, dumps, etc.
Developers: As shown above, the parsing strictly depends on the value of the flush_timeout parameter. Could we get detailed feedback from you, and changes to the current unpredictable behavior?
Hello @emmacz. Thanks for reporting this behavior; I was able to reproduce the issue using your configuration by following your instructions.
From your log sample:
- Each multiline record contains at least 500 lines.
- The size of these 500 lines is about 56 KB.
- When repeating the test with the same config and the same set of logs, multiline records are split at different lines on each run.
- The multiline records split more often when a low flush_timeout value is set.
- When extending flush_timeout beyond 50,000-100,000 ms, the problem may still occur, but less frequently than with a shorter timeout such as 20,000 or 30,000 ms.
- The timestamp added by Fluent Bit to both parts of a broken multiline record is the same.
- I confirmed that the regex parser works fine; it matches both the start and the continuation of a multiline record.
@edsiper is there any restriction or internal buffer limit for a multiline record?
@RicardoAAD , @edsiper any update on this issue, please?
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.