parser: regex: Do not skip empty regex group matches
Regular Expression Parser is skipping empty values #1486
Unlike the other parsers, the regex parser omits empty groups from its output.
Sample setup:
$ cat sample.in
{"log": "{\"time_local\":\"2019-07-31T21:17:15\",\"client_ip\":\"\"}"}
$ cat sample.conf
[SERVICE]
Flush 5
Parsers_File parsers.conf
[INPUT]
Name stdin
[FILTER]
Name parser
Parser json_regex
Match *
Key_Name log
Reserve_Data On
Preserve_Key On
[OUTPUT]
Name stdout
Format json_lines
$ cat parsers.conf
[PARSER]
Name json_regex
Format regex
Regex ^{"time_local":"(?<time_local>.*?)","client_ip":"(?<client_ip>.*?)"}$
Output with this patch applied:
$ cat sample.in | bin/fluent-bit -c sample.conf -p parsers.conf
Fluent Bit v1.4.0
Copyright (C) Treasure Data
[2020/01/27 10:21:55] [ info] [storage] initializing...
[2020/01/27 10:21:55] [ info] [storage] in-memory
[2020/01/27 10:21:55] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/01/27 10:21:55] [ info] [engine] started (pid=8468)
[2020/01/27 10:21:55] [ info] [sp] stream processor started
[2020/01/27 10:21:55] [ warn] [in_stdin] end of file (stdin closed by remote end)
[2020/01/27 10:21:55] [ info] [input] pausing stdin.0
{"date":1580084515.652593,"time_local":"2019-07-31T21:17:15","client_ip":"","log":"{\"time_local\":\"2019-07-31T21:17:15\",\"client_ip\":\"\"}"}
[2020/01/27 10:21:55] [ warn] [engine] service will stop in 5 seconds
[2020/01/27 10:21:59] [ info] [engine] service stopped
Without this change, "client_ip":"" would be missing from the output.
One hazard of this change is that the output no longer distinguishes a group that matched an empty string from an optional group that did not take part in the match at all.
For example:
$ cat parsers2.conf
[PARSER]
Name json_regex
Format regex
Regex ^{"time_local":"(?<time_local>.*?)"(,"client_ip":"(?<client_ip>.*?)")?}$
$ cat sample2.in
{"log": "{\"time_local\":\"2019-07-31T21:17:15\",\"client_ip\":\"\"}"}
{"log": "{\"time_local\":\"2019-07-31T21:17:15\"}"}
$ cat sample2.in | bin/fluent-bit -c sample2.conf -p parsers2.conf
Fluent Bit v1.4.0
Copyright (C) Treasure Data
[2020/01/27 10:31:24] [ info] [storage] initializing...
[2020/01/27 10:31:24] [ info] [storage] in-memory
[2020/01/27 10:31:24] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/01/27 10:31:24] [ info] [engine] started (pid=10386)
[2020/01/27 10:31:24] [ info] [sp] stream processor started
[2020/01/27 10:31:24] [ warn] [in_stdin] end of file (stdin closed by remote end)
[2020/01/27 10:31:24] [ info] [input] pausing stdin.0
{"date":1580085084.179838,"time_local":"2019-07-31T21:17:15","client_ip":"","log":"{\"time_local\":\"2019-07-31T21:17:15\",\"client_ip\":\"\"}"}
{"date":1580085084.179842,"time_local":"2019-07-31T21:17:15","client_ip":"","log":"{\"time_local\":\"2019-07-31T21:17:15\"}"}
[2020/01/27 10:31:24] [ warn] [engine] service will stop in 5 seconds
[2020/01/27 10:31:28] [ info] [engine] service stopped
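The two records above can be compared outside Fluent Bit. This is a minimal sketch that uses Python's re module as a stand-in for the regex engine, so the named groups are spelled (?P<name>...) instead of (?<name>...). The helper names captures and flatten are hypothetical and exist only for this illustration. flatten approximates what the output above shows, not the actual C code in the patch.

```python
import re

# Same shape as the Regex in parsers2.conf, with Python's named-group syntax
# and the literal braces escaped for clarity.
PATTERN = re.compile(
    r'^\{"time_local":"(?P<time_local>.*?)"(,"client_ip":"(?P<client_ip>.*?)")?\}$'
)


def captures(line):
    """Return the named groups for a line, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None


def flatten(groups):
    """Hypothetical helper: emit every group, turning a non-participating one into ""."""
    return {key: (value if value is not None else "") for key, value in groups.items()}


with_empty = captures('{"time_local":"2019-07-31T21:17:15","client_ip":""}')
without_key = captures('{"time_local":"2019-07-31T21:17:15"}')

# The regex engine itself keeps the two cases apart:
# with_empty["client_ip"] is "" and without_key["client_ip"] is None.
# After flattening, both records carry "client_ip": "" and look identical.
same_after_flatten = flatten(with_empty) == flatten(without_key)
```

In this sketch the difference between the two inputs survives until flatten runs, which is the point at which the information in the second record is lost.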
Hmm, I suggest introducing a new parser configuration property called Skip_Empty_Keys, set to true by default, so that your patch takes effect only when the property is set to false. That way, we won't break other deployments.
ping
Oh, thanks for the ping. Had completely forgotten about this one.
Updated the PR with the Skip_Empty_Keys configuration property. I will also update the documentation.
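As a rough illustration of the behavior discussed here, the following Python sketch models how a Skip_Empty_Keys switch could gate the emission of empty captures. The function build_record and its skip_empty_keys parameter are hypothetical names chosen for this example. The real change lives in the C regex parser and may differ in detail.

```python
def build_record(groups, skip_empty_keys=True):
    """Hypothetical model of the proposed property.

    groups maps regex group names to captured strings (None when an
    optional group did not participate in the match).
    """
    record = {}
    for key, value in groups.items():
        value = "" if value is None else value
        # Default (True): keep the existing behavior and drop empty values.
        if skip_empty_keys and value == "":
            continue
        record[key] = value
    return record


groups = {"time_local": "2019-07-31T21:17:15", "client_ip": ""}

default_record = build_record(groups)
# {"time_local": "2019-07-31T21:17:15"}

full_record = build_record(groups, skip_empty_keys=False)
# {"time_local": "2019-07-31T21:17:15", "client_ip": ""}
```

Keeping the default at true means existing deployments see no change in output unless they opt in.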
@nigels-com
- please fix the conflicts
- add the DCO sign-off
@nigels-com Any update on this PR? If you have forgotten about it, is it OK if I open another PR that takes the same approach?