fluent-bit
Stuck at 100 percent CPU usage with large JSON
Bug Report
Describe the bug If I have a file containing a single JSON record about 3 MB in size, Fluent Bit gets stuck, takes all the CPU, and never finishes processing that record.
To Reproduce Put this file into your home folder: myfile.json
run:
fluent-bit -c fluent-bit.conf
with the configuration shown below
fluent-bit_parsers.conf:
[PARSER]
    Name   json
    Format json
fluent-bit.conf:
[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    info
    parsers_file /home/vagrant/fluent-bit_parsers.conf
[INPUT]
    Name            tail
    Read_from_Head  True
    Tag             psql_statements
    Path            /home/vagrant/myfile.json
    parser          json
    Buffer_Max_Size 100mb
    mem_buf_limit   10mb
[OUTPUT]
    Name   file
    Match  *
    Path   /home/vagrant/
    File   result.log
    Format plain
Expected behavior I would expect Fluent Bit to handle a 3 MB JSON without any problem.
Screenshots
Your Environment
- Version used: 2.2.2
- Configuration: above
- Environment name and version: built from source
- Server type and version: ?
- Operating System and version: Centos7 on dev but also reproduced on Rocky Linux 8 in production.
- Filters and plugins: just in above configuration
Additional context I tried using filesystem storage; it did not help. Btw the .flb file was created but never processed. If I cut the JSON down to a smaller size, it works fine.
Can you please attach Fluent Bit's log when running this test?
It's clear, nothing there at all.
Hi @LukasJerabek, it seems that the bottleneck is in the output file. I'll further troubleshoot this and let you know.
Thank you. I am trying to migrate from Fluentd to Fluent Bit and this gave me a headache; luckily I was able to come up with this minimal reproducer.
@lecaros how's this one going? It seems to me like quite a big performance problem. The file plugin is probably used by many.
Hi @LukasJerabek, I don't know how common it is to have a single record of 3MB+; however, I've found where most of the CPU time is going. I've also tested a version with a minor change that helps with this particular scenario. Feel free to build from there and share your results. I'll report my findings next.
How often do you have individual records of this size?
To repro this issue, the config and log file provided by the OP are enough. I've profiled the run and noticed that most of the CPU is used during the flush callback of the File output plugin. Further code review took me to the loop below: https://github.com/fluent/fluent-bit/blob/202da1374dffd98984b2aa56dd444d49bef1dbd5/src/flb_pack.c#L1167-L1185
In particular, line 1171, where the buffer size is expanded on each iteration. Given that the record size is about 3.9 MB and the size increment is only 128 bytes, it takes tens of thousands of iterations to grow to the required size.
I've changed the increment for testing and hypothesis validation purposes to a higher value.
https://github.com/lecaros/fluent-bit/blob/f0fffb74d282919e4197bf20111d80b5c5c2baf5/src/flb_pack.c#L1171
During the tests, the behavior changed from apparently hanging with 100%+ CPU to processing the file in a few seconds.
The increment value should probably be calculated (log, exp, percentage, etc.) instead of a fixed one.
ping @edsiper
@lecaros thank you.
Btw, the reason the JSON is so big: I run a Postgres query on pg_stat_statements on one Fluent Bit instance, which sends the data to another Fluent Bit instance every 10 minutes, where, long story short, it ends up in a database on that server (btw the forward output plugin has no problem with this). Then I can query it in Grafana. When saving the data I aggregate over the whole state of pg_stat_statements at that time, so I need to send it in bulk: one JSON with the query data in an array under a "data" key. That's why the JSON is 4 MB big.
In my testing the output file was created after a few seconds, but filling it took another 40 seconds or so, during which the CPU was at 100 percent. The time will of course vary with compute power. However, this value still does not seem effective enough to me; a calculated increment as you suggested sounds good (log, exp, percentage, etc.).
@lecaros So I have tested with
size *= 2;
and that worked flawlessly. I think it is also very reasonable: if I need 1025 bytes and allocate 2048, that is not much more than needed, so doubling feels fine. On average the overshoot will be about half the requested size, not nearly the whole size as in the most extreme case. Maybe a factor between 1.5 and 2 would still be reasonable.
Do you have any more troubleshooting planned for this?
Hi @LukasJerabek, thanks for the feedback. I have no further plans for this issue. Next step should be submitting a PR. cc @edsiper
Ok thanks. Btw, I am also asking because this issue is labeled "troubleshooting".
Fixed in #8589, thank you.