
Stuck at 100 percent CPU usage with large JSON

Open LukasJerabek opened this issue 1 year ago • 10 comments

Bug Report

Describe the bug If I have a file containing a single JSON record about 3 MB in size, Fluent Bit gets stuck, takes all the CPU, and never finishes processing that record.

To Reproduce Put this file into your home folder: myfile.json

run: fluent-bit -c fluent-bit.conf

with the configuration below

fluent-bit_parsers.conf:

[PARSER]
    Name   json
    Format json

fluent-bit.conf:

[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    info
    parsers_file /home/vagrant/fluent-bit_parsers.conf

[INPUT]
    Name            tail
    Read_from_Head  True
    Tag             psql_statements
    Path            /home/vagrant/myfile.json
    parser          json
    Buffer_Max_Size 100mb
    mem_buf_limit   10mb

[OUTPUT]
    Name file
    Match *
    Path /home/vagrant/
    File result.log
    Format plain

Expected behavior I would expect Fluent Bit to handle a 3 MB JSON record without any problem.

Screenshots: (screenshot attached)

Your Environment

  • Version used: 2.2.2
  • Configuration: above
  • Environment name and version: built from source
  • Server type and version: ?
  • Operating System and version: Centos7 on dev but also reproduced on Rocky Linux 8 in production.
  • Filters and plugins: just in above configuration

Additional context I tried using filesystem storage; it did not help. By the way, the .flb file was created but never processed. If I cut the JSON down to a smaller size, it works fine.

LukasJerabek avatar Feb 16 '24 11:02 LukasJerabek

Can you please attach Fluent Bit's log from running this test?

lecaros avatar Feb 16 '24 13:02 lecaros

It's clean, nothing in there at all.

LukasJerabek avatar Feb 16 '24 17:02 LukasJerabek

Hi @LukasJerabek, it seems that the bottleneck is in the output file. I'll further troubleshoot this and let you know.

lecaros avatar Feb 16 '24 22:02 lecaros

Thank you. I am trying to migrate from Fluentd to Fluent Bit and this gave me a headache; luckily I was able to come up with this minimal reproducer.

LukasJerabek avatar Feb 16 '24 22:02 LukasJerabek

@lecaros how's this one going? It seems like quite a big performance problem to me, and the file plugin is probably used by many.

LukasJerabek avatar Feb 21 '24 11:02 LukasJerabek

Hi @LukasJerabek, I don't know how common it is to have a single record of 3MB+; however, I've found where most of the CPU time is going. I've also tested a version with a minor change that helps with this particular scenario. Feel free to build from there and share your results. I'll report my findings next.

How often do you have individual records of this size?

lecaros avatar Feb 21 '24 19:02 lecaros

To repro this issue, the config and log file provided by the OP are enough. I've profiled the run and noticed that most of the CPU is used during the flush callback of the File output plugin. Further code review took me to the loop below: https://github.com/fluent/fluent-bit/blob/202da1374dffd98984b2aa56dd444d49bef1dbd5/src/flb_pack.c#L1167-L1185

In particular, line 1171, where the buffer is expanded on each iteration. Given that the record size is about 3.9 MB and each increment is only 128 bytes, it takes tens of thousands of iterations to grow the buffer to the required size.

I've changed the increment for testing and hypothesis validation purposes to a higher value.

https://github.com/lecaros/fluent-bit/blob/f0fffb74d282919e4197bf20111d80b5c5c2baf5/src/flb_pack.c#L1171

During the tests, the behavior changed from apparently hanging with 100%+ CPU to processing the file in a few seconds.

The increment value should probably be calculated (log, exp, percentage, etc.) instead of a fixed one.

lecaros avatar Feb 21 '24 19:02 lecaros

ping @edsiper

lecaros avatar Feb 21 '24 19:02 lecaros

@lecaros thank you.

By the way, here is why the JSON is so big: I query pg_stat_statements on one Fluent Bit instance, which sends the data to another Fluent Bit instance every 10 minutes, where (long story short) it ends up in a database on that server (the forward output plugin, incidentally, has no problem with this). Then I can query it in Grafana. But when saving this data I aggregate over the whole state of pg_stat_statements at that moment, so I need to send the records in bulk: one JSON with the query data in an array under a "data" key. That is why the JSON is 4 MB big.

LukasJerabek avatar Feb 22 '24 08:02 LukasJerabek

In my testing the output file was created after a few seconds, but filling it took another 40 seconds or so, during which the CPU was at 100 percent. The time will of course vary with the machine's power. However, this value still does not seem effective enough to me; a calculated increment as you suggested (log, exp, percentage, etc.) sounds good.

LukasJerabek avatar Feb 22 '24 16:02 LukasJerabek

@lecaros So I have tested with

         size *= 2; 

and that worked flawlessly. I think it is also quite reasonable: if I need 1025 bytes and allocate 2048, that is not much more than needed, so doubling feels fine. On average the overshoot will be about half the requested size, not the whole size as in the most extreme example. Maybe a factor between 1.5 and 2 would still be reasonable.
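A geometric-growth helper along these lines might look like the sketch below. This is an illustration only, not the actual patch that landed; the `grow_buffer` name and the 256-byte floor are invented for the example.

```c
#include <stdlib.h>

/* Hypothetical helper: grow a heap buffer by doubling until it can
 * hold `needed` bytes. On allocation failure it returns NULL and
 * leaves the caller's original buffer untouched. */
static char *grow_buffer(char *buf, size_t *size, size_t needed)
{
    size_t new_size = (*size > 0) ? *size : 256; /* assumed floor */
    char *tmp;

    while (new_size < needed) {
        new_size *= 2;
    }
    if (new_size == *size) {
        return buf; /* already large enough, nothing to do */
    }
    tmp = realloc(buf, new_size); /* realloc(NULL, n) acts as malloc */
    if (tmp == NULL) {
        return NULL;
    }
    *size = new_size;
    return tmp;
}
```

With doubling, the total bytes copied across all reallocations stay proportional to the final size (amortized constant cost per byte), instead of growing quadratically as with a fixed 128-byte step.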

Do you have any more troubleshooting planned for this?

LukasJerabek avatar Mar 11 '24 15:03 LukasJerabek

Hi @LukasJerabek, thanks for the feedback. I have no further plans for this issue. Next step should be submitting a PR. cc @edsiper

lecaros avatar Mar 11 '24 15:03 lecaros

Ok, thanks. By the way, I am also asking because this issue is labeled "troubleshooting".

LukasJerabek avatar Mar 11 '24 16:03 LukasJerabek

Fixed in #8589, thank you.

edsiper avatar Mar 15 '24 23:03 edsiper