fluent-bit icon indicating copy to clipboard operation
fluent-bit copied to clipboard

multiline: introduce static cri parser backend, providing 9mb/s performance gain over current regex cri parser

Open ryanohnemus opened this issue 1 year ago • 2 comments

As part of #9399, I have been looking into performance improvements in fluent-bit. Specifically for apps in k8s that are creating many logs within a short amount of time (ie logging >15mb/s).

Test setup: Hardware: GCP n2-standard-8 w/ ssd bootdisk With config: (see full config in example config section)

  inputs:
    - name: tail
      path: /app/cache/containers/*.log
      tag: kube.*
      skip_long_lines: on
      refresh_interval: 1
      buffer_chunk_size: 250K
      buffer_max_size: 250K
      threaded: on
      rotate_wait: 300
      storage.type: filesystem
  outputs:
    - name: "null"
      match: "*"
      workers: 3
  1. with no multiline.parser: cri (or any parser present) - i am able to achieve - 49.224Mb/s throughput
  2. adding a standard cri parser - regex (current) version - fluent-bit 3.1.8
  • multiline.parser: cri added to tail input above - throughput is 18.571Mb/s
  1. using the cri parser from this PR -
  • multiline.parser: criadded to tail input above - throughput is 27.572Mb/s ~9mb/s gain by not using a regex parser

Improving the multiline parsing speed should also considerably cut down on issues from log tail rotations (log rotations missed when fluentbit is backed up), since parsing is much more performant.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing Before we can approve your change; please submit the following in a comment:

  • [X ] Example configuration file for the change Using the k8s_perf_test from #9399 with the following service config:
service:
  flush: 0.5
  log_level: info
  daemon: off
  parsers_file: parsers.conf
  http_server: on
  http_listen: 0.0.0.0
  http_port: 2030
  storage.metrics: on
  storage.path: /app/cache/flb_storage
  storage.backlog.mem_limit: 512M
  storage.delete_irrecoverable_chunks: on
  storage.total_limit_size: 25G
pipeline:
  inputs:
    - name: tail
      path: /app/cache/containers/*.log
      tag: kube.*
      skip_long_lines: on
      refresh_interval: 1
      buffer_chunk_size: 250K
      buffer_max_size: 250K
      threaded: on
      multiline.parser: cri
      rotate_wait: 300
      storage.type: filesystem
  outputs:
    - name: "null"
      match: "*"
      workers: 3
  • [ ] Debug log output from testing the change
  • [X] Attached Valgrind output that shows no leaks or memory corruption was found
valgrind --leak-check=full ./bin/flb-it-multiline
...

SUCCESS: All unit tests have passed.
==45752== 
==45752== HEAP SUMMARY:
==45752==     in use at exit: 0 bytes in 0 blocks
==45752==   total heap usage: 22,203 allocs, 22,203 frees, 7,605,275 bytes allocated
==45752== 
==45752== All heap blocks were freed -- no leaks are possible
==45752== 
==45752== Use --track-origins=yes to see where uninitialised values come from
==45752== For lists of detected and suppressed errors, rerun with: -s
==45752== ERROR SUMMARY: 32 errors from 2 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [N/A] Documentation required for this feature
  • no new documentation, parser is meant to be backward compatible

ryanohnemus avatar Sep 24 '24 16:09 ryanohnemus

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Dec 26 '24 02:12 github-actions[bot]

@edsiper @patrick-stephens - I would still like this to be reviewed, it gave a decent performance boost to cri container log processing

ryanohnemus avatar Dec 26 '24 19:12 ryanohnemus

it looks like some CI is failing but it's too old to see why, could you poke it with a push to see what's going on? There's no point reviewing it if unit tests or fuzzing is failing as that will need fixing in order to be merged so would need reviewing again then.

patrick-stephens avatar Jan 07 '25 14:01 patrick-stephens

@patrick-stephens just rebased and pushed... I will check the job output once they run

ryanohnemus avatar Jan 07 '25 14:01 ryanohnemus

Looks like fuzzing is failing but that's not my area of expertise @DavidKorczynski may be able to help point you at the right things.

patrick-stephens avatar Jan 07 '25 15:01 patrick-stephens

Looks like fuzzing is failing but that's not my area of expertise @DavidKorczynski may be able to help point you at the right things.

will take a look later today

DavidKorczynski avatar Jan 07 '25 16:01 DavidKorczynski

@ryanohnemus this looks pretty good, we will do a review on it shortly; in the meanwhile would you please add unit tests for it ?

edsiper avatar Jan 07 '25 18:01 edsiper

@edsiper - The original 'cri' parser was just instantiating a regex parser behind the scenes, this change now creates the new cri parser which doesn't use regex.

Since this code isn't opt-in I used the existing test for cri parsing while I was developing: https://github.com/fluent/fluent-bit/blob/master/tests/internal/multiline.c#L495

ryanohnemus avatar Jan 07 '25 22:01 ryanohnemus

Regarding the fuzzing issue, then it's simply reporting on a leak in the code:

=================================================================
==118==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 16384 byte(s) in 2 object(s) allocated from:
    #0 0x55eb7d51561c in realloc /src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:82:3
    #1 0x55eb7d5c71f4 in msgpack_sbuffer_write /src/fluent-bit/lib/msgpack-c/include/msgpack/sbuffer.h:81:15
    #2 0x55eb7d5c71f4 in msgpack_pack_map /src/fluent-bit/lib/msgpack-c/include/msgpack/pack_template.h:745:9
    #3 0x55eb7d5c71f4 in flb_parser_cri_do /src/fluent-bit/src/flb_parser_cri.c:74:5
    #4 0x55eb7d5b1a0c in ml_append_try_parser_type_text /src/fluent-bit/src/multiline/flb_ml.c:501:15
    #5 0x55eb7d5abb32 in ml_append_try_parser_type_map /src/fluent-bit/src/multiline/flb_ml.c:557:28
    #6 0x55eb7d5abb32 in ml_append_try_parser /src/fluent-bit/src/multiline/flb_ml.c:604:15
    #7 0x55eb7d5ac646 in flb_ml_append_object /src/fluent-bit/src/multiline/flb_ml.c:846:15
    #8 0x55eb7d552ac8 in test_multiline_parser /src/fluent-bit/tests/internal/fuzzers/multiline_fuzzer.c:95:17
    #9 0x55eb7d55340e in LLVMFuzzerTestOneInput /src/fluent-bit/tests/internal/fuzzers/multiline_fuzzer.c:166:13
    ```

DavidKorczynski avatar Jan 07 '25 22:01 DavidKorczynski

@DavidKorczynski are there instructions on running the fuzzer locally?

ryanohnemus avatar Jan 09 '25 18:01 ryanohnemus

@patrick-stephens @edsiper - i fixed the fuzzer issues and it should be ready for review but now all of the unit tests are failing on a custom_calyptia.sh which seems new and unrelated to this change.

ryanohnemus avatar Jan 10 '25 22:01 ryanohnemus

@patrick-stephens @edsiper - i fixed the fuzzer issues and it should be ready for review but now all of the unit tests are failing on a custom_calyptia.sh which seems new and unrelated to this change.

Yup I'm looking at that, the test is triggering a spurious failure in the framework on macOS so I'll resolve but it's unrelated to your change so can be ignored

patrick-stephens avatar Jan 11 '25 19:01 patrick-stephens

@patrick-stephens @edsiper - i fixed the fuzzer issues and it should be ready for review but now all of the unit tests are failing on a custom_calyptia.sh which seems new and unrelated to this change.

Yup I'm looking at that, the test is triggering a spurious failure in the framework on macOS so I'll resolve but it's unrelated to your change so can be ignored

@ryanohnemus this should be resolved I think now on latest master.

patrick-stephens avatar Jan 13 '25 10:01 patrick-stephens