
Wildcard routing not working for Loki

Open vishwa-trulioo opened this issue 2 years ago • 2 comments

Bug Report

Describe the bug

I use the following configuration on Linux hosts (Amazon Linux 2 and Ubuntu 22.04) to send metrics and logs scraped by Fluent Bit to the Loki logging service. I'm currently running 1.9.7.

[SERVICE]
    flush               5
    daemon              Off
    log_level           debug
    parsers_file        parsers.conf
    plugins_file        plugins.conf
    http_server         on
    http_listen         0.0.0.0
    http_port           2020
    storage.metrics     on

[INPUT]
    name            cpu
    tag             local.cpu
    interval_sec    5

[INPUT]
    name            mem
    tag             local.mem
    interval_sec    5

[INPUT]
    Name            disk
    Tag             local.disk
    interval_sec    5

[INPUT]
    Name                tail
    Path                /var/log/messages
    Parser              syslog-rfc5424
    Tag                 local.varlogmsg
    Refresh_Interval    5

[INPUT]
    name                fluentbit_metrics
    tag                 internal_metrics
    scrape_interval     2

[OUTPUT]
    Name            loki
    Match           *
    Host            logs-prod3.grafana.net
    port            443
    tls             on
    tls.verify      on
    http_user       XXXXXX
    http_passwd     XXXXXX
    labels          job=fluentbit,service=app-test,env=test
    label_keys      $sub['stream']

[OUTPUT]
    Name        file
    Match       *
    Path        /tmp
    File        fluentbit_output.log

When I set it up like this, only the cpu metrics are sent to Loki; all other inputs are ignored. However, I can see that the data from all the other inputs is correctly written to /tmp/fluentbit_output.log.

However, if I duplicate the loki OUTPUT block and map each copy to a single input tag, the data does show up in Loki. (See the config file below for an example.)

I expected the wildcard (*) in the Loki output's Match to capture all inputs and send them to Loki (ref: Routing with Wildcard). Please let me know if I am doing this wrong.

[SERVICE]
    flush               5
    daemon              Off
    log_level           debug
    parsers_file        parsers.conf
    plugins_file        plugins.conf
    http_server         on
    http_listen         0.0.0.0
    http_port           2020
    storage.metrics     on

[INPUT]
    name            cpu
    tag             local.cpu
    interval_sec    5

[INPUT]
    name            mem
    tag             local.mem
    interval_sec    5

[INPUT]
    Name            disk
    Tag             local.disk
    interval_sec    5

[INPUT]
    Name                tail
    Path                /var/log/messages
    Parser              syslog-rfc5424
    Tag                 local.varlogmsg
    Refresh_Interval    5

[INPUT]
    Name                tail
    Path                /var/log/secure
    Parser              syslog-rfc5424
    Tag                 local.sshlog
    Refresh_Interval    5

[INPUT]
    name                fluentbit_metrics
    tag                 internal_metrics
    scrape_interval     2

[OUTPUT]
    Name            loki
    Match           local.cpu
    Host            logs-prod3.grafana.net
    port            443
    tls             on
    tls.verify      on
    http_user       XXXXXX
    http_passwd     XXXXXX
    labels          job=fluentbit,service=app-test,env=test
    label_keys      $sub['stream']

[OUTPUT]
    Name            loki
    Match           local.mem
    Host            logs-prod3.grafana.net
    port            443
    tls             on
    tls.verify      on
    http_user       XXXXXX
    http_passwd     XXXXXX
    labels          job=fluentbit,service=app-test,env=test
    label_keys      $sub['stream']

[OUTPUT]
    Name            loki
    Match           local.varlogmsg
    Host            logs-prod3.grafana.net
    port            443
    tls             on
    tls.verify      on
    http_user       XXXXXX
    http_passwd     XXXXXX
    labels          job=fluentbit,service=app-test,env=test
    label_keys      $sub['stream']

[OUTPUT]
    Name        file
    Match       *
    Path        /tmp
    File        fluentbit_output.log

vishwa-trulioo avatar Sep 08 '22 00:09 vishwa-trulioo

Wildcard routing has worked fine for me exactly like that with Loki. Can you check there are no errors/warnings in the Fluent Bit logs? I'm not sure what happens if a key is missing as well so maybe try removing that? label_keys $sub['stream']

I was doing this for a blog post a while back so have some Grafana Cloud examples as well: https://github.com/calyptia/openshift-fluent-bit-examples

I use wildcard routing and it was fine: https://github.com/calyptia/openshift-fluent-bit-examples/blob/427e1adb0e89bd5992d2df222af4a9ecf15d6a38/grafana-cloud/values-grafana-cloud.yaml#L16-L25

patrick-stephens avatar Sep 08 '22 11:09 patrick-stephens

@patrick-stephens Yes, I tried without label_keys $sub['stream'] in the config a moment ago as well. It behaves the same. Again, the issue is not that it doesn't work at all; rather, the Loki output picks up only one INPUT source. Is there any other data I can provide to help with the investigation?

vishwa-trulioo avatar Sep 08 '22 15:09 vishwa-trulioo

Hello @patrick-stephens, I reproduced this issue using:

  • Fluent-bit 1.9.7
  • Latest Loki and Grafana docker image
  • GCP e2-micro Instance running Ubuntu 20.04
  • Fluent-bit config:
[SERVICE]
    flush               1
    daemon              Off
    parsers_file        ../conf/parsers.conf
    log_level           debug

[INPUT]
    Name docker
    Include  d9f819b89974 1479cbb42d71
    Tag my_tag2
    Interval_Sec 10 

[INPUT]
    name            cpu
    Tag             my_tag3
    interval_sec    10

[INPUT]
    name            mem
    Tag             my_tag4
    interval_sec    10

[INPUT]
    Name            disk
    Tag             my_tag5
    interval_sec    10

[Output]
    Name loki
    Match *
    Host 127.0.0.1
    port 3100
    Labels job=fluent

When running this config with interval_sec set to 10 or 5, there are no issues on my end: the data for all the configured plugins is shown in Grafana. But as soon as I change this setting to anything closer to 1, the output for these plugins is not sent to Loki, and after terminating Fluent Bit you can see the count of pending tasks for every plugin that did not reach Loki. With interval_sec=1, Fluent Bit only sends data for the first plugin configured in the config file, which in my case is the docker input plugin.

As @vishwa-trulioo mentioned, if you add a Loki output targeting each input tag, all the data is received in Loki and shown in Grafana.

This is also happening in Fluent Bit 2.0: the data is not making it to Loki. As I mentioned, if you send the data from cpu, mem, docker, and disk to the standard output, you'll see the data from all these plugins, but it is not reaching Loki.

[2022/09/27 11:09:28] [debug] [upstream] KA connection #38 to 127.0.0.1:3100 is now available
[2022/09/27 11:09:28] [debug] [out flush] cb_destroy coro_id=42
[2022/09/27 11:09:28] [debug] [task] destroy task=0x7fd3d000ed80 (task_id=0)
^C[2022/09/27 11:09:28] [engine] caught signal (SIGINT)
[2022/09/27 11:09:28] [ info] [input] pausing docker.0
[2022/09/27 11:09:28] [ info] [input] pausing cpu.1
[2022/09/27 11:09:28] [debug] [task] created task=0x7fd3d000ed80 id=0 OK
[2022/09/27 11:09:28] [debug] [task] created task=0x7fd3d18ae0b0 id=129 OK
[2022/09/27 11:09:28] [debug] [task] created task=0x7fd3d18aca70 id=130 OK

. .

[2022/09/27 11:09:32] [debug] [output:loki:loki.0] 127.0.0.1:3100, HTTP status=204
[2022/09/27 11:09:32] [debug] [upstream] KA connection #38 to 127.0.0.1:3100 is now available
[2022/09/27 11:09:32] [debug] [out flush] cb_destroy coro_id=49
[2022/09/27 11:09:32] [debug] [task] destroy task=0x7fd3d0010a50 (task_id=15)
[2022/09/27 11:09:33] [ info] [task] docker/docker.0 has 0 pending task(s):
[2022/09/27 11:09:33] [ info] [task] cpu/cpu.1 has 38 pending task(s):
[2022/09/27 11:09:33] [ info] [task] task_id=18 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] task_id=21 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] task_id=24 still running on route(s): loki/loki.0
. . .
[2022/09/27 11:09:33] [ info] [task] task_id=123 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] task_id=126 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] task_id=129 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] mem/mem.2 has 44 pending task(s):
[2022/09/27 11:09:33] [ info] [task] task_id=2 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] task_id=4 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] task_id=7 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] task_id=10 still running on route(s): loki/loki.0
. . . .
[2022/09/27 11:09:33] [ info] [task] task_id=124 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] task_id=127 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] task_id=130 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] disk/disk.3 has 43 pending task(s):
[2022/09/27 11:09:33] [ info] [task] task_id=5 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] task_id=8 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] task_id=11 still running on route(s): loki/loki.0
[2022/09/27 11:09:33] [ info] [task] task_id=14 still running on route(s): loki/loki.0

Full debug log attached: FB-debug.odt
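
For anyone comparing their own shutdown output against this one, the "has N pending task(s)" lines can be tallied per input with a few lines of Python. This is only a rough sketch: the regular expression is written against the log lines quoted in this comment, and the helper name pending_by_input is made up for illustration, so other Fluent Bit versions may need a different pattern.

```python
import re

# Matches shutdown lines such as:
#   [ info] [task] cpu/cpu.1 has 38 pending task(s):
# The pattern is an assumption based on the log excerpt in this thread.
PENDING_RE = re.compile(r"\[task\] (\S+)/(\S+) has (\d+) pending task\(s\)")


def pending_by_input(log_text):
    """Return {input_instance: pending_count} found in Fluent Bit shutdown output."""
    return {m.group(2): int(m.group(3)) for m in PENDING_RE.finditer(log_text)}


# Sample lines copied from the excerpt in this comment.
sample = """\
[2022/09/27 11:09:33] [ info] [task] docker/docker.0 has 0 pending task(s):
[2022/09/27 11:09:33] [ info] [task] cpu/cpu.1 has 38 pending task(s):
[2022/09/27 11:09:33] [ info] [task] mem/mem.2 has 44 pending task(s):
[2022/09/27 11:09:33] [ info] [task] disk/disk.3 has 43 pending task(s):
"""

for name, count in pending_by_input(sample).items():
    print(f"{name}: {count} pending")
```

An input that always reports 0 pending while the others report dozens is the pattern seen here: only the first configured input is getting through to Loki.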

RicardoAAD avatar Sep 27 '22 12:09 RicardoAAD

Hello @vishwa-trulioo

The Loki output plugin disabled processing multiple tasks per flush because Loki historically did not support out-of-order writes. Now it does, and the PR by @sflanker (https://github.com/fluent/fluent-bit/pull/6136), which removes the FLB_OUTPUT_NO_MULTIPLEX flag from the Loki output plugin, solves the problem you described in this issue.

This was recently merged into the master branch and tested using a configuration almost identical to the one you provided when this issue was opened.
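
To illustrate why a single non-multiplexed output can starve all but the first input, here is a toy model in Python. It is not Fluent Bit's actual scheduler. It assumes that every input produces one task per tick, that one dispatch pass per tick walks the inputs in configuration order, and that an output without multiplexing accepts only one task per pass. The function name simulate and its parameters are hypothetical.

```python
from collections import deque


def simulate(inputs, ticks, multiplex):
    """Toy dispatcher (an assumption-laden sketch, not Fluent Bit code).

    Each tick every input enqueues one task, then a single dispatch pass
    visits the inputs in configuration order and hands tasks to one output.
    """
    pending = {name: deque() for name in inputs}
    delivered = {name: 0 for name in inputs}
    for tick in range(ticks):
        for name in inputs:
            pending[name].append(tick)
        busy = False  # assume the output has finished its previous task
        for name in inputs:
            while pending[name]:
                if busy and not multiplex:
                    break  # output already has a task in flight: skip this input
                pending[name].popleft()
                delivered[name] += 1
                busy = True
    return delivered, {name: len(queue) for name, queue in pending.items()}


inputs = ["docker", "cpu", "mem", "disk"]
print(simulate(inputs, ticks=10, multiplex=False))
print(simulate(inputs, ticks=10, multiplex=True))
```

In this model the non-multiplexed run delivers only the docker tasks and leaves the other three inputs with a growing backlog, while the multiplexed run delivers everything. The real log above shows cpu with slightly fewer pending tasks than mem and disk, so the actual behaviour is less strict than this sketch.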

Config file:

[SERVICE]
    flush               5
    daemon              Off
    log_level           debug
    parsers_file        ../../conf/parsers.conf
    plugins_file        plugins.conf
    http_server         on
    http_listen         0.0.0.0
    http_port           2020
    storage.metrics     on

[INPUT]
    name            cpu
    tag             local.cpu
    interval_sec    5

[INPUT]
    name            mem
    tag             local.mem
    interval_sec    5

[INPUT]
    Name            disk
    Tag             local.disk
    interval_sec    5

[INPUT]
    Name                tail
    Path                /var/log/syslog
    Parser              syslog-rfc5424
    Tag                 local.syslog
    Refresh_Interval    5

[INPUT]
    name                fluentbit_metrics
    tag                 internal_metrics
    scrape_interval     2

[Output]
    Name loki
    Match *
    Host 127.0.0.1
    port 3100
    Labels job=fluentbit,service=app,env=test
    label_keys $sub['stream']

[OUTPUT]
    Name        file
    Match       *
    Path        /tmp
    File        fluentbit_output.log


You can follow these articles to build and test Fluent Bit v2.0, which includes this change. As it is not an official release yet, it is not intended for production environments. https://docs.fluentbit.io/manual/installation/sources/download-source-code https://docs.fluentbit.io/manual/v/2.0-pre/installation/sources/build-and-install

Please note: the master branch will become our next release, 2.0. You can also test it with an unofficial image: https://github.com/fluent/fluent-bit/tree/master/dockerfiles#ghcrio-topology

RicardoAAD avatar Oct 11 '22 13:10 RicardoAAD

@RicardoAAD Thank you very much for working it out. I'm looking forward to testing it on my side as soon as Fluent Bit 2.0 is released. For the time being I will set Interval_sec to 10. Thanks once again; I appreciate the assistance and clarifications.

vishwa-trulioo avatar Oct 12 '22 16:10 vishwa-trulioo

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Jan 11 '23 02:01 github-actions[bot]

@vishwa-trulioo is this resolved now?

patrick-stephens avatar Jan 11 '23 10:01 patrick-stephens

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Apr 13 '23 01:04 github-actions[bot]