[out_loki] A lot of warning 'Tenant ID is overwritten A -> B' if tenant_id_key is used

Open YevhenLodovyi opened this issue 9 months ago • 2 comments

Hello,

Your Environment

  • Version used: 3.0.2
  • Environment name and version (e.g. Kubernetes? What version?): eks

I am using Fluent Bit (flb) to send logs to Loki. I am trying to separate logs, so I am using multi-tenancy. I have a Lua script that generates the tenant_id, so in the output I have:

      [OUTPUT]
          name        loki
          match       kube.*
          host        loki.internal
          port        3100
          tls         on
          tls.verify  off
          tenant_id_key tenant_id
          labels      source=eks,namespace=$kubernetes['namespace_name'],container=$kubernetes['container_name'],app=$kubernetes['labels']['app']
          remove_keys stream,_p,$kubernetes['labels']
          compress    gzip
          Retry_Limit False

As far as I can see, the logs are distributed properly, but I get a lot of warnings:

[2024/05/09 13:19:22] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:23] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:23] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:24] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:24] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:25] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:25] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:26] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:26] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:27] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:27] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:28] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:30] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:31] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:32] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:33] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:34] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:35] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:36] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:37] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:41] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:42] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:43] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:44] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:44] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:45] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:45] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:46] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra

The warning is emitted here: https://github.com/fluent/fluent-bit/blob/master/plugins/out_loki/loki.c#L1152

YevhenLodovyi avatar May 09 '24 13:05 YevhenLodovyi

Why are you using Fluent Bit rather than Promtail to send data to Loki? I think Promtail is a good fit when working with Loki.

zhangzx1996 avatar May 14 '24 02:05 zhangzx1996

@YevhenLodovyi, are your log entries separated into different chunks, e.g. by making sure that they get re-emitted with a tag that contains the actual value of the tenant_id? Have a look at https://github.com/fluent/fluent-bit/issues/2935#issuecomment-808657942, which describes the issue quite well. It would be helpful if you provided the Lua script and the rest of the fluent-bit config.
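To make the chunk-alignment point concrete, here is a minimal Python sketch. It is a simplified model and not the plugin's actual code: it assumes one chunk is sent as one request carrying a single tenant ID, so each record whose tenant differs from the previous one produces a warning. The names `flush_chunk` and `split_by_tenant` are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def flush_chunk(records, tenant_id_key="tenant_id"):
    """Simplified model: one chunk maps to one request with one tenant ID."""
    tenant = None
    warnings = []
    for record in records:
        value = record.get(tenant_id_key)
        if value is None:
            continue
        # A different tenant inside the same chunk replaces the previous one.
        if tenant is not None and tenant != value:
            warnings.append(f"Tenant ID is overwritten {tenant} -> {value}")
        tenant = value
    return tenant, warnings


def split_by_tenant(records, tenant_id_key="tenant_id"):
    """Group records so that every group carries exactly one tenant ID."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(tenant_id_key)].append(record)
    return dict(groups)


# A chunk that mixes tenants, as happens when all records share one tag.
mixed = [
    {"tenant_id": "infra", "log": "a"},
    {"tenant_id": "apps", "log": "b"},
    {"tenant_id": "infra", "log": "c"},
]

tenant, warnings = flush_chunk(mixed)
aligned = split_by_tenant(mixed)
```

In this model the mixed chunk yields two warnings and ends up with whichever tenant came last, while each group in `aligned` flushes without warnings. Re-emitting records under a tag that contains the tenant value has the same effect as `split_by_tenant`: records of different tenants no longer share a chunk.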

If your chunks are well aligned:

We are possibly encountering the same issue, where a race condition seems to be involved. I guess it was introduced (or at least not fixed) by PR #6931, where dynamic_tenant_id gets shared within the thread. I guess that, because coroutines are used, the value changes between loki_compose_payload and flb_http_add_header.

But I have to admit that I don't see a context switch in between. Maybe @leonardo-albertovich could shed some light on it?
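For illustration only, the suspected failure mode can be modelled with Python generators standing in for coroutines. This is a sketch under a strong assumption: it places a context switch between "compose payload" and "add header", which is exactly the switch that has not been found in the real code. The names `flush` and `run_interleaved` are hypothetical.

```python
def flush(shared, chunk_tenant, use_local):
    """Model one flush: compose the payload, pause, then build the header."""
    if use_local:
        tenant = chunk_tenant              # value private to this flush
    else:
        shared["tenant"] = chunk_tenant    # value shared by all flushes
    yield  # assumed context switch between composing and adding the header
    header = tenant if use_local else shared["tenant"]
    yield {"payload_tenant": chunk_tenant, "header": header}


def run_interleaved(use_local):
    """Interleave two flushes so both compose before either adds its header."""
    shared = {"tenant": None}
    a = flush(shared, "customer1", use_local)
    b = flush(shared, "customer2", use_local)
    next(a)  # flush A composes its payload
    next(b)  # flush B composes its payload, overwriting the shared value
    return next(a), next(b)


shared_result = run_interleaved(use_local=False)
local_result = run_interleaved(use_local=True)
```

With the shared value, flush A ends up with a payload for `customer1` and a header for `customer2`; with a per-flush value, header and payload always agree. If the real code never switches context at that point, this model does not apply and the cause lies elsewhere.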

We are nevertheless quite sure that the value changes in between: we captured the traffic using tcpdump, and the chunks are well aligned and contain log messages from (in our case) distinct namespaces only, yet they carry the wrong X-Scope-OrgID header.

We have configured Tenant_id_key to customer and the request looks like this:

POST /loki/api/v1/push HTTP/1.1
Host: loki-gateway.logging.svc:80
Content-Length: 944
User-Agent: Fluent-Bit
Content-Type: application/json
X-Scope-OrgID: customer1
Connection: keep-alive

{"streams":[{"stream":{"job":"fluent-bit","node":"worker11","namespace":"customer2-namespace"},"values":[["1715604287815933056","{\"stream\":\"stdout\",\"logtag\":\"F\",\"message\":\"INFO      trace_generator_slow - generate_slow_traces - SlowOperation created in customer2-namespace - Trace ID: 95f6274c96752cf944547abda39e24b1, Span ID: 5eaddc539554b1ec - 13/05/2024 Monday 12:44:47\",\"namespace_name\":\"customer2-namespace\",\"host\":\"worker11\",\"kubernetes\":{\"pod_name\":\"trace-generator-slow-job-swzcs\",\"namespace_name\":\"customer2-namespace\",\"container_name\":\"trace-generator-slow\",\"labels\":{\"batch.kubernetes.io/controller-uid\":\"f935d0ea-6e29-4be8-a379-0126cb2d2b8d\",\"batch.kubernetes.io/job-name\":\"trace-generator-slow-job\",\"controller-uid\":\"f935d0ea-6e29-4be8-a379-0126cb2d2b8d\",\"job-name\":\"trace-generator-slow-job\"}},\"customer\":\"customer2\",\"cluster\":\"cluster-name\"}"]]}]}
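A mismatch like the one in this capture can be detected mechanically. The following Python sketch assumes the log lines are JSON objects that contain the tenant key (`customer` here), as in the request above; `find_tenant_mismatches` is a hypothetical helper, not part of Fluent Bit or Loki, and the sample body is a trimmed-down stand-in for the captured one.

```python
import json


def find_tenant_mismatches(org_id, body, tenant_id_key="customer"):
    """Return payload tenant values that differ from the X-Scope-OrgID header."""
    mismatches = []
    payload = json.loads(body)
    for stream in payload.get("streams", []):
        for entry in stream.get("values", []):
            line = entry[1]  # each entry is [timestamp, line, ...]
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # line is not JSON, so there is nothing to compare
            if not isinstance(record, dict):
                continue
            value = record.get(tenant_id_key)
            if value is not None and value != org_id:
                mismatches.append(value)
    return mismatches


# Trimmed-down body with the same shape as the captured request.
body = json.dumps({
    "streams": [{
        "stream": {"job": "fluent-bit", "namespace": "customer2-namespace"},
        "values": [[
            "1715604287815933056",
            json.dumps({"stream": "stdout", "customer": "customer2"}),
        ]],
    }]
})

mismatches = find_tenant_mismatches("customer1", body)
```

For the header value `customer1` this reports `customer2` as a mismatch, which matches what the tcpdump capture shows. Running such a check over all captured requests would show how often the header and the payload disagree.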

If this is a separate topic, I'll file an additional issue.

Update: We recompiled fluent-bit with FLB_OUTPUT_SYNCHRONOUS flag set in out_loki.c. The issue still persists. Either the cause is something else, or we misinterpreted the meaning of the flag.

cm-rudolph avatar May 14 '24 11:05 cm-rudolph

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Aug 17 '24 01:08 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Aug 22 '24 01:08 github-actions[bot]

Still relevant.

aston-r avatar Aug 22 '24 05:08 aston-r