[out_loki] A lot of warnings 'Tenant ID is overwritten A -> B' if tenant_id_key is used
Hello,
Your Environment
- Version used: 3.0.2
- Environment name and version (e.g. Kubernetes? What version?): eks
I am using Fluent Bit to send logs to Loki. I am trying to separate logs by tenant, so I am using multi-tenancy. I have a Lua script that generates the tenant_id, so in the output I have:
[OUTPUT]
name loki
match kube.*
host loki.internal
port 3100
tls on
tls.verify off
tenant_id_key tenant_id
labels source=eks,namespace=$kubernetes['namespace_name'],container=$kubernetes['container_name'],app=$kubernetes['labels']['app']
remove_keys stream,_p,$kubernetes['labels']
compress gzip
Retry_Limit False
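For reference, a Lua filter that derives tenant_id from the record usually looks something like the sketch below. The actual script is not included in this report, so the namespace-to-tenant mapping, script path, and function name are purely illustrative:

-- tenant.lua (illustrative sketch; the real mapping is not part of this report)
-- The Lua filter calls this function for every record and expects
-- (code, timestamp, record) back; code = 1 means "record modified, keep it".
function set_tenant(tag, timestamp, record)
    local k8s = record["kubernetes"]
    local ns = (k8s and k8s["namespace_name"]) or ""
    if string.find(ns, "infra", 1, true) then
        record["tenant_id"] = "infra"
    else
        record["tenant_id"] = "apps"
    end
    return 1, timestamp, record
end

with a matching filter section such as:

[FILTER]
name lua
match kube.*
script /fluent-bit/scripts/tenant.lua
call set_tenant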
As far as I can see the logs are distributed properly, but I get a lot of these warnings:
[2024/05/09 13:19:22] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:23] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:23] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:24] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:24] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:25] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:25] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:26] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:26] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:27] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:27] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:28] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:30] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:31] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:32] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:33] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:34] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:35] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:36] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:37] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:41] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:42] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:43] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:44] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:44] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:45] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
[2024/05/09 13:19:45] [ warn] [output:loki:loki.0] Tenant ID is overwritten infra -> apps
[2024/05/09 13:19:46] [ warn] [output:loki:loki.0] Tenant ID is overwritten apps -> infra
The warning is defined here: https://github.com/fluent/fluent-bit/blob/master/plugins/out_loki/loki.c#L1152
Why did you use Fluent Bit rather than promtail to send data to Loki? I think promtail is a good choice when working with Loki.
@YevhenLodovyi, are your log entries separated into different chunks, e.g. by making sure that they get re-emitted with a tag that contains the actual value of the tenant_id? Have a look at https://github.com/fluent/fluent-bit/issues/2935#issuecomment-808657942, which describes the issue quite well. It would be helpful if you provided the Lua script and the rest of the Fluent Bit config.
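For context, the re-emit approach mentioned above is typically done with the rewrite_tag filter; a minimal sketch, assuming the record already carries a tenant_id key (the tag prefix and emitter name are illustrative):

[FILTER]
name rewrite_tag
match kube.*
rule $tenant_id ^(.+)$ tenant.$tenant_id.$TAG false
emitter_name re_emitted

[OUTPUT]
name loki
match tenant.*
tenant_id_key tenant_id
# remaining settings as in the original [OUTPUT] above

With this, every re-emitted tag (and therefore every chunk) contains records for a single tenant only, so a flush should only ever see one tenant id.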
If your chunks are well aligned: we are possibly encountering the same issue, where a race condition seems to be involved. I suspect it was introduced (or at least not fixed) by PR #6931, where dynamic_tenant_id is shared within the thread. My guess is that, due to the use of coroutines, the value changes between loki_compose_payload and flb_http_add_header.
I have to admit, though, that I don't see a context switch in between. Maybe @leonardo-albertovich could shed some light on it?
We are nevertheless quite sure that the value does change in between: we captured the traffic using tcpdump, and the chunks are well aligned and contain log messages from (in our case) distinct namespaces only, yet they are tagged with the wrong X-Scope-OrgID header.
We have configured Tenant_id_key to customer, and the request looks like this:
POST /loki/api/v1/push HTTP/1.1
Host: loki-gateway.logging.svc:80
Content-Length: 944
User-Agent: Fluent-Bit
Content-Type: application/json
X-Scope-OrgID: customer1
Connection: keep-alive
{"streams":[{"stream":{"job":"fluent-bit","node":"worker11","namespace":"customer2-namespace"},"values":[["1715604287815933056","{\"stream\":\"stdout\",\"logtag\":\"F\",\"message\":\"INFO trace_generator_slow - generate_slow_traces - SlowOperation created in customer2-namespace - Trace ID: 95f6274c96752cf944547abda39e24b1, Span ID: 5eaddc539554b1ec - 13/05/2024 Monday 12:44:47\",\"namespace_name\":\"customer2-namespace\",\"host\":\"worker11\",\"kubernetes\":{\"pod_name\":\"trace-generator-slow-job-swzcs\",\"namespace_name\":\"customer2-namespace\",\"container_name\":\"trace-generator-slow\",\"labels\":{\"batch.kubernetes.io/controller-uid\":\"f935d0ea-6e29-4be8-a379-0126cb2d2b8d\",\"batch.kubernetes.io/job-name\":\"trace-generator-slow-job\",\"controller-uid\":\"f935d0ea-6e29-4be8-a379-0126cb2d2b8d\",\"job-name\":\"trace-generator-slow-job\"}},\"customer\":\"customer2\",\"cluster\":\"cluster-name\"}"]]}]}
If this is a separate topic, I'll file an additional issue.
Update: we recompiled Fluent Bit with the FLB_OUTPUT_SYNCHRONOUS flag set in out_loki.c. The issue still persists. Either the cause is something else, or we misinterpreted the meaning of the flag.
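A possible workaround until this is resolved, sketched under the assumption that the set of tenants is small and that the tag already encodes the tenant (e.g. via a rewrite_tag rule like the one shown earlier; the tenant names here are just examples): drop tenant_id_key and configure one Loki output per tenant with a static tenant_id, so no dynamic tenant id is shared at all:

[OUTPUT]
name loki
match tenant.infra.*
tenant_id infra
# remaining settings identical across the per-tenant outputs

[OUTPUT]
name loki
match tenant.apps.*
tenant_id apps
# remaining settings identical across the per-tenant outputs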
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
Still relevant.