fluent-bit
Throttle plugin does not filter k8s containers properly
When fluent-bit is deployed as a DaemonSet within a k8s cluster, throttling per container is a must due to the multi-tenant nature of k8s.
Given the following config:
[FILTER]
    Name                 kubernetes
    Alias                filter-kubernetes
    Match                kubernetes.*
    Kube_URL             https://kubernetes.default.svc:443
    Kube_CA_File         /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File      /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_Tag_Prefix      kubernetes.var.log.containers.
    Merge_Log            Off
    K8S-Logging.Parser   On
    K8S-Logging.Exclude  On
    Kube_Meta_Cache_TTL  10

[FILTER]
    Name          throttle
    Match         kubernetes.pod_name
    Alias         throttle-kubernetes
    Rate          1000
    Window        300
    Interval      1s
    Print_Status  true
With the above config it seems that fluent-bit throttles the node's container logs collectively: even if only one container's logs reach the threshold, all containers get throttled.
I would expect only the logs of the container which breached the threshold to be throttled.
Do you mean for the pod there? You seem to be matching on pod_name, so it should match the logs for that pod (assuming pod_name is the actual name substituted in your config).
The filter throttles based on the match it makes, so if you match 10 different logs to it, the assumption is that you want to throttle based on the aggregated rate of those 10; similarly, if you route one log to it, you throttle based on that single rate. I think this is the intended behaviour: you may want to throttle prior to an output, for example, so you match for that output.
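For example, if you already know which workload you want to limit, something like the following should work today. This is only a minimal sketch: the Match pattern assumes the tail input tags records as kubernetes.var.log.containers.<log file name> (consistent with the Kube_Tag_Prefix above), and noisy-app is a placeholder pod name.

[FILTER]
    # Placeholder: throttle only records whose tag starts with the noisy pod's
    # log file name; everything else bypasses this filter untouched.
    Name          throttle
    Alias         throttle-noisy-app
    Match         kubernetes.var.log.containers.noisy-app*
    Rate          1000
    Window        300
    Interval      1s
    Print_Status  true

The obvious drawback is that you need one such filter per workload, and you have to know the pod names up front.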
I think the request here is that you want to trigger the throttle per key value rather than per match? So you can have one filter with a trigger that is dynamic, e.g. you do not know the container names beforehand, and when it sees a new key it applies throttling per key. This seems like a useful feature, but I do not believe it is what the filter currently implements, so feel free to contribute a PR for that functionality.
I have the same problem very often on OpenShift clusters, with a lot of containers running on each node for different customers. It easily happens that an application runs into errors and produces a continuous 5000 msg/s, and that can badly affect reading and forwarding logs for the other containers. I had a look at the throttle filter, but in my opinion the best way to keep the load under control would be an option in the tail input to throttle the log messages per input file (container log file).
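A rough way to approximate that with the existing plugins (just a sketch under assumptions: the path glob, tag and alias below are placeholders, and the main tail input would need a matching Exclude_Path so the same files are not read twice) is to give the known-noisy log files their own tail input and tag, then throttle only that tag:

[INPUT]
    Name          tail
    Alias         tail-noisy-app
    # Placeholder glob for the noisy workload's container log files.
    Path          /var/log/containers/noisy-app-*_customer-a_*.log
    Tag           noisy.*

[FILTER]
    Name          throttle
    Match         noisy.*
    Rate          1000
    Window        300
    Interval      1s
    Print_Status  true

It is still not dynamic, but it keeps one customer's error storm from rate limiting everyone else's logs.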
> I think the request here is that you want to trigger the throttle per key value rather than per match? So you can have one filter with a trigger that is dynamic, e.g. you do not know the container names beforehand, and when it sees a new key it applies throttling per key. This seems like a useful feature, but I do not believe it is what the filter currently implements, so feel free to contribute a PR for that functionality.
@patrick-stephens yes, that's exactly what I mean :) If it's not possible at the moment, how are people currently using fluent-bit in a multi-tenant, dynamic environment such as k8s? Without some sort of dynamic rate limiting, one component can easily affect other components within the cluster (and this is true for the entire operations stack).
@dudicoco - after digging in the sources - there is an undocumented plugin, throttle_size, that can throttle per custom field: https://github.com/fluent/fluent-bit/commit/07a3cbd64c08faf50c52125b5b020bb1354917ae @patrick-stephens: is there any reason why this plugin isn't in the docs?
Thanks @novegit!
The plugin is currently disabled by default, see https://github.com/fluent/fluent-bit/commit/305a390f4684619fa3fda9e535ded6b48c6976da.
It seems that in order to use the plugin you would need to build fluent-bit yourself: https://docs.fluentbit.io/manual/installation/sources/build-and-install#build-options.
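For reference, building with the plugin enabled would look roughly like this. I am assuming the CMake option is named FLB_FILTER_THROTTLE_SIZE, following fluent-bit's usual FLB_FILTER_<NAME> convention, so please verify the exact flag against the build options page linked above.

# Assumption: the throttle_size filter is gated behind a CMake option named
# FLB_FILTER_THROTTLE_SIZE (fluent-bit's usual FLB_FILTER_<NAME> convention).
git clone https://github.com/fluent/fluent-bit.git
cd fluent-bit/build
cmake -DFLB_FILTER_THROTTLE_SIZE=On ..
make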
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.