Add missing Fluentd input metric to fix empty panels
[!warning] This is a public repository, ensure not to disclose:
- [x] personal data beyond what is necessary for interacting with this pull request, nor
- [x] business confidential information, such as customer names.
What kind of PR is this?
Required: Mark one of the following that is applicable:
- [ ] kind/feature
- [ ] kind/improvement
- [ ] kind/deprecation
- [ ] kind/documentation
- [ ] kind/clean-up
- [ ] kind/bug
- [x] kind/other
Optional: Mark one or more of the following that are applicable:
[!important] Breaking changes should be marked with `kind/admin-change` or `kind/dev-change` depending on type. Critical security fixes should be marked with `kind/security`.
- [ ] kind/admin-change
- [ ] kind/dev-change
- [ ] kind/security
- [ ] kind/adr
What does this PR do / why do we need this PR?
While working on PR https://github.com/elastisys/compliantkubernetes-apps/pull/2242, I noticed we had panels in the fluentd dashboard that used a metric we did not seem to expose. I found an example fluentd configuration to expose this metric here.
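For reference, this is roughly the shape such a configuration takes with fluent-plugin-prometheus; a minimal sketch based on the plugin's documented counter example, not necessarily the exact snippet from the linked page:

```
# Count incoming records per tag via the prometheus filter plugin
# (fluent-plugin-prometheus). Sketch only; the actual config in this
# PR may differ.
<filter **>
  @type prometheus
  <metric>
    name fluentd_input_status_num_records_total
    type counter
    desc The total number of incoming records
    <labels>
      tag ${tag}
      hostname ${hostname}
    </labels>
  </metric>
</filter>
```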
Information to reviewers
As mentioned in https://grafana.com/grafana/dashboards/13042-fluentd-1-x/:

> Input filter by tag can produce insane amount of labels for metric

Hence `tag ${tag_parts[0]}` is used, which reduced the number of labels for this metric quite significantly in my dev environment. This produces metrics for the tags seen below:
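(The screenshot is not reproduced here; for illustration, the exposed series end up with one `tag` label value per top-level tag, roughly like the sample below. The hostname and counter values are made up.)

```
fluentd_input_status_num_records_total{hostname="example-node",tag="kubernetes"} 123456
fluentd_input_status_num_records_total{hostname="example-node",tag="kubeaudit"} 7890
fluentd_input_status_num_records_total{hostname="example-node",tag="authlog"} 321
fluentd_input_status_num_records_total{hostname="example-node",tag="kernel"} 45
```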
Checklist
- [x] Proper commit message prefix on all commits
- Change checks:
- [x] The change is transparent
- [ ] The change is disruptive
- [ ] The change requires no migration steps
- [ ] The change requires migration steps
- [ ] The change upgrades CRDs
- [ ] The change updates the config and the schema
- Metrics checks:
- [ ] The metrics are still exposed and present in Grafana after the change
- [ ] The metrics names didn't change (Grafana dashboards and Prometheus alerts are not affected)
- [ ] The metrics names did change (Grafana dashboards and Prometheus alerts were fixed)
- Logs checks:
- [ ] The logs do not show any errors after the change
- Pod Security Policy checks:
- [ ] Any changed pod is covered by Pod Security Admission
- [ ] Any changed pod is covered by Gatekeeper Pod Security Policies
- [ ] The change does not cause any pods to be blocked by Pod Security Admission or Policies
- Network Policy checks:
- [ ] Any changed pod is covered by Network Policies
- [ ] The change does not cause any dropped packets in the NetworkPolicy Dashboard
- Audit checks:
- [ ] The change does not cause any unnecessary Kubernetes audit events
- [ ] The change requires changes to Kubernetes audit policy
- Falco checks:
- [ ] The change does not cause any alerts to be generated by Falco
- Bug checks:
- [ ] The bug fix is covered by regression tests
What values does `tag_parts[0]` take? Is it `kubernetes`, `kubeaudit`, `other`, `authlog`? I see `kernel` and `kubelet` from the screenshot you shared.
Did this increase the fluentd resource usage by any significant amount?
> What values does `tag_parts[0]` take? Is it `kubernetes`, `kubeaudit`, `other`, `authlog`? I see `kernel` and `kubelet` from the screenshot you shared.
It seems like it gets `kubernetes`, `kubeaudit`, `authlog`, and then some `other` as you saw in the image. Maybe we want to filter only on our normal indices instead of `.**`?
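Something like this could work for that; a sketch assuming the tag list mentioned in this thread, using fluentd's `{a,b}` match-pattern alternation:

```
# Sketch: count records only for our known top-level tags instead of
# matching everything. The tag list here is an assumption based on
# this thread.
<filter {kubernetes,kubeaudit,authlog,other}.**>
  @type prometheus
  <metric>
    name fluentd_input_status_num_records_total
    type counter
    desc The total number of incoming records
    <labels>
      tag ${tag_parts[0]}
    </labels>
  </metric>
</filter>
```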
> Did this increase the fluentd resource usage by any significant amount?
Regarding resource usage, I first deployed it in a cluster with Calico running a version affected by this issue, which generates a ton of Calico error logs and did cause quite high CPU load for some of the forwarder pods:
Memory seemed to be about the same. After changing to a fixed patch version of Calico, the CPU usage was pretty much the same as before adding this change:
It could also be seen in how the input entries went down:
So the input metric was another indicator of the Calico issue. It also increased the fluentd forwarders' CPU usage quite a lot, but that might have been because Calico was using far more CPU on the nodes than it should.
What will the behaviour be with index per namespace? Does it retain `kubernetes` as the tag, or will it generate one tag per namespace?
> What will the behaviour be with index per namespace? Does it retain `kubernetes` as the tag, or will it generate one tag per namespace?
Just tested enabling `indexPerNamespace` in my cluster and the tags seem to stay the same, e.g. `kubernetes`, `kubeaudit`, etc.
How about changing the title of the PR to something along the lines of:

> Expose number of incoming records per tag in fluentd via the metric `fluentd_input_status_num_records_total`

I don't know, but I see this more as an addition rather than a fix :sweat_smile: