Add missing Fluentd input metric to fix empty panels
[!warning] This is a public repository, ensure not to disclose:
- [x] personal data beyond what is necessary for interacting with this pull request, nor
- [x] business confidential information, such as customer names.
What kind of PR is this?
Required: Mark one of the following that is applicable:
- [ ] kind/feature
- [ ] kind/improvement
- [ ] kind/deprecation
- [ ] kind/documentation
- [ ] kind/clean-up
- [ ] kind/bug
- [x] kind/other
Optional: Mark one or more of the following that are applicable:
[!important] Breaking changes should be marked with `kind/admin-change` or `kind/dev-change` depending on type. Critical security fixes should be marked with `kind/security`.
- [ ] kind/admin-change
- [ ] kind/dev-change
- [ ] kind/security
- [ ] kind/adr
What does this PR do / why do we need this PR?
While working on PR https://github.com/elastisys/compliantkubernetes-apps/pull/2242, I noticed we had panels in the fluentd dashboard that used a metric we did not seem to expose. I found an example fluentd configuration to expose this metric here.
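For reference, this is roughly the shape such a configuration takes with fluent-plugin-prometheus; a minimal sketch based on the plugin's documented counter example, not necessarily the exact snippet from the linked page:

```
# Count incoming records per tag via the prometheus filter plugin
# (fluent-plugin-prometheus). Sketch only; the actual config in this
# PR may differ.
<filter **>
  @type prometheus
  <metric>
    name fluentd_input_status_num_records_total
    type counter
    desc The total number of incoming records
    <labels>
      tag ${tag}
      hostname ${hostname}
    </labels>
  </metric>
</filter>
```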
Information to reviewers
As mentioned in https://grafana.com/grafana/dashboards/13042-fluentd-1-x/:

> Input filter by tag can produce insane amount of labels for metric

Hence `tag ${tag_parts[0]}` is used, which reduced the number of labels for this metric quite significantly in my dev environment. This produces metrics for the tags seen below:
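(The screenshot is not reproduced here; for illustration, the exposed series end up with one `tag` label value per top-level tag, roughly like the sample below. The hostname and counter values are made up.)

```
fluentd_input_status_num_records_total{hostname="example-node",tag="kubernetes"} 123456
fluentd_input_status_num_records_total{hostname="example-node",tag="kubeaudit"} 7890
fluentd_input_status_num_records_total{hostname="example-node",tag="authlog"} 321
fluentd_input_status_num_records_total{hostname="example-node",tag="kernel"} 45
```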
Checklist
- [x] Proper commit message prefix on all commits
- Change checks:
- [x] The change is transparent
- [ ] The change is disruptive
- [ ] The change requires no migration steps
- [ ] The change requires migration steps
- [ ] The change upgrades CRDs
- [ ] The change updates the config and the schema
- Metrics checks:
- [ ] The metrics are still exposed and present in Grafana after the change
- [ ] The metrics names didn't change (Grafana dashboards and Prometheus alerts are not affected)
- [ ] The metrics names did change (Grafana dashboards and Prometheus alerts were fixed)
- Logs checks:
- [ ] The logs do not show any errors after the change
- Pod Security Policy checks:
- [ ] Any changed pod is covered by Pod Security Admission
- [ ] Any changed pod is covered by Gatekeeper Pod Security Policies
- [ ] The change does not cause any pods to be blocked by Pod Security Admission or Policies
- Network Policy checks:
- [ ] Any changed pod is covered by Network Policies
- [ ] The change does not cause any dropped packets in the NetworkPolicy Dashboard
- Audit checks:
- [ ] The change does not cause any unnecessary Kubernetes audit events
- [ ] The change requires changes to Kubernetes audit policy
- Falco checks:
- [ ] The change does not cause any alerts to be generated by Falco
- Bug checks:
- [ ] The bug fix is covered by regression tests
What values does `tag_parts[0]` take? Is it `kubernetes`, `kubeaudit`, `other`, `authlog`? I see `kernel` and `kubelet` from the screenshot you shared.
Did this increase the fluentd resource usage by any significant amount?
> What values does `tag_parts[0]` take? Is it `kubernetes`, `kubeaudit`, `other`, `authlog`? I see `kernel` and `kubelet` from the screenshot you shared.
It seems like it gets `kubernetes`, `kubeaudit`, `authlog`, and then some `other` as you saw in the image. Maybe we want to filter only on our normal indices instead of `.**`?
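Something like this could work for that; a sketch assuming the tag list mentioned in this thread, using fluentd's `{a,b}` match-pattern alternation:

```
# Sketch: count records only for our known top-level tags instead of
# matching everything. The tag list here is an assumption based on
# this thread.
<filter {kubernetes,kubeaudit,authlog,other}.**>
  @type prometheus
  <metric>
    name fluentd_input_status_num_records_total
    type counter
    desc The total number of incoming records
    <labels>
      tag ${tag_parts[0]}
    </labels>
  </metric>
</filter>
```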
> Did this increase the fluentd resource usage by any significant amount?
Regarding resource usage, I first deployed it in a cluster with Calico running a version affected by this issue, which generates a ton of Calico error logs and did cause quite high CPU load for some of the forwarder pods:
Memory seemed to be about the same. After changing to a fixed patch version of Calico, the CPU usage was pretty much the same as before adding this change:
It could also be seen in how the input entries went down:
So the input metric was another indicator of the Calico issue. It also increased the fluentd forwarders' CPU usage quite a lot, but that might have been because Calico was using far more CPU on the nodes than it should.
What will the behaviour be with index per namespace? Does it retain `kubernetes` as the tag, or will it generate one tag per namespace?
> What will the behaviour be with index per namespace? Does it retain `kubernetes` as the tag, or will it generate one tag per namespace?
Just tested enabling `indexPerNamespace` in my cluster and the tags seem to stay the same, e.g. `kubernetes`, `kubeaudit`, etc.
How about changing the title of the PR to something along the lines of:

> Expose number of incoming records per tag in fluentd via the metric `fluentd_input_status_num_records_total`

I don't know, but I see this more as an addition rather than a fix :sweat_smile: