Tuners may not be fully enabled on k8s clusters deployed via Helm
What happened?
We're investigating a cpu imbalance on a large EKS-based cluster deployed via helm. The Perf team indicates that the cpu imbalance is likely related to tuners not getting enabled on this cluster. The cluster is using our defaults to enable tuners. We dug through the tuning container and saw the following, which suggests only the aio tuner is being applied.
```
$ kubectl logs redpanda-0 -c tuning
TUNER                  APPLIED  ENABLED  SUPPORTED  ERROR
aio_events             true     true     true
ballast_file           false    false    true
clocksource            false    false    false      Clocksource setting not available for this architecture
coredump               false    false    true
cpu                    false    false    true
disk_irq               false    false    true
disk_nomerges          false    false    false      Directory '' does not exists
disk_scheduler         false    false    false      Directory '' does not exists
disk_write_cache       false    false    false      Directory '' does not exists
fstrim                 false    false    false      dial unix /run/systemd/private: connect: no such file or directory
net                    false    false    true
swappiness             false    false    true
transparent_hugepages  false    false    true
```
Looking through the statefulset, it appears that we fire off `rpk redpanda tune all` in a privileged container if `tune_aio_events` is true (which it is by default).
https://github.com/redpanda-data/helm-charts/blob/9261d130d1a486526f5c2c0437c11d03b91ab43d/charts/redpanda/templates/statefulset.yaml#L69-L91
But `rpk redpanda tune all` requires some configuration hints in `redpanda.yaml` to know which tuners to actually apply, and it doesn't seem like we actually place those hints into the config, so `tune all` ends up doing very little.
Confusingly, the following values.yaml suggests that most of these tunings are not valid in containerized environments, but I can find no history or indication of why this is true (other than that it was asserted sometime in 2022). The only thing I can assume at this point is that the tuners that manipulate sysctls were shown to work at some point in the past, but that the ones manipulating files in /sys (which covers most of the interrupt/cpu tuning) may NOT have worked. Discussion with the Perf team suggests that we need these tuners in any environment (k8s or deployed to the OS) no matter what, and discussion with @c4milo suggests that we have this working in the operator used for cloud/BYOC.
https://github.com/redpanda-data/helm-charts/blob/b5469209ffa05ab8050d260fc685365b899bc4f4/charts/redpanda/values.yaml#L794-L834
The hints that would be needed in the `redpanda.yaml`:

```yaml
rpk:
  tune_network: true
  tune_disk_scheduler: true
  tune_disk_nomerges: true
  tune_disk_write_cache: true
  tune_disk_irq: true
  tune_cpu: true
  tune_aio_events: true
  tune_clocksource: true
  tune_swappiness: true
  coredump_dir: /var/lib/redpanda/coredump
  tune_ballast_file: true
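```

For comparison, the chart does expose a `tuning` stanza in values.yaml; a sketch of enabling the same set from the chart side (this assumes each `tune_*` key gets forwarded into the rendered redpanda.yaml, which still needs verifying):

```yaml
# Hypothetical values.yaml snippet; assumes the chart copies the tuning
# stanza into the rpk: section of the generated redpanda.yaml.
tuning:
  tune_aio_events: true
  tune_cpu: true
  tune_disk_irq: true
  tune_network: true
```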
What did you expect to happen?
I expect the tuners to be configurable in the values.yaml and have those tunables get applied when the tuning container runs.
How can we reproduce it (as minimally and precisely as possible)? Please include a values file.
Can provide values example separately, but generally should occur with any default one we have at this point.
Anything else we need to know?
See also https://redpandadata.slack.com/archives/C01H6JRQX1S/p1706816485871719
See also interrupt channel 464 and more specifically this thread https://redpandadata.slack.com/archives/C06E573MBGE/p1706804403275249
Which are the affected charts?
Redpanda
Chart Version(s)
Problem occurs in 5.6.60 and whatever latest was as of Feb 1, 2024.
Cloud provider
Self-hosted on AWS EKS, but likely affects any helm-managed k8s install.
JIRA Link: K8S-101
Note: if the expectation is instead that we should be applying tuner configs on the base OS of the k8s node, then we need to figure out how to get `rpk redpanda tune all --output-script` to work correctly, given that we still don't have enough hints (afaict) in `redpanda.yaml` to make each tuner generate its script snippet.
We also need to figure out a way to get that deployed to the underlying nodes and run on every system startup (so also generating some sort of systemd unit files or something else on startup).
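A minimal sketch of what such a unit could look like, assuming the script produced by `--output-script` gets staged at /opt/redpanda/tune.sh on each node (the path and unit name here are hypothetical, not anything we ship today):

```ini
# Hypothetical redpanda-node-tuner.service; path and name are assumptions.
[Unit]
Description=Apply Redpanda node tuning at boot
After=network.target

[Service]
Type=oneshot
ExecStart=/opt/redpanda/tune.sh
RemainAfterExit=true

[Install]
WantedBy=multi-user.target
```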
The way base-OS installs get away with this is that the redpanda rpm/deb packages deploy a systemd unit file for redpanda-tuner.service that runs `rpk redpanda tune all`, but by that point we have enough populated in `redpanda.yaml` for the tuner configs to fire off.
We probably cannot tell users to also install redpanda rpm/deb on base system as that defeats the purpose of having separate OS and K8s installs.
Relevant docs when work begins: https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/
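Per those docs, namespaced sysctls can be set from a pod spec without touching the node, which might cover part of the net tuning; node-level sysctls such as vm.swappiness are not namespaced and cannot be set this way at all, which is part of why a node-level mechanism is still needed. A sketch:

```yaml
# Sketch only: net.core.somaxconn is namespaced but classed "unsafe",
# so the kubelet must allow it via --allowed-unsafe-sysctls=net.core.somaxconn.
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
      - name: net.core.somaxconn
        value: "1024"
  containers:
    - name: redpanda
      image: redpandadata/redpanda
```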
Following up from our slack thread.
Tuners can be enabled/configured via the `tuning` stanza provided that `tune_aio_events` is true.
Here we see that `tuning.tune_cpu=true` results in cpu's APPLIED value being set to true:
```
❯ helm install redpanda redpanda/redpanda --create-namespace --version 5.6.60 --set 'tuning.tune_cpu=true'
❯ kubectl --namespace default logs redpanda-0 -c tuning
TUNER                  APPLIED  ENABLED  SUPPORTED  ERROR
aio_events             true     true     true
ballast_file           false    false    true
clocksource            false    false    false      Clocksource setting not available for this architecture
coredump               false    false    true
cpu                    true     true     true
disk_irq               false    false    true
disk_nomerges          false    false    false      Directory '' does not exists
disk_scheduler         false    false    false      Directory '' does not exists
disk_write_cache       false    false    false      Directory '' does not exists
fstrim                 false    false    false      dial unix /run/systemd/private: connect: no such file or directory
net                    false    false    true
swappiness             false    false    true
transparent_hugepages  false    false    true
```
Whether or not enabling these tuners actually does anything for Redpanda remains an open question.
@hcoyote What do you think the resolution of this ticket should be? Seems like the best option for now might be updating our documentation to further explain why the tuner doesn't work within Kubernetes and instead suggesting that users utilize cloud-init or similar? I wouldn't be opposed to removing the tuner entirely FWIW.
I don't know what the viable solution is right now.
I think we need input from @c4milo and probably @StephanDollberg at minimum. I think the assertion is that, for performance and supportability, we need to get tuners reliably and consistently applied no matter what the deployment methodology is (e.g., bare-OS/self-hosted k8s, cloud, etc.).
The workaround we have today for EKS is to do this via cloud-init. Camilo was working on some DaemonSet machinery to make this work in AKS, so maybe that's something we can pull back into helm?
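For discussion, a rough sketch of what a tuner DaemonSet could look like (image, mounts, and command here are all assumptions, not what the operator actually ships): a privileged pod per node that runs the tuners once and then idles.

```yaml
# Hypothetical manifest; names, image, and mounts are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: redpanda-node-tuner
spec:
  selector:
    matchLabels:
      app: redpanda-node-tuner
  template:
    metadata:
      labels:
        app: redpanda-node-tuner
    spec:
      hostPID: true
      containers:
        - name: tuner
          image: redpandadata/redpanda:latest
          securityContext:
            privileged: true
          command: ["/bin/sh", "-c"]
          args: ["rpk redpanda tune all && sleep infinity"]
          volumeMounts:
            - name: sys
              mountPath: /sys
      volumes:
        - name: sys
          hostPath:
            path: /sys
```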
Whatever we do for cloud is probably similar to what we should do for self-hosted k8s (on cloud at least). We still need to determine a suitable answer for self-hosted k8s on multi-tenant shared k8s infra.