
Bring KSM autoscaler addon up to standards

maxbrunet opened this issue 4 years ago · 4 comments

What is missing?

This is about refactoring ksm-autoscaler.libsonnet so that it follows the project's standards and integrates well with the KSM component, as well as adding more flexibility to its configuration.

Here are my enhancement suggestions:

  • Move $.values.clusterVerticalAutoscaler to $.values.kubeStateMetrics.autoscaler
  • Move $.ksmAutoscaler.<resource> to $.kubeStateMetrics.autoscaler<resource> (so that, when the addon is enabled, its resources are automatically picked up by the loop that deploys the KSM resources)
  • Add standard values: image, commonLabels, selectorLabels, resources...
  • Use ksm._config.name as prefix for resource names
  • Generate the configuration from a values field with std.manifestJsonEx(), allowing full flexibility (core or node steps, sidecar auto-scaling...); see the sketch after this list
  • Auto-scale kube-rbac-proxy-main
  • Unset ksm resources to avoid downscaling when applying manifest
  • Remove RBAC for removed extensions API group
  • Upgrade CPVPA image to latest version

Of course, many of these changes would be breaking, so I would bundle them all at once.
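
To make the std.manifestJsonEx() point concrete, here is a minimal sketch of how the autoscaler parameters could be rendered from a free-form values field. The parameter shape, the example values, and the --default-params flag name are modelled on the existing addon and cluster-proportional-vertical-autoscaler, and should be treated as illustrative rather than authoritative:

local defaults = {
  name:: 'kube-state-metrics',
  // Free-form CPVPA parameters; fields like coresPerStep or limits could be
  // added here without any addon change. Values below are examples only.
  config:: {
    'kube-state-metrics': {
      requests: {
        cpu: { base: '100m', step: '2m', nodesPerStep: 1 },
        memory: { base: '150Mi', step: '30Mi', nodesPerStep: 1 },
      },
    },
  },
};

{
  // std.manifestJsonEx(value, indent) serializes any Jsonnet object to JSON,
  // so the whole parameter tree stays user-configurable.
  autoscalerArgs:: [
    '--target=deployment/' + defaults.name,
    '--default-params=' + std.manifestJsonEx(defaults.config, ''),
  ],
}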

Why do we need it?

Provide a clean add-on experience

Anything else we need to know?

maxbrunet avatar Nov 24 '21 23:11 maxbrunet

I agree, the autoscaler deployment code is not in the right state and should be refactored. To make the UX even better, we would also need:

  • at least one simple example of how to use the autoscaler (in examples/).
  • a document in docs about autoscaler usage (when? why? how? etc.)

A few notes and suggestions on the proposal inline:

Move $.values.clusterVerticalAutoscaler to $.values.kubeStateMetrics.autoscaler

Move $.ksmAutoscaler.<resource> to $.kubeStateMetrics.autoscaler<resource>

Mixing addon settings with component settings is something that should be avoided in order to keep a clean separation of duties. If the autoscaler goes into $.values.kubeStateMetrics, it would also need to stop being an addon and move into https://github.com/prometheus-operator/kube-prometheus/blob/main/jsonnet/kube-prometheus/components/kube-state-metrics.libsonnet.

Add standard values: image, commonLabels, selectorLabels, resources...

Maybe we should instead convert the autoscaler into an intermediary component (like kube-rbac-proxy) and allow it to manage more than just KSM? We could potentially apply VPA to prometheus, if that even makes sense 😄

Upgrade CPVPA image to latest version

If the autoscaler were part of a component, we could start tracking its version in our auto-updater script - https://github.com/prometheus-operator/kube-prometheus/blob/main/scripts/generate-versions.sh


It looks to me like the autoscaler should be a top-level component and not just an addon. However, its relation to KSM is a bit similar to the one between prometheus-operator and prometheus, and it is not crucial to KSM operations. Because of this, I am not convinced the autoscaler should live in the KSM component, but I do think it should still be a component instead of an addon. With enough thought, we could make the autoscaler component generic enough to manage the resources of other components too (maybe blackbox_exporter, maybe prometheus, ...).

Potentially something like the following could be used to implement the autoscaler component and use it not only for KSM:

local defaults = {
  namespace:: error 'must provide namespace',
  image:: error 'must provide image',
  resources:: {
    requests: { cpu: '10m', memory: '20Mi' },
    limits: { cpu: '20m', memory: '40Mi' },
  },
  target: {
    // Namespace and object (e.g. 'deployment/kube-state-metrics') of the
    // workload the autoscaler manages; the caller must provide them.
    namespace: error 'must provide target namespace',
    object: error 'must provide target object',
  },
  config: {
    // VPA configuration for resources. Essentially
    // https://github.com/prometheus-operator/kube-prometheus/blob/main/jsonnet/kube-prometheus/addons/ksm-autoscaler.libsonnet#L94-L101.
    // It would be nice if the configuration fields were aligned with the VPA
    // type definitions for the config section.
  },
};
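
A hypothetical continuation of the sketch above, showing how those options could feed the component's container (the flag names mirror the existing addon and are illustrative, not verified against a specific CPVPA release):

function(params) {
  local cfg = defaults + params,

  container:: {
    name: 'autoscaler',
    image: cfg.image,
    resources: cfg.resources,
    args: [
      '--target=' + cfg.target.object,  // e.g. 'deployment/kube-state-metrics'
      '--namespace=' + cfg.target.namespace,
      '--default-params=' + std.manifestJsonEx(cfg.config, ''),
    ],
  },
}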

paulfantom avatar Nov 25 '21 10:11 paulfantom

Mixing addon settings with component settings is something that should be avoided in order to keep a clean separation of duties. If the autoscaler goes into $.values.kubeStateMetrics, it would also need to stop being an addon and move into main/jsonnet/kube-prometheus/components/kube-state-metrics.libsonnet.

In my view, the add-on can behave like a patch to a component. Most add-ons seem to be this way currently. I think I will follow my idea on this point when opening the PR, and if you still do not like it when it is all put together, moving it from one file to the other should not be hard.

However, its relation to KSM is a bit similar to the one between prometheus-operator and prometheus, and it is not crucial to KSM operations.

The autoscaler is more tightly coupled with KSM: it needs to know the deployment name and its namespace directly in its arguments.

Maybe we should instead convert the autoscaler into an intermediary component (like kube-rbac-proxy) and allow it to manage more than just KSM? We could potentially apply VPA to prometheus, if that even makes sense 😄

With enough thought, we could make the autoscaler component generic enough to manage the resources of other components too (maybe blackbox_exporter, maybe prometheus, ...)

Maybe as a later consideration? I do not use blackbox, and I do not like the idea of auto-scaling Prometheus: it could create downtime (e.g. large instances take time to replay the WAL on restart, and large pods could get stuck in the Pending state, especially since I do not auto-scale node groups with StatefulSets at the moment). I prefer having an alert on resource usage and then deciding whether to increase the resources or to add a new Prometheus pair/shard and move endpoints there. And yes, I am not sure scaling proportionally to the cluster size makes sense for these components either.

I can look into incorporating some documentation with the change. Please let me know if you are satisfied with the scope, and I'll open a tentative PR.

maxbrunet avatar Nov 25 '21 21:11 maxbrunet

Worth noting that KSM has an experimental feature for automated sharding, so I am thinking about renaming the addon to ksm-vertical-autoscaler.libsonnet, and we can later look into a ksm-horizontal-autoscaler addon with https://github.com/kubernetes-sigs/cluster-proportional-autoscaler. I will not worry about key collisions between the two since they should not be used together.

maxbrunet avatar Nov 26 '21 06:11 maxbrunet

In my view, the add-on can behave like a patch to a component.

Yes, that is what we aim for, but the patch cannot implement new configuration fields. This proposal instead mixes the configuration parameters of an addon and a component. Such an approach has already proven to cause problems for other users.

Additionally, since it conditionally adds parameters, it prevents us from implementing a stable interface for passing configuration (WIP in https://github.com/prometheus-operator/kube-prometheus/pull/1477).

I do not use blackbox

You may not use it, but that doesn't mean the project is not shipping it. As such, we need to consider the broader picture and not only partial solutions.

I do not like the idea of auto-scaling Prometheus: it could create downtime (e.g. large instances take time to replay the WAL on restart, and large pods could get stuck in the Pending state

Changing the resources of a StatefulSet shouldn't cause downtime for the whole prometheus service, and you should have at least one replica running at all times. Plus, the Prometheus StatefulSet implements startupProbes, which should prevent the second pod from going down while the first is still replaying the WAL. These resiliency details should allow us to safely implement VPA for prometheus, and it would be useful with growing cluster size (by default in kube-prometheus, prometheus memory consumption is directly related to the number of pods and nodes).

And if something from what I described doesn't work as mentioned, it is a bug in prometheus-operator (or possibly in the kubernetes StatefulSet controller).
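
For reference, the startupProbe mechanism mentioned above looks roughly like the following on the prometheus container; the thresholds here are illustrative, not prometheus-operator's exact defaults:

{
  // Illustrative startupProbe: the pod is not considered started (and a
  // rolling update does not proceed) until Prometheus reports ready, which
  // happens only after WAL replay finishes.
  startupProbe: {
    httpGet: { path: '/-/ready', port: 'web' },
    periodSeconds: 15,
    failureThreshold: 60,  // tolerate a long WAL replay before giving up
  },
}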

The autoscaler is more tightly coupled with KSM: it needs to know the deployment name and its namespace directly in its arguments.

Worth noting that KSM has an experimental feature for automated sharding, so I am thinking about renaming the addon to ksm-vertical-autoscaler.libsonnet, and we can later look into a ksm-horizontal-autoscaler

In the case of HPA we do have tight coupling, as KSM needs to be configured differently for HPA to work (KSM needs to be deployed as a StatefulSet and needs to have 2 CLI flags set). However, in the case of VPA we don't need to do anything with the KSM Deployment; we only need to tell the VPA which target it is managing.
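
For context, that KSM-side coupling is described in the kube-state-metrics documentation on automated sharding: each pod discovers its own shard via two CLI flags fed from the downward API. A minimal sketch of the container patch:

{
  // Patch for the kube-state-metrics container when automated sharding is
  // enabled; KSM must also run as a StatefulSet for this to work.
  shardingPatch:: {
    env+: [
      { name: 'POD_NAME', valueFrom: { fieldRef: { fieldPath: 'metadata.name' } } },
      { name: 'POD_NAMESPACE', valueFrom: { fieldRef: { fieldPath: 'metadata.namespace' } } },
    ],
    args+: [
      '--pod=$(POD_NAME)',
      '--pod-namespace=$(POD_NAMESPACE)',
    ],
  },
}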

As such, the VPA implementation is NOT coupled with KSM (it is coupled with kubernetes), and I don't see any reason to keep it as an addon when we can clearly create a component out of it and use it for other purposes as well. Bear in mind that the managed deployment name and namespace can easily be passed as part of the component options (the target section in the example from https://github.com/prometheus-operator/kube-prometheus/issues/1524#issuecomment-979059627).
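
A hypothetical invocation of such a generic component, with the managed object passed through the target option; the import path and image tag are placeholders, not existing artifacts:

// Placeholder import path and image tag, for illustration only.
local autoscaler = import 'kube-prometheus/components/autoscaler.libsonnet';

autoscaler({
  namespace: 'monitoring',
  image: 'registry.k8s.io/cpvpa-amd64:v0.8.4',  // placeholder
  target: {
    namespace: 'monitoring',
    object: 'deployment/kube-state-metrics',
  },
})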

In the case of HPA, I would require it to be part of the KSM component due to the different provisioning mechanisms depending on whether the sharding feature is enabled.

paulfantom avatar Nov 28 '21 18:11 paulfantom