multicluster-observability-operator
[ACM-8879]: Add initial implementation for T-shirt sizing ACM o11y resources
This PR adds ReadTShirtSize and WriteTShirtSize configuration options on the top-level MulticlusterObservabilitySpec, which allow the user to set their read-path and write-path components to one of the following sizes based on their needs: small, medium, large, xlarge, 2xlarge, 4xlarge, 8xlarge.
We split the t-shirt sizing option into two because users may have hubs that are write-heavy or read-heavy, so each path can be sized for efficient resource utilization.
This PR also significantly refactors the resource configuration functions to make them more maintainable and clearer. The main files to review are operators/multiclusterobservability/pkg/config/resources.go and operators/multiclusterobservability/pkg/config/resources_map.go.
For now, all the t-shirt sizes are set to the same resource values, as we still need to benchmark and set expectations for each size.
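For reviewers skimming the description, here is a minimal sketch of what the API surface could look like; the package name, type name, markers, and JSON tags below are illustrative assumptions, not necessarily the exact code in this PR:

```go
// Package and type names below are illustrative only.
package observability

// TShirtSize names a point on a linear scale of resource footprints.
// +kubebuilder:validation:Enum=small;medium;large;xlarge;2xlarge;4xlarge;8xlarge
type TShirtSize string

// MulticlusterObservabilitySpec (fragment) gains two optional, independent knobs:
// one for the read path and one for the write path.
type MulticlusterObservabilitySpec struct {
	// ReadTShirtSize sizes the query-path components.
	// +optional
	ReadTShirtSize TShirtSize `json:"readTShirtSize,omitempty"`
	// WriteTShirtSize sizes the ingest-path components.
	// +optional
	WriteTShirtSize TShirtSize `json:"writeTShirtSize,omitempty"`
	// ... existing fields such as AdvancedConfig stay as they are.
}
```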
https://issues.redhat.com/browse/ACM-8879
I'm wondering - does it make sense to call these 'TShirtSizes'? Perhaps something like ConfigProfile might translate better? Or InstanceSizes or InstanceTypes, e.g. like how you'd read a list of AWS EC2 instance types.
@berenss I don't mind changing the naming. T-shirt sizes seemed more natural, as they give a notion of a linear scale without having to define anything specific; we see the same convention for PR sizes, for example.
But yes we can go with something like instanceSize as well. Curious to know what others think :)
@thibaultmg yup, I think so too. We will probably need to write some tests for this and also test upgrade/update scenarios.
@saswatamcode, I am setting aside the naming for now. I am sure this will be discussed elsewhere. A few questions:
- How am I as a user supposed to know which value to choose?
- What is done to my system after I choose a value?
- Any reason why the choices seem tied to AWS EC2 instance sizes, or am I reading too much into it?
Since this is optional, I guess when we upgrade the MCO from a version where this does not exist, we will not have any issues.
@bjoydeep to answer your questions,
How am I as a user supposed to know which value to choose?
I believe we will be documenting this for end users once the feature is ready to be merged. We would probably link each size value to an ingest and query benchmark (in terms of samples per second, head series, QPS, and samples touched per query). Users can then make a choice based on their current values (in case they want to scale up/down). We can probably document the PromQL queries for measuring these as well.
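As an illustration of what such documented queries could look like (the metric names below follow the usual Prometheus TSDB and Thanos HTTP instrumentation metrics, but the job/handler label matchers are assumptions that would need adjusting to the actual hub deployment):

```go
package sizing

// Illustrative PromQL for gauging current load before choosing a size.
const (
	// Write path: samples ingested per second across the receive replicas.
	ingestSamplesPerSecond = `sum(rate(prometheus_tsdb_head_samples_appended_total{job=~".*receive.*"}[5m]))`
	// Write path: active head series currently held by the receive replicas.
	ingestHeadSeries = `sum(prometheus_tsdb_head_series{job=~".*receive.*"})`
	// Read path: query requests per second hitting the queriers.
	queryQPS = `sum(rate(http_requests_total{job=~".*query.*",handler=~"query|query_range"}[5m]))`
)
```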
What is done to my system after I choose a value?
The observability resources are scaled up/down. This feature probably needs a check/alert, in case a size is picked that the current cluster won't have the resources to support.
Any reason why the choices seem tied to AWS EC2 instance sizes, or am I reading too much into it?
These are just t-shirt sizes; I didn't really think of AWS EC2. We can rename them to something else as well, the naming matters less here. 🙂
Since this is optional, I guess when we upgrade the MCO from a version where this does not exist, we will not have any issues.
Yup, we can have a default size, which won't break user setups. Users using AdvancedConfig can continue doing so as well.
@saswatamcode Thank You !
Makes sense. I was also worried about "This feature probably needs a check/alert, in case a size is picked that the current cluster won't have the resources to support." Say we are scaling up a receiver with a PV: the system has to have the ability to provision a PV automatically, and not every system has that. Also, if more than one receiver is running on a node, the network may become a problem, etc.
Just thinking aloud here: what if the first baby step is a small utility that queries the system (ACM hub Prometheus) and just spits out a recommendation? And the customer may use that recommendation and be able to tweak the MCO to achieve that configuration?
What you have here is fine - but is there a baby step even before that? Just poking on that. Once things are in the CR, changing this is hard (supporting upgrade/backward compatibility etc., as you know, is not cheap).
@bjoydeep thanks for your comments!
Say we are scaling up a receiver with a PV: the system has to have the ability to provision a PV automatically, and not every system has that. Also, if more than one receiver is running on a node, the network may become a problem, etc.
This feature isn't autoscaling. With benchmarking, we would be able to provide customers with accurate estimates of what resources they need to have before switching to a larger InstanceSize. If the system can scale those automatically, that's great; if not, the customer can refer to the docs and scale on their own before moving up a size. The check/alert can be added as a precaution (similar to a panic in Go), i.e. it shouldn't happen unless something is very wrong.
what if the first baby step is a small utility that queries the system (ACM hub Prometheus) and just spits out a recommendation? And the customer may use that recommendation and be able to tweak the MCO to achieve that configuration?
So my thinking for this feature is that it would be fully backward compatible. We would have an InstanceSize default, which would be exactly the same as the current default. And it would be possible for the InstanceSize config to be completely overridden by AdvancedConfig. So all existing customers can continue to operate exactly as before.
IMO, the sizing recommendation tool is useful for something like upstream Thanos, where we don't know and have no control over how an adopter of Thanos chooses to set up. But for ACM, we are the ones controlling the exact architecture, so it makes sense for us to just provide a quick option that allows users to adjust their size without having to delve deep into Thanos internals. Also, I think we already had some sort of tool like this before in https://github.com/stolostron/capacity-planning, but my understanding is that this is not used as frequently by customers/support.
So if this feature is going to be backwards compatible, and can help the user size their setup comfortably by just changing a single field in the CRD, I'd argue this feature can be rolled out? And if a step 0 is really needed, we can improve on the existing capacity-planning tool and recommend that customers pick an InstanceSize to configure via that! 🙂
Wdyt?
hey @bjoydeep , @saswatamcode interesting discussion, just a few questions and comments from my side:
@bjoydeep
Say we are scaling up a receiver with a PV: the system has to have the ability to provision a PV automatically, and not every system has that.
Is this typical of our customer base that dynamic provisioning of storage is not available? I ask because this is an important point for any future decisions around HPA on the ingest/query path.
Also, if more than one receiver is running on a node, the network may become a problem, etc.
If ACM is allowing this situation to happen in any environment that is not a demo/test environment, then this is already an architectural failure IMHO. The purpose of distributing the hashring on ingest is to support data redundancy and fault tolerance/HA. If we are landing multiple receivers on a node and there is downtime on that node, either expected (cluster upgrade, ...) or unexpected (any node issues), then we have a situation where quorum will not be reached and we have significant downtime on the ingest path, as well as a likely problematic restoration of service. In such cases, if customers don't want or need these features, they would be better off running a large Prometheus and not Thanos in the first place.
what if the first baby step is a small utility that queries the system (ACM hub Prometheus) and just spits out a recommendation?
I might be missing something, but I am struggling to see the value in this. If they can already query a healthy hub to determine how they need to scale, then in fact they do not need to scale.
What you have here is fine - but is there a baby step even before that? Just poking on that. Once things are in the CR, changing this is hard
Totally agree with this. However, the usefulness of an operator is to ease the burden of day-0 and day-1 operations. If we can't enable that and need a customer -> support -> engineering loop each time we want to enable a customer, then we are fighting a hard battle.
@saswatamcode
So my thinking for this feature is that it would be fully backward compatible.
This needs to be key. Any change cannot disrupt an existing user or alter their configuration and request additional resources. If we make it opt-in, then I don't see a major problem?
Also, I think we already had some sort of tool like this before in https://github.com/stolostron/capacity-planning, but my understanding is that this is not used as frequently by customers/support.
I also wondered a bit about this. It seems like we already have a tool in place to do just what @bjoydeep suggested. I would be interested to hear how valuable (or otherwise) that was, but it still requires this support -> eng loop instead of just guiding and enabling customers to self-service.
If a customer has already performed some tuning (either by their own hand or with the help of RH consulting / RH support / RH engineering / the upstream community), will we be able to gracefully accept those tuned params when they 'opt in'? Or should they expect that the opt-in [re]configures their environment to some well-known hard facts that RH prescribed?
In other words, something to warn a user:
- hey we found some things are already configured
- be aware that this InstanceSize opt-in will blast that away
Even a doc note is sufficient if we cannot encode the smarts.
fantastic discussion - @philipgough @saswatamcode !
- I am guessing from above that till we can auto scale, we will not add this to the CR
- Yes, we can use something like https://github.com/bjoydeep/acm-inspector to collect the data and spit out a recommendation. In fact, we already collect some Thanos metrics - this needs more metrics - https://github.com/bjoydeep/acm-inspector/blob/main/src/supervisor/thanos.py#L22-L27 It would be easy to do some maths on this to give a recommendation. If you can describe the algorithm/rules in words, writing the code is easy. So the customer could run the inspector, look at the recommendation, and then use the MCO CR to tune the system. As you will see, we already gather node sizes, quantity, roles, etc., so even recommending where to launch these should be possible.
Does that make sense?
If ACM is allowing this situation to happen in any environment that is not a demo/test environment, then this is already an architectural failure IMHO. The purpose of distributing the hashring on ingest is to support data redundancy and fault tolerance/HA. If we are landing multiple receivers on a node and there is downtime on that node, either expected (cluster upgrade, ...) or unexpected (any node issues), then we have a situation where quorum will not be reached and we have significant downtime on the ingest path, as well as a likely problematic restoration of service. In such cases, if customers don't want or need these features, they would be better off running a large Prometheus and not Thanos in the first place.
@philipgough this is already what we have told {Customer} BTW. They are running multiple receivers on a node. Since they do not run dedicated hardware for observability and since we do not provide clear guidelines for HW configuration - which is the goal you guys are driving at - it becomes a little hard. Having said that, in 2.9 we have provided an OOTB alert to detect high network sync latency. We probably will need to add a few more alerts based on what you observed above - at least verify.
Is this typical of our customer base that dynamic provisioning of storage is not available?
@philipgough I really do not know. Anecdotally, I have seen both. But perhaps it is OK to say that you will not be able to use this feature unless you can provision storage dynamically. WDYT? @berenss do we have any visibility into this through telemetry or otherwise?
If a customer has already performed some tuning (either by their own hand or with the help of RH consulting / RH support / RH engineering / the upstream community), will we be able to gracefully accept those tuned params when they 'opt in'? Or should they expect that the opt-in [re]configures their environment to some well-known hard facts that RH prescribed?
@berenss this PR currently accounts for this. If the customer has performed some tuning, those values will be respected and treated as an "override" of the InstanceSize config, so they will continue to function. They will also be able to tune things separately and override the configs coming from the InstanceSize config later on.
For example, let's say a customer tuned their Ruler to 10Gi and decided to use the medium InstanceSize. The resulting configuration would be:
Medium t-shirt config (Ruler: 512Mi, Receive: 512Mi, Query: 2Gi) + user AdvancedConfig (Ruler: 10Gi) = final config (Ruler: 10Gi, Receive: 512Mi, Query: 2Gi).
We will likely warn the user that some things will change when they change size, but their custom configs will be propagated. 🙂
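To make the override behaviour concrete, here is a rough sketch of the merge logic being described; the helper and map names are illustrative, not the exact functions in resources.go, and the medium values are just the ones from the example above:

```go
package config

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// memoryRequest builds a small requirements fragment for the example.
func memoryRequest(mem string) corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{corev1.ResourceMemory: resource.MustParse(mem)},
	}
}

// mediumDefaults is a fragment of what a per-size defaults table could hold.
var mediumDefaults = map[string]corev1.ResourceRequirements{
	"rule":    memoryRequest("512Mi"),
	"receive": memoryRequest("512Mi"),
	"query":   memoryRequest("2Gi"),
}

// finalResources starts from the defaults of the chosen size and lets any
// explicit AdvancedConfig entry for a component win over them.
func finalResources(sizeDefaults, advanced map[string]corev1.ResourceRequirements) map[string]corev1.ResourceRequirements {
	out := make(map[string]corev1.ResourceRequirements, len(sizeDefaults))
	for component, res := range sizeDefaults {
		if override, ok := advanced[component]; ok {
			res = override // user-provided AdvancedConfig always wins
		}
		out[component] = res
	}
	return out
}
```

With `mediumDefaults` and an AdvancedConfig of `{"rule": memoryRequest("10Gi")}`, this yields Ruler: 10Gi, Receive: 512Mi, Query: 2Gi, matching the example.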
I am guessing from above that till we can auto scale, we will not add this to the CR
@bjoydeep I'm not sure what you mean here. AFAIU, some customers might not have cluster node/PV autoscaling. But why would that block this sizing feature? 🤔
We plan to document every InstanceSize and specify the minimum compute and disk needed to run the entirety of ACM o11y at that size, with a breakdown per component. So a customer can choose to scale their cluster manually before trying to move up a size. We will also warn the user (via UI/CLI) that the cluster needs to be scaled if it is not at the level we expect for a particular size.
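A hypothetical pre-check for that warning (not part of this PR) could compare the documented minimum for the chosen size against what the cluster can actually allocate; the function below is only a sketch of the idea:

```go
package config

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// hasMemoryCapacityFor reports whether the sum of allocatable memory across the
// hub's nodes covers the documented minimum for the requested size. A real check
// would also consider CPU, storage classes, and per-node scheduling constraints.
func hasMemoryCapacityFor(nodes []corev1.Node, requiredMemory resource.Quantity) bool {
	total := resource.NewQuantity(0, resource.BinarySI)
	for _, node := range nodes {
		if mem, ok := node.Status.Allocatable[corev1.ResourceMemory]; ok {
			total.Add(mem)
		}
	}
	return total.Cmp(requiredMemory) >= 0
}
```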
And this feature will be fully backward compatible, so this is strictly an opt-in (but recommended) feature.
So we don't have any dependency on auto-scale here as far as I can see. Feel free to correct me if I'm wrong somewhere! 🙂
Yes, we can use something like https://github.com/bjoydeep/acm-inspector to collect the data and spit out a recommendation. In fact, we already collect some Thanos metrics - this needs more metrics - https://github.com/bjoydeep/acm-inspector/blob/main/src/supervisor/thanos.py#L22-L27 It would be easy to do some maths on this to give a recommendation. If you can describe the algorithm/rules in words, writing the code is easy. So the customer could run the inspector, look at the recommendation, and then use the MCO CR to tune the system. As you will see, we already gather node sizes, quantity, roles, etc., so even recommending where to launch these should be possible.
Yup, we can use a recommender like this together with this feature as well! So a user could get a recommended size for their needs or for the cluster size they have.
But I really want to avoid the customer having to tune the system, by configuring MCO CR. Ideally, the customer really shouldn't need to. My objective with adding this t-shirt sizing feature is that the customer can just "set and forget" a size based on the needs they have (and scale their cluster beforehand if they need to). With this feature, the customer would not need to worry about sizing every single component on the query and ingest path, and would not need to be familiar with the intricacies of the components, Thanos or otherwise. And AdvancedConfig can aim to be an escape hatch here, in case the customer has some uncommon use case that needs special sizing.
Asking the customer to manually tune such a setup is, IMO, akin to me going to a garage to get my car fixed and the mechanic handing me the tools to work on the problem 😅.
This feature is also not a novel idea; the logging team has done similar work on Loki Operator t-shirt sizing and now ships it with great success.
But all these discussions make it clear that we need to decide on the semantics of how this feature will be used by customers, irrespective of how it works internally in MCO. So @bjoydeep, @berenss, @philipgough let's aim to discuss the semantics and guardrails needed for this feature in a call (maybe an arch call?) in Jan! 🙂 In the meantime, I will try to make progress with some benchmarks!
@bjoydeep
this is already what we have told {Customer} BTW. They are running multiple receivers on a node. Since they do not run dedicated hardware for observability and since we do not provide clear guidelines for HW configuration - which is the goal you guys are driving at - it becomes a little hard. Having said that, in 2.9 we have provided an OOTB alert to detect high network sync latency. We probably will need to add a few more alerts based on what you observed above - at least verify.
This is something we need to change. Like I said, this is what I consider a misconfiguration, and it makes things unreliable. Customers are always going to face challenges in scaling in that case because the foundations are broken. Alerts won't fix this.
But I really want to avoid the customer having to tune the system, by configuring MCO CR. Ideally, the customer really shouldn't need to. My objective with adding this t-shirt sizing feature is that the customer can just "set and forget" a size based on the needs they have (and scale their cluster beforehand if they need to).
This is how I view things also. We can integrate with the tool and produce some InstanceSize that they can configure in the CR.
Essentially we would, from the outset at least, need to know how many metrics (both custom and otherwise) will be shipped from each spoke and how many spokes they intend to attach to a hub. That will allow us to determine the required CPU and memory for the receivers at least, and we can make some educated guesses on the query path too, combining that with things like retention and query rate.
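As a purely illustrative back-of-envelope (the formula and numbers are assumptions, not a documented sizing rule), the arithmetic this implies could be as simple as:

```go
package sizing

// estimateIngestLoad derives a rough total-series count and samples-per-second
// figure for the write path from fleet-level inputs. Real sizing would map these
// numbers onto benchmarked per-size capacity tables.
func estimateIngestLoad(spokes, activeSeriesPerSpoke int, forwardIntervalSeconds float64) (totalSeries int, samplesPerSecond float64) {
	totalSeries = spokes * activeSeriesPerSpoke
	samplesPerSecond = float64(totalSeries) / forwardIntervalSeconds
	return totalSeries, samplesPerSecond
}
```

For example, 300 spokes each shipping 50k active series at a 300s forwarding interval would put roughly 15M series and 50k samples/s on the receivers.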
Yes, I think we need to talk :) @saswatamcode @philipgough @berenss. We may be saying similar things in different words! Keep on progressing with this.
feature will be used by customers, irrespective of how it works internally in MCO.
It will be needed for a delightful customer experience.
/retest
@saswatamcode: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/sonarcloud | dc904a69ebe9475b90132eef8f35da338e2ed6ba | link | true | /test sonarcloud |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: coleenquadros, philipgough, saswatamcode, thibaultmg
- ~~OWNERS~~ [coleenquadros,philipgough,saswatamcode,thibaultmg]
