Docs: improve Planning capacity page

Open KMiller-Grafana opened this issue 2 years ago • 8 comments

[Screenshot: Screen Shot 2022-03-11 at 2 48 05 PM]

See the link under the heading "Monolithic mode". It links to the very next paragraph/section, so it is unhelpful to any reader who clicks on it. Just remove it.

  1. Rename this section from "Planning capacity" to something more like "Estimating resource usage." The info under the headings "Monolithic mode" and "Microservices mode" doesn't help with planning capacity. It does help a user estimate resource usage.
  2. Consider changing "utilization" to "usage."

KMiller-Grafana avatar Mar 11 '22 23:03 KMiller-Grafana

We should also mention using fast disks for ingesters and store-gateways (see https://github.com/grafana/mimir/issues/1722#issuecomment-1112789110).

pracucci avatar Apr 29 '22 12:04 pracucci

Maybe this will just be taken care of in https://github.com/grafana/mimir/issues/1988, but recently I was looking at the capacity planning page and was a bit confused when I read:


CPU: 1 core for every 300,000 series in memory
Memory: 2.5GB for every 300,000 series in memory
Disk space: 5GB for every 300,000 series in memory

Is the idea that I calculate the total number of active series in my cluster and then figure out the CPU, memory, and disk space requirements for all ingesters in the whole cluster? How do I figure out how many ingesters I need and what resources to allocate to each individual ingester? Do I arbitrarily pick a number of ingesters and then just divide the total resource requirements by the number of ingesters?
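
To make the question concrete, here is a minimal sketch of the "total, then divide" reading described above. It uses only the per-300,000-series ratios quoted from the page. The function name, the even split across ingesters, and the treatment of the series count as a cluster-wide total are assumptions for illustration, not something the docs or this thread confirm.

```python
# Ratios quoted from the capacity planning page (per 300,000 series in memory).
SERIES_PER_UNIT = 300_000
CPU_CORES_PER_UNIT = 1.0
MEMORY_GB_PER_UNIT = 2.5
DISK_GB_PER_UNIT = 5.0


def estimate_ingester_resources(in_memory_series: int, num_ingesters: int) -> dict:
    """Scale the quoted ratios to a cluster total, then split evenly.

    Assumption (hypothetical, not confirmed by the docs): in_memory_series is
    the total across all ingesters and load is spread evenly among them.
    """
    if in_memory_series < 0:
        raise ValueError("in_memory_series must be non-negative")
    if num_ingesters <= 0:
        raise ValueError("num_ingesters must be positive")

    units = in_memory_series / SERIES_PER_UNIT
    total = {
        "cpu_cores": units * CPU_CORES_PER_UNIT,
        "memory_gb": units * MEMORY_GB_PER_UNIT,
        "disk_gb": units * DISK_GB_PER_UNIT,
    }
    # Even split: each ingester gets total / num_ingesters of every resource.
    per_ingester = {name: value / num_ingesters for name, value in total.items()}
    return {"total": total, "per_ingester": per_ingester}


if __name__ == "__main__":
    # Example: 3,000,000 series in memory spread over 5 ingesters.
    print(estimate_ingester_resources(3_000_000, 5))
```

This sketch does not answer how many ingesters to pick, nor whether scrape interval or on-disk retention changes the disk figure. Those are exactly the open questions here.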

09jvilla avatar Jun 10 '22 00:06 09jvilla

For the ingesters specifically, is the disk space requirement at all impacted by how many hours of data I want to retain on disk?

09jvilla avatar Jun 10 '22 00:06 09jvilla

I wonder if ingester disk usage would be better estimated as a function of DPM rather than active series.

In any case, I think the ingester sizing that @09jvilla points out is using some unstated assumptions about the scrape interval and retention period.

Logiraptor avatar Jun 10 '22 19:06 Logiraptor

The capacity planning doc was initially conceived as a simplification, with a single metric per component to use for scaling (for ingesters I picked active series). I understand it was an oversimplification and it's showing its limits. My feeling is that documenting all the proper math would make it quite complicated for the user. That's why I would move forward with replacing it with a tool in which we encapsulate all our logic.

> I wonder if ingester disk usage would be better estimated as a function of DPM rather than active series.

Yes, it would.

pracucci avatar Jun 12 '22 12:06 pracucci

Effort estimated as high because the doc issue is not actionable in its current state and implementing it would require research.

osg-grafana avatar Jun 29 '22 14:06 osg-grafana

The guidelines for Alertmanager seem too low:

  • CPU: 1 CPU core for every 100 firing alerts
  • Memory: 1GB for every 100 firing alerts

Perhaps it was meant to say '100 firing alerts per second'? It does not seem right for a single alert to consume 10MB of RAM.
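
The 10MB figure comes from dividing the memory guideline by the alert count. A minimal sketch of that arithmetic, assuming decimal units (1GB = 1000MB); the helper name and the `mb_per_gb` parameter are hypothetical:

```python
# Figures quoted from the Alertmanager guideline above.
MEMORY_GB = 1
FIRING_ALERTS = 100


def memory_mb_per_alert(memory_gb: float, firing_alerts: int, mb_per_gb: int = 1000) -> float:
    """Return the memory implied per firing alert, in MB.

    Assumption: decimal units by default (1GB = 1000MB); pass mb_per_gb=1024
    to use binary units instead.
    """
    if firing_alerts <= 0:
        raise ValueError("firing_alerts must be positive")
    return memory_gb * mb_per_gb / firing_alerts


if __name__ == "__main__":
    print(memory_mb_per_alert(MEMORY_GB, FIRING_ALERTS))
```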

mac133k avatar Oct 04 '22 16:10 mac133k

> The guidelines for Alertmanager seem too low:

@mac133k You're right. See my PR to update it: https://github.com/grafana/mimir/pull/3132

pracucci avatar Oct 05 '22 13:10 pracucci