Head series limit
One of the most difficult aspects of operating Prometheus is staying on top of resource usage, which mostly means memory. The problem is that if Prometheus scrapes too many metrics, they all end up in memory, and recovering from that situation is not easy: scraped metrics create chunks, those get mmapped, and Prometheus might read them back on startup, recreating the high memory usage. It would be great if I could crunch the numbers, do some capacity planning and just tell Prometheus "don't ever have more than X time series".
I can set per-scrape sample limits, but there are two problems with that:
- doing capacity planning and global-level limiting by setting individual scrape job limits is difficult: the number of scrape jobs keeps changing, and the number of metrics exported by each job varies even with no changes to the system (some metrics are only exported when there are errors or other conditions enable them). This makes it hard to take a big number (our global limit) and derive limits per scrape job.
- current sample limits are hard limits: if a target exposes more metrics, they are all discarded. The reasoning for that is valid (we have no way of deciding which samples from the response to keep and which to discard, so it's just easier to discard everything), but from an operational point of view it limits the usefulness of those limits. If I have a critical service with a sample limit of 1000 and it goes from 999 scraped samples to 1001, then I lose all visibility into that service, all because it's one sample over the limit. It will take time to recover from that situation, during which I have no idea if my critical service is healthy or not.
Because of the above I would much prefer to have soft limiting in place. "Soft" here would mean that:
- the limit defines the number of time series being tracked by tsdb
- if we're under the limit we accept all appends
- if we're above the limit then we accept appends that don't create new time series
Since the problem is the number of time series in memory (HEAD), I could tell HEAD to never store more than N series at any point in time. If HEAD reaches the limit, then any append that would create a new time series will fail, while any append that just updates an existing series will work. This way Prometheus scrapes as much as it is allowed to, without the risk of running out of memory.
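To make those semantics concrete, here is a minimal, self-contained Go sketch of the append gate described above. This is not the PR's actual code; the names (`softLimitedHead`, `errTooManySeries`, the string series keys) are made up for illustration, and the real implementation lives inside Prometheus' TSDB head.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errTooManySeries = errors.New("head series limit reached")

// softLimitedHead is a toy stand-in for the Prometheus HEAD block.
type softLimitedHead struct {
	mu        sync.Mutex
	maxSeries int
	series    map[string][]float64 // samples keyed by a label-set fingerprint
}

func newSoftLimitedHead(maxSeries int) *softLimitedHead {
	return &softLimitedHead{maxSeries: maxSeries, series: map[string][]float64{}}
}

// append accepts the sample if the series already exists, or if creating the
// series keeps us within the limit; otherwise the new series is shed.
func (h *softLimitedHead) append(seriesKey string, v float64) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if _, ok := h.series[seriesKey]; !ok {
		if len(h.series) >= h.maxSeries {
			return errTooManySeries
		}
	}
	h.series[seriesKey] = append(h.series[seriesKey], v)
	return nil
}

func main() {
	head := newSoftLimitedHead(2)
	fmt.Println(head.append(`up{job="a"}`, 1)) // <nil>: new series, under the limit
	fmt.Println(head.append(`up{job="b"}`, 1)) // <nil>: new series, reaches the limit
	fmt.Println(head.append(`up{job="a"}`, 0)) // <nil>: existing series still accepted
	fmt.Println(head.append(`up{job="c"}`, 1)) // error: head series limit reached
}
```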
In a scenario where Prometheus runs happily with HEAD series below the limit, but one of the scrape jobs suddenly exposes a huge number of time series that pushes HEAD over the limit, this load shedding would stop us from running out of memory. Once the problematic service stops exporting those excess metrics, a block gets written and HEAD GC runs, the HEAD series count drops below the limit, and everyone is happy again.
Now, the part that tells HEAD to stay below the limit is easy, only a few lines. But adjusting Prometheus' behaviour for scrapes and rule evaluation requires a lot more code. I think this PR covers that, but there's a good chance I've missed something.
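To illustrate the scrape-side part (a hypothetical sketch, not the PR's code or the actual Prometheus scrape loop): the idea is that a "too many series" error skips only the offending sample instead of failing the whole scrape, while any other append error still aborts it as before.

```go
package main

import (
	"errors"
	"fmt"
)

var errTooManySeries = errors.New("head series limit reached")

type sample struct {
	seriesKey string
	value     float64
}

// appendFn stands in for whatever appends one sample to the head.
type appendFn func(seriesKey string, v float64) error

// ingestScrape appends every sample it can. Samples rejected by the soft
// head limit are counted and skipped rather than failing the whole scrape;
// any other error still aborts the scrape.
func ingestScrape(app appendFn, scraped []sample) (added, shed int, err error) {
	for _, s := range scraped {
		switch appendErr := app(s.seriesKey, s.value); {
		case appendErr == nil:
			added++
		case errors.Is(appendErr, errTooManySeries):
			shed++ // only the sample that needed a new series is dropped
		default:
			return added, shed, appendErr
		}
	}
	return added, shed, nil
}

func main() {
	// Fake appender: pretend the head is full, so only existing series succeed.
	app := func(key string, _ float64) error {
		if key == `new_metric{instance="x"}` {
			return errTooManySeries
		}
		return nil
	}
	added, shed, _ := ingestScrape(app, []sample{
		{seriesKey: `up{instance="x"}`, value: 1},
		{seriesKey: `new_metric{instance="x"}`, value: 42},
	})
	fmt.Printf("added=%d shed=%d\n", added, shed) // added=1 shed=1
}
```

The shed count could then be surfaced the same way other scrape errors are, e.g. on the /targets page.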
Ideally it would be great to also have a "soft limit" version of sample_limit per scrape target, but that might be an exercise for later.
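For completeness, a tiny sketch of what a soft per-target sample_limit could look like (hypothetical; as far as I know no such option exists, and a real version would also need to decide which samples to prefer, e.g. existing series over new ones):

```go
package main

import "fmt"

type sample struct {
	seriesKey string
	value     float64
}

// softSampleLimit keeps at most limit samples from a single scrape and
// reports how many were dropped, instead of rejecting the entire scrape
// the way the current hard sample_limit does.
func softSampleLimit(scraped []sample, limit int) (kept []sample, dropped int) {
	if len(scraped) <= limit {
		return scraped, 0
	}
	return scraped[:limit], len(scraped) - limit
}

func main() {
	scraped := []sample{
		{seriesKey: `http_requests_total{status="2xx"}`, value: 100},
		{seriesKey: `http_requests_total{status="5xx"}`, value: 3},
		{seriesKey: `debug_metric{id="1"}`, value: 1},
	}
	kept, dropped := softSampleLimit(scraped, 2)
	fmt.Printf("kept=%d dropped=%d\n", len(kept), dropped) // kept=2 dropped=1
}
```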
Opening this as a draft since I expect this to be a big change that might require some bigger discussion.
I strongly disagree with those changes. It makes it very difficult to understand the data that's in and the data that's missing. I don't think this feature belongs in Prometheus. I am open to finding alternatives, but scrapes should be ingested in full or not ingested at all, same for recording rules.
What alternatives would you consider @roidelapluie ?
Update: we have been running this in production for a while now and it saves us a huge headache of trying to keep Prometheus running without OOMing when people make mistakes with their labels.
> It makes it very difficult to understand the data that's in and the data that's missing.
I don't believe this is much different from your scrapes failing due to exceeding sample_limit, the body size limit, or an invalid response. When sample_limit is exceeded one needs to go and figure out why and what's changed. When Prometheus fails to scrape your metrics you need to go and manually inspect the output of your /metrics response and fix it. In all cases the actual error is on the /targets page, and the same is true for any errors caused by exceeding HEAD limits.
If the argument is that you query something, some metrics are there and some aren't, and that confusion is too much of a problem, then sure. Due to the nature of this patch some metrics could be ingested while others are not, and that might break things like histograms if they are cut in half. But that's only if one chooses to enable a limit, and by doing so chooses to be a little confused in exchange for not losing all metrics when Prometheus runs out of memory, which for me is a fair trade.
Still hoping to hear about any alternatives @roidelapluie
I agree with @roidelapluie. We do not want to accept this patch.
Partial ingestion is just too dangerous for the data, for exactly the reasons you mentioned: partial results can skew the actual query results in very unexpected ways. Yes, it's the user's choice to enable it, but it breaks some fundamental assumptions about system behaviour in ways that many end users will find surprising.
Here's a good demonstration of this for others looking at this thread.
```
sum(rate(http_requests_total{status="5xx"}[5m]))
/
sum(rate(http_requests_total[5m]))
```
Say you have ten instances. With a lost scrape you might miss out on data from one, but the overall ratio is still accurate.
With the way this patch implements partial ingestion, it seems to be at a metric-by-metric level. So in theory you could be missing just the `status="5xx"` metric, but not the `status="2xx"` metric. The end result is skewing of the actual query in unexpected ways.
Even if you fixed this to fully ingest by metric family there will still be problems. A simple example would be `some_errors_total` vs `some_actions_total`. You are still left with partial ingestion that would skew the data in non-transparent ways.
I'm not sure there are any alternatives that would be viable without breaking Prometheus design assumptions. It goes against the principle of least surprise.
Isn't the same design-breaking behaviour already supported in Prometheus via metric relabel rules? One can drop arbitrary time series there, resulting in the same issue you describe.
That is a form of "whataboutism".
We can't prevent all failure modes, but that doesn't mean we need to add more failure modes.