receive: Implement head series limits per tenant
- [ ] I added CHANGELOG entry for this change.
- [ ] Change is not relevant to the end user.
Changes
Add the ability to limit the total number of series in the TSDB head per tenant, per Receive replica. The limit can be specified as an additional HTTP header (X-Thanos-Series-Limit by default) on the remote write request.
As mentioned above, this limit is per tenant, per replica, which means that with the current implementation a tenant can write more series in total than the specified limit. For example, in a hashring with 3 Receive instances and a series limit of 100, the tenant can write at most 300 active series (assuming series are distributed equally across all 3 nodes). The effective total can be lower than 300 if the load is not distributed equally and one node hits its limit earlier, but it will never be lower than 100.
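For illustration only, here is a minimal sketch (not code from this PR) of how a client could attach the limit header to a remote write request. The endpoint, tenant header name, and helper name are assumptions; only the `X-Thanos-Series-Limit` header comes from this change:

```go
package example

import (
	"bytes"
	"fmt"
	"net/http"
)

// sendRemoteWrite is a hypothetical helper showing where the per-tenant head
// series limit header fits on a Prometheus remote write request. The body is
// expected to be a snappy-compressed remote write protobuf payload.
func sendRemoteWrite(endpoint, tenant string, seriesLimit int, body []byte) error {
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("Content-Encoding", "snappy")
	req.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")
	// Tenant header name is an assumption (Receive's default tenant header).
	req.Header.Set("THANOS-TENANT", tenant)
	// Per-tenant head series limit header introduced by this PR (default name).
	req.Header.Set("X-Thanos-Series-Limit", fmt.Sprintf("%d", seriesLimit))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("remote write failed: %s", resp.Status)
	}
	return nil
}
```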
Alternative
An alternative approach is to calculate an effective limit using the replication factor and the number of nodes in the hashring. The limit actually used for local TSDB writes would then be `defined_limit * replication_factor / num_nodes`.
So essentially, if the overall limit we want is 150 and we have 3 nodes with a replication factor of 1, we end up with a per-node local limit of 50 (150 * 1 / 3).
However, this assumes the data is distributed equally among all nodes; in practice, one node can hit the 50-series limit earlier than the others, effectively denying service before the user has actually hit the 150-series limit.
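For illustration, a minimal sketch of the effective-limit arithmetic described above (the function name is hypothetical, not from this PR):

```go
package example

// effectiveLocalLimit sketches the alternative approach: scale the user-defined
// overall limit by the replication factor and divide by the number of nodes in
// the hashring. With definedLimit=150, replicationFactor=1, numNodes=3 it
// returns 50, matching the example above.
func effectiveLocalLimit(definedLimit, replicationFactor, numNodes int) int {
	if numNodes <= 0 {
		return definedLimit
	}
	return definedLimit * replicationFactor / numNodes
}
```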
Verification
Tested locally with multiple configurations of Receive including split Router and Ingester mode.
I think this is awesome, and we really need to have more of these options 👍 From a technical view, your two possible solutions make sense. From a user perspective, it becomes a bit harder. In a dynamic environment, we end up with a dynamic limit. Also the limit will always be roughly best effort.
Just thinking out loud here: since all the components are capable of talking to each other, why not keep a small amount of shared state about the active series? That would remove a lot of the guessing and let us be very precise about the limit.
That said, I don't know whether that's even possible in a timely manner, and this solution works for me as a starting point as well.
Please, do not forget to add some documentation about this. 🙏
Also, I couldn't find a good definition in Prometheus or Thanos docs of what is considered an "active series". This hides key information from the reader and I believe we should add a link to it in the docs of this feature.
I found this one below in the Prometheus 1.8 storage documentation, but it's not easy to understand (when is a chunk closed?).
> A series with an open head chunk is called an active series.
This blog post from Grafana seems to have an easier-to-understand definition:
> For Prometheus, a time series is considered active if new data points have been generated within the last 15 to 30 minutes.
Maybe we should open an issue for Prometheus to define what's considered an "active series"?
@wiardvanrij I think what you said makes sense, and we might not even need shared state if we can somehow utilise the replication step to share this info. I'll explore that a bit and report back here if it doesn't work out.
@douglascamata Yes! We need good documentation around it, especially about the tradeoffs we are making. The definition of an "active" series in this PR's context is simple: it's the number of series currently in the head block of the tenant's TSDB instance, i.e. the number of unique series ingested since the last block was cut (a 2-hour window with default settings). I think the Grafana definition might be the same; it might just be that they cut a new block every 15 to 30 minutes?
The definition from Prometheus you mentioned is a bit old, from the Prometheus 1.8 days. The TSDB was completely rewritten in Prometheus 2.0, and with it the definitions changed.
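For illustration only, a minimal Go sketch of what "series currently in the head" means, reading the head series count from a local Prometheus TSDB; this is not code from this PR, and the exact `tsdb.Open` signature varies between Prometheus versions:

```go
package example

import (
	"fmt"

	"github.com/prometheus/prometheus/tsdb"
)

// headSeriesCount opens a local TSDB and reports the number of series in its
// head block, i.e. everything ingested since the last block was cut. The data
// directory path is an assumption for this sketch.
func headSeriesCount(dir string) (uint64, error) {
	// Signature may differ slightly across Prometheus versions; nil logger,
	// registerer, and stats are fine for a quick illustration.
	db, err := tsdb.Open(dir, nil, nil, tsdb.DefaultOptions(), nil)
	if err != nil {
		return 0, err
	}
	defer db.Close()

	n := db.Head().NumSeries()
	fmt.Printf("series currently in head: %d\n", n)
	return n, nil
}
```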
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.