thanos
gRPC connections metric has long label values
Thanos, Prometheus and Golang version used:
latest - v0.27.0
What happened:
While surveying long label values on our system, I discovered that Thanos generates the longest label value:
thanos_store_nodes_grpc_connections{external_labels="..."}
The maximum length is 154436 characters.
This is likely because of the large number of Prometheus instances we have writing to a cluster-wide S3 bucket.
What you expected to happen:
This label needs to be removed or modified to limit the length to something more sane.
How to reproduce it (as minimally and precisely as possible):
Have stores serving lots of buckets with lots of external labels.
Full logs to relevant components:
Anything else we need to know:
CC @hitanshu-mehta
I agree to remove the label external_labels. Using the endpoint address here might be a better solution to represent each store.
I'll work on this
Hi @yeya24 from what I see in:
https://github.com/thanos-io/thanos/blob/ee512aed1fece177b9257e80ba6c6c5d5986620a/pkg/query/endpointset.go#L193-L202
We have external_labels and store_type. Do you think it's safe to remove both or just external_labels?
> Using the endpoint address here might be a better solution to represent each store.
Could you please show me where I can take the endpoint address from?
Removing external labels only should be good.
You can get the endpoint address from https://github.com/thanos-io/thanos/blob/main/pkg/query/endpointset.go#L393. You might need to change the Update method, or something similar, to include the endpoint addresses.
We have discussed this in the past. The endpoint address was assumed not to be stable enough (e.g. the same Store API replica would have a different IP after a restart). On top of that, an IP does not tell you much when debugging the querier with this metric.
To me, we have two solutions:
A) Remove the external_labels label entirely.
B) Enforce a label limit by checking whether the external_labels label value is longer than X (let's say 1000 chars). If it is, cap it to <first 1000 chars of external labels>(...).
What about B?
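For illustration, option B could be sketched in Go roughly as follows. This is a minimal sketch under stated assumptions, not the Thanos implementation: the names truncateLabelValue and maxLabelValueLen are hypothetical, and the 1000-character limit is only the example value mentioned above.

```go
package main

import (
	"fmt"
	"strings"
)

// maxLabelValueLen is a hypothetical cap; 1000 is the example value from the discussion.
const maxLabelValueLen = 1000

// truncateLabelValue returns v unchanged when it has at most limit characters.
// Otherwise it returns the first limit characters followed by "(...)" to mark the cut.
// It counts runes rather than bytes so a multi-byte UTF-8 character is never split.
func truncateLabelValue(v string, limit int) string {
	if limit < 0 {
		limit = 0
	}
	r := []rune(v)
	if len(r) <= limit {
		return v
	}
	return string(r[:limit]) + "(...)"
}

func main() {
	// Simulate an oversized external_labels value like the one reported in this issue.
	long := strings.Repeat("a", 154436)
	capped := truncateLabelValue(long, maxLabelValueLen)
	fmt.Println(len(long), len(capped))
}
```

The suffix keeps a truncated value distinguishable from one that is exactly at the limit. Note that two stores whose external labels share the same first 1000 characters would end up with the same label value.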
If there are no better labels to use here, then +1 for B.
I would much prefer A. The gRPC metrics are useful to alert on, but I would use tracing data to do debugging.
One thing that would be more useful than just the IP would be to propagate the SRV record names through. This would make the dnssrv store flag lists more useful.
Ok, so my proposal would be to:
- Always trim external labels to e.g. 1000 chars.
- Add a []string flag to querier like querier.conn-metric.labels that allows specifying none, or any, of the labels external_labels, store_type, record_name (default external_labels,store_type), which will switch the labels that thanos_store_nodes_grpc_connections has.
This lets users choose the labels they want and also avoids breaking compatibility.
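As a rough sketch of how such a flag value could drive the metric's labels: the code below validates the requested label names and picks the matching values for an endpoint. It is not taken from Thanos; selectConnLabels, labelValues and allowedConnLabels are hypothetical names, and the sample values are made up for illustration.

```go
package main

import "fmt"

// allowedConnLabels holds the label names mentioned in the proposal above.
var allowedConnLabels = map[string]bool{
	"external_labels": true,
	"store_type":      true,
	"record_name":     true,
}

// selectConnLabels validates the names passed through the (hypothetical) flag
// and returns them in the order given, with duplicates dropped.
// An empty input is valid and means the metric carries no labels.
func selectConnLabels(requested []string) ([]string, error) {
	seen := make(map[string]bool, len(requested))
	out := make([]string, 0, len(requested))
	for _, name := range requested {
		if !allowedConnLabels[name] {
			return nil, fmt.Errorf("unsupported connection metric label %q", name)
		}
		if seen[name] {
			continue
		}
		seen[name] = true
		out = append(out, name)
	}
	return out, nil
}

// labelValues returns the values for the selected labels, in the same order,
// taken from everything known about one endpoint.
func labelValues(selected []string, all map[string]string) []string {
	vals := make([]string, len(selected))
	for i, name := range selected {
		vals[i] = all[name]
	}
	return vals
}

func main() {
	// Equivalent to passing only store_type to the proposed flag.
	selected, err := selectConnLabels([]string{"store_type"})
	if err != nil {
		panic(err)
	}
	// Illustrative values for a single endpoint.
	endpoint := map[string]string{
		"external_labels": `{cluster="a"}`,
		"store_type":      "sidecar",
		"record_name":     "_grpc._tcp.example.internal",
	}
	fmt.Println(selected, labelValues(selected, endpoint))
}
```

Rejecting unknown names at startup, rather than silently ignoring them, would surface typos in the flag value early.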
@bwplotka LGTM.
Closed by https://github.com/thanos-io/thanos/pull/5785