
Add a new metric to measure latency of Range requests for different number of returned keys

Open michaljasionowski opened this issue 4 years ago • 7 comments

Range request is used in etcd both for getting a single key and for getting multiple keys. The execution time is highly dependent on the number of elements returned. This means that high latency of a Range request is concerning when only one key is returned, but completely expected for requests that return a large number of keys.

To better assess the performance of etcd and to have a metric that is better suited for use as an SLI, we can add a new metric only for Range requests that will collect information about both the latency and the size (in terms of the number of keys returned) of a request. This metric will be a histogram over latency with 2 labels: keys_len and range_end_given. keys_len will specify the number of keys returned by the request and will be restricted to the following values:

  • 0 - for cases, when no key was found
  • 1
  • >1
  • >10
  • >100
  • >1000
  • >10000
  • >100000
  • >1000000

These buckets will not be cumulative; that is, bucket >10 will contain responses with a number of keys greater than 10 and not greater than 100. range_end_given will specify whether a non-empty range end was given in the request. In other words, range_end_given indicates whether the user is requesting a range of keys rather than a single key. This means that for range_end_given="false" the only possible values of keys_len are "0" and "1".
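
For illustration, a minimal Go sketch of how a returned-key count could be mapped to these non-cumulative keys_len values (the helper name keysLenLabel is hypothetical, not part of the proposal):

	// keysLenLabel maps the number of keys returned by a Range request
	// to the proposed non-cumulative keys_len label value.
	func keysLenLabel(n int) string {
		switch {
		case n == 0:
			return "0"
		case n == 1:
			return "1"
		case n <= 10:
			return ">1"
		case n <= 100:
			return ">10"
		case n <= 1000:
			return ">100"
		case n <= 10000:
			return ">1000"
		case n <= 100000:
			return ">10000"
		case n <= 1000000:
			return ">100000"
		default:
			return ">1000000"
		}
	}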

michaljasionowski avatar Nov 03 '21 15:11 michaljasionowski

Hey @michaljasionowski, thanks for posting this issue.

I agree that the current request latency metric from gRPC, grpc_server_handling_seconds, is not very useful for defining a strict SLI, as there is no way to distinguish between requests that return a single key and those that return thousands of keys. I support adding this metric; however, I think we should discuss it a bit more to make sure we follow Prometheus best practices. ccing @lilic

Couple of questions:

  • Could you propose a name for the metric? Would etcd_server_range_duration_seconds make sense?
  • For the keys_len label I think we could do better to follow the Prometheus naming convention. Instead of putting a ">" sign in the label value, let's encode the comparison in a label name suffix. Also, instead of using ">", let's use "<=". What do you think about the name keys_le (le stands for "less than or equal to") with values 0, 1, 10, 100, 1000, 10000, 100000, 1000000, +Inf? We need to introduce an additional bucket for +Inf; however, we can avoid distinguishing between 1 and >1.
  • For range_end_given I agree that it will be useful to distinguish between a get of one key and a range that scanned a (possibly large) key space and returned one key. However, I think that the "range" prefix is redundant. Would just end_given make sense?
  • Have you thought about which histogram latency buckets you would like to use? We could start with the default ones, as they are already used by the grpc_server_handling_seconds metric. Those are 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, +Inf.
  • As you didn't specify, I assume that this metric will only be used to measure the latency of Range requests that end with success. Normally, request metrics include the response code as one of the labels (for example grpc_code for grpc_server_handling_seconds); however, here we care about detecting slow responses, so including errors would introduce unwanted noise.

With those changes the metric would look like:

# TYPE etcd_server_range_duration_seconds histogram
# HELP etcd_server_range_duration_seconds A histogram of the Range request duration.
etcd_server_range_duration_seconds_bucket{keys_le="0", le="0.05"} 1
etcd_server_range_duration_seconds_bucket{keys_le="1", le="0.05"} 1
etcd_server_range_duration_seconds_bucket{keys_le="10", le="0.05"} 1
etcd_server_range_duration_seconds_bucket{keys_le="100", le="0.05"} 1
etcd_server_range_duration_seconds_bucket{keys_le="1000", le="0.05"} 1
etcd_server_range_duration_seconds_bucket{keys_le="10000", le="0.05"} 1
etcd_server_range_duration_seconds_bucket{keys_le="100000", le="0.05"} 1
etcd_server_range_duration_seconds_bucket{keys_le="1000000", le="0.05"} 1
etcd_server_range_duration_seconds_bucket{keys_le="+Inf", le="0.05"} 1
....

serathius avatar Nov 03 '21 16:11 serathius

Another note about cardinality. We will have

  • 10 label pairs for keys_le and end_given
  • 12 values for the le label, plus 2 series for _sum and _count

This gives us 10 × (12 + 2) = 140 timeseries; however, this should still be lower than grpc_server_handling_seconds, as we don't include response codes.

serathius avatar Nov 03 '21 16:11 serathius

Hi @serathius, thanks for the comments and clarifications. They all make sense.

Yes, I don't expect to include failed requests in this metric. As you noticed, they would change the distribution we are interested in, and we already have other metrics to monitor whether requests are failing.

To sum up, here is the definition of the proposed metric:

	successfulRangeHandlingTime := prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "etcd",
		Subsystem: "server",
		Name:      "range_duration_seconds",
		Help:      "A histogram of the Range request duration.",
		// Default buckets []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
	},
		[]string{"keys_le”, “end_given”})

michaljasionowski avatar Nov 03 '21 17:11 michaljasionowski

cc @hexfusion

serathius avatar Nov 05 '21 15:11 serathius

cc @ptabor

serathius avatar Jan 12 '22 11:01 serathius

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 13 '22 07:04 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 13 '22 23:07 stale[bot]