dapr
dapr copied to clipboard
Low cardinality metrics issues
Intro
With the low cardinality metrics enabled, a lot of metrics are losing their value and in some cases even become completely unusable as they do not have any useful information provided except the application ID. Monitoring systems and dashboards previously using high cardinality settings, are broken and there's seems no suitable replacement can be found with low cardinality enabled. The value of having a low cardinality setting is understandable and the use-cases differ between the users, however having the option to have a middle ground would be great. Please consider some scenarios where low cardinality might render metrics unusable (see below).
Test method
I've done an investigation by building a few simple applications using the code samples mentioned below. Using simple applications to send/receive HTTP requests using 2 different methods, based on the examples provided in: https://docs.dapr.io/developing-applications/building-blocks/service-invocation/howto-invoke-discover-services/#invoke-the-service (plain HTTP calls) https://github.com/dapr/go-sdk/tree/main/examples/service/serving/http (SDK server) https://github.com/dapr/go-sdk/tree/main/examples/service/client (SDK client)
Metrics checked
Application: service-invocation-http-client
Metrics: dapr_http_server_latency_
(_sum
, _count
, _bucket
)
High cardinality:
dapr_http_server_latency_bucket{app_id="service-invocation-http-client",method="POST",path="/post-service-invocation-plain",status="200",le="100"} 1160
dapr_http_server_latency_bucket{app_id="service-invocation-http-client",method="GET",path="/get-service-invocation-plain",status="500",le="100"} 321
dapr_http_server_latency_bucket{app_id="service-invocation-http-client",method="DELETE",path="/delete-service-invocation-plain",status="500",le="100"} 299
dapr_http_server_latency_bucket{app_id="service-invocation-http-client",method="PUT",path="/put-service-invocation-plain",status="200",le="100"} 1149
Low cardinality:
dapr_http_server_latency_bucket{app_id="service-invocation-http-client",method="InvokeService/service-invocation-http-server",status="200",le="100"} 3584
Metrics: dapr_http_server_request_count
High cardinality:
dapr_http_server_request_count{app_id="service-invocation-http-client",method="GET",path="/get-service-invocation-plain",status="200"} 1742
dapr_http_server_request_count{app_id="service-invocation-http-client",method="PUT",path="/put-service-invocation-plain",status="200"} 1693
Low cardinality:
dapr_http_server_request_count{app_id="service-invocation-http-client",method="InvokeService/service-invocation-http-server",status="200"} 6057
Issue: With low cardinality calling all paths became merged into one single method name, while hiding the real path that is being called and each of those paths is likely to have a different latency. The same goes for the detection of issues of certain paths, for example, POST can produce errors, which might be hidden, and no way to know which path produces error responses. With latency metrics missing paths, monitoring will be unable to track which method call is slow or fast or having an issue, all percentiles become useless as well. Example:
- calls to
/put-service-invocation-plain
take 100ms - calls to
/get-service-invocation-plain
take 5ms
Applications: service-invocation-http-server
, service-invocation-http-client
Metrics: dapr_http_server_request_bytes_
(_sum
, _count
, _bucket
)
High cardinality:
dapr_http_server_request_bytes_bucket{app_id="service-invocation-http-client",le="4096"} 9702
Low cardinality:
dapr_http_server_request_bytes_bucket{app_id="service-invocation-http-client",le="4096"} 4950
Applications: service-invocation-http-server
, service-invocation-http-client
Metrics: dapr_http_server_response_
(_sum
, _count
, _bucket
)
High cardinality:
dapr_http_server_response_bytes_bucket{app_id="service-invocation-http-client",le="4096"} 9702
dapr_http_server_response_bytes_count{app_id="service-invocation-http-client"} 9702
dapr_http_server_response_bytes_count{app_id="service-invocation-http-server"} 1197
Low cardinality:
dapr_http_server_response_bytes_bucket{app_id="service-invocation-http-client",le="4096"} 8340
dapr_http_server_response_bytes_count{app_id="service-invocation-http-client"} 8340
dapr_http_server_response_bytes_count{app_id="service-invocation-http-server"} 998
Issue: The metrics don't record any labels other than the current app_id
, for both plain HTTP client and server. Unable to differentiate the calls being made at all, only having the current application ID, missing source/destination app ID and the path/method being called.
Applications: service-invocation-http-server
Metrics: dapr_http_client_roundtrip_latency_
(_sum
, _count
, _bucket
)
High cardinality:
dapr_http_client_roundtrip_latency_bucket{app_id="service-invocation-http-server",method="GET",path="dapr/config",status="404",le="3"} 1
dapr_http_client_roundtrip_latency_bucket{app_id="service-invocation-http-server",method="DELETE",path="delete-service-invocation-plain",status="500",le="13"} 691
dapr_http_client_roundtrip_latency_bucket{app_id="service-invocation-http-server",method="GET",path="get-service-invocation-plain",status="200",le="6"} 2857
Low cardinality:
dapr_http_client_roundtrip_latency_bucket{app_id="service-invocation-http-server",status="404",le="6"} 1
Issue: With the low cardinality, again all calls are merged into one, in this case even worse, the only labels left are app_id
(name of the current application) and status
. Making metrics unusable and unable to distinguish the specific calls and troubleshoot properly.
In the example samples above, the 404
status calls are coming from internal calls to the path dapr/config
, which are not part of the application. In low cardinality mode, these 404
errors become part of the application and might falsely cause issues in monitoring systems making them think something is wrong with the actual application.
Applications: service-invocation-http-server
Metrics: dapr_http_client_sent_bytes_
(_sum
, _count
, _bucket
)
High cardinality:
dapr_http_client_sent_bytes_bucket{app_id="service-invocation-http-server",method="GET",path="dapr/config",status="",le="65536"} 1
dapr_http_client_sent_bytes_bucket{app_id="service-invocation-http-server",method="DELETE",path="delete-service-invocation-plain",status="",le="16384"} 3576
dapr_http_client_sent_bytes_bucket{app_id="service-invocation-http-server",method="PUT",path="put-service-invocation-plain",status="",le="65536"} 3576
Low cardinality:
dapr_http_client_sent_bytes_bucket{app_id="service-invocation-http-server",status="",le="16384"} 12241
Issue: As in the above example, the same issue can be observed here, no way of knowing what path is being called, can't measure bytes sent. In addition, internal dapr/config
is mixed into the application metric and the metric value is now incorrect.
Note: Using go-sdk for both client and server service invocations produces the same results except the path
value as a full Dapr path call with version etc.
Additional Files
Attaching dumps from the 9090
port from all 8 applications used in the testing.
plain-client-high.txt plain-client-low.txt plain-server-high.txt plain-server-low.txt sdk-client-high.txt sdk-client-low.txt sdk-server-high.txt sdk-server-low.txt
First, I propose that we don't make low cardinality the default behavior in 1.14.
Instead, we use 1.14 to implement the following in low cardinality:
- Allow regex of URL patterns to be used in low cardinality too, if it matches, then add the path's pattern to the metric. If there is no match, then don't add the path (like it is today).
- Always add the HTTP verb in low cardinality.
- (If possible) always include the app's gRPC method being invoked
Addressing another point, low cardinality should be fine with adding path=/dapr/config or path=/healthz and any other Dapr-specific paths without an identifier.
I think there is a better solution than the custom regex. Dapr (in both low and high cardinality) would apply a series of builtin regexes to remove identifiers from the URL paths. If the path remains unmodified, low cardinality defaults to hiding the path while high cardinality defaults to showing the path in metrics. If custom regex and ID removal are both enabled, the built in ID removal is only applied if no custom regex matches.
Examples:
/customers/092e69cc-9950-49c1-ac53-ba2c5b47ab12 -> /customers/UUID
/customers/[email protected] -> /customers/EMAIL
/orders/123 -> /orders/NUMBER
/orders/092e69cc-9950-49c1-ac53-ba2c5b47ab12/items/100 -> /orders/UUID/items/NUMBER
In summary:
- [ ] Don't make low cardinality default in 1.14
- [ ] Add HTTP verb for all HTTP calls
- [ ] Add all known Dapr paths to the app that don't contain an ID:
/dapr/config
,/healthz
,/subscriptions
- [ ] Offer regex in low cardinality mode as well as opt-in
- [ ] Offer a configuration to opt-in for removal of IDs in low & high cardinality modes
- [ ] (If possible) always include the app's gRPC method being invoked
Ideally, we converge into low cardinality with automatic ID removal being the default behavior.
Ideally, we converge into low cardinality with automatic ID removal being the default behavior
Do you mean converge into high
cardinality?
Ideally, we converge into low cardinality with automatic ID removal being the default behavior
Do you mean converge into
high
cardinality?
I mean that eventually (not in 1.14) we will have low cardinality with automatic ID removal for HTTP paths as the default behavior. I think high cardinality will be opt-in if not removed.
I think Artur is saying that the low cardinality will be the default since that with the built-in regex identifiers only the remaining paths that the identifiers are not matched will be shown as the "low cardinality" metric we know today.
The usage of regexs is going to be a problem for both performance and usability reasons. There's also no way to be able to create regexs that match every kind of ID perfectly. What if the ID is a slug ("/blog/my-post-title"), a phone number (every country has its own different format), a national ID card number (again different per country/state)?
I think there's some additional clarification that was provided, based on the context from the issue and PR: #6919 #6723 #6581
- Every new "pivot" that gets added to the metrics causes significant memory usage, since it creates a new bucket. We have seen Dapr crashing in production with OOMs after using 500MB or more of memory just for the metrics subsystem, as Dapr was used for service invocation against RESTful endpoints (where an ID was part of the URL).
The point of low cardinality was indeed to have as little cardinality as possible. Anything that adds a bucket for each HTTP verb, or partial URL, is not "low" cardinality. - There's also a matter of symmetry with gRPC APIs. Even before the changes in Dapr 1.13, gRPC APIs were always low-cardinality, due to the fact that the "URL" was just the method name. With the low cardinality change implement for HTTP, the behavior is now consistent.
- Users have always been able to use regular expressions to "trim" cardinality for metrics emitted by the HTTP endpoints, but:
- There's the old mantra that if you try to solve a problem with a regular expression, now you have two problems (see Jeff Atwood)
- Regexs have a non-insignificant performance impact
I understand that there are users who do want to see higher cardinality for HTTP metrics. This is why we maintained a configuration option metrics.increasedCardinality
which preserves the same behavior as Dapr 1.12. The default value for that changes to false
in 1.14, but that is just the default value, and users can switch a configuration option.
Following up on the feedback above we would like to propose a few adjustments to the initial proposal and we will work on the items:
- [x] Open a Dapr proposal for path matching API. https://github.com/dapr/proposals/pull/58
- [x] Keep Low cardinality will be the default on 1.14
- [ ] Include HTTP Verbs again
- [ ] Include well-known Dapr paths (/dapr/config , /healthz , /subscriptions , etc)
- [ ] Include status codes grouped: 2XX , 3XX , 4XX ,5XX
- [ ] Add a new Top-level increasedCardinality config, which will be used if no app-level config is set.
I think there's some additional clarification that was provided, based on the context from the issue and PR: #6919 #6723 #6581
Every new "pivot" that gets added to the metrics causes significant memory usage, since it creates a new bucket. We have seen Dapr crashing in production with OOMs after using 500MB or more of memory just for the metrics subsystem, as Dapr was used for service invocation against RESTful endpoints (where an ID was part of the URL). The point of low cardinality was indeed to have as little cardinality as possible. Anything that adds a bucket for each HTTP verb, or partial URL, is not "low" cardinality.
There's also a matter of symmetry with gRPC APIs. Even before the changes in Dapr 1.13, gRPC APIs were always low-cardinality, due to the fact that the "URL" was just the method name. With the low cardinality change implement for HTTP, the behavior is now consistent.
Users have always been able to use regular expressions to "trim" cardinality for metrics emitted by the HTTP endpoints, but:
- There's the old mantra that if you try to solve a problem with a regular expression, now you have two problems (see Jeff Atwood)
- Regexs have a non-insignificant performance impact
I understand that there are users who do want to see higher cardinality for HTTP metrics. This is why we maintained a configuration option
metrics.increasedCardinality
which preserves the same behavior as Dapr 1.12. The default value for that changes tofalse
in 1.14, but that is just the default value, and users can switch a configuration option.
Regarding point (1), I made some comments in the proposal by @nelson-parente that will give a way for users and platforms to keep the existing behavior (if so they want).
Now on point (2), when discussing gRPC<->HTTP for service invocation, Dapr maintainers have moved away from trying to match HTTP and gRPC protocols a while ago, making it a non-issue. One of the consequences of that decision is that Dapr does not convert HTTP to gRPC or vice versa when invoking services, meaning the requestor's app must invoke with the same protocol that the receiving app listens to (sidecar to sidecar is always over gRPC but that is a "tunnel" for the original call and does not need to adhere to the app's protocol).
This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.
This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 67 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.