aws-app-mesh-roadmap
Feature Request: Improve AWS metrics extension documentation
If you want to see App Mesh implement this idea, please upvote with a :+1:.
Tell us about your request
The learning curve for Envoy-related metrics is quite steep. It would be much gentler if the documentation explained what each AWS metrics extension metric means and laid out some basic golden paths for monitoring App Mesh virtual nodes.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
My team is in the process of adopting App Mesh into our infrastructure. We run an ECS stack with a traditional ELB setup. As part of the transition, we are reworking our internal alarms and application health monitoring to account for ELBs going away and being replaced by Envoy. Selecting which metrics to use, and deciding how to aggregate them, was arduous.
The introduction of Envoy means that there are now more legs of traffic involved in requests going to and from your application. These new legs carry different labels depending on the context in which you describe them. The AWS document "Monitoring your application using Envoy metrics" does a pretty good job of describing these legs and defining nomenclature for them. We ended up making our own internal diagram (attached below), which builds on the one provided by AWS, to make the distinction between egress/ingress and upstream/downstream clearer for our internal stakeholders.
The AWS-defined nomenclature is useful; however, the other documents do not adhere to it consistently, which can make understanding the Envoy metrics a challenge. Some of the other documents in question are as follows:
One example is the section label "Metrics Related to Outbound Traffic" in the document on exporting metrics. "Outbound" can have two meanings depending on the context: egress and upstream could both be described as outbound, relative either to Envoy or to the application. Additionally, the metrics in the metrics extension docs are defined with something akin to "The number of HTTP requests to an upstream target that resulted in a 2xx HTTP response". Upstream/downstream can describe requests on both the ingress and the egress side of an application. In the world of ELBs, a target is one of the nodes being proxied by the load balancer, not one of the application's dependencies. After reading the various documents multiple times, I was eventually able to infer that outbound/inbound meant egress/ingress and that the "upstream target" specifically referred to the leg of egress traffic that is upstream of Envoy.
If these metric descriptions were cleaned up a bit and then tied back to a diagram like the one attached, understanding them would be much faster. I also believe that documenting how the extension aggregates are created, such as which combinations of listener/cluster metrics are rolled up, would go even further toward making them understandable.
It would also be helpful to have some opinionated recommendations from AWS about which metrics are the most useful for replacing ELB metrics when monitoring application health. The landscape of metrics from an ELB versus Envoy is very different, and there are not really any true analogues. AWS metrics extension metrics need to be rolled up entirely differently to get a picture similar to ELB metrics. After spending a lot of time grokking the AWS documentation, it is apparent that the metrics extension is one of the better paths forward and the most "out of the box", but it does take a fair bit of time to understand. If there were a guide for creating ELB-like metric rollups out of the metrics extension, then monitoring App Mesh would be much friendlier to newcomers.
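To make the kind of rollup meant here concrete, below is a minimal sketch in Python. It assumes the status-class counters quoted above (requests to an upstream target that resulted in a 2xx response, and so on) have already been summed per virtual node over one alarm period. The `EgressWindow` type, its field names, and the `error_rates` function are hypothetical illustrations, not names from the metrics extension or any AWS API, and which published metrics feed each field is left as an assumption.

```python
from dataclasses import dataclass


@dataclass
class EgressWindow:
    """Hypothetical container: counter sums for one virtual node over one period."""
    http_2xx: float
    http_3xx: float
    http_4xx: float
    http_5xx: float


def error_rates(window: EgressWindow) -> dict:
    """Turn raw status-class counts into ELB-style error-rate ratios."""
    total = (window.http_2xx + window.http_3xx
             + window.http_4xx + window.http_5xx)
    if total == 0:
        # No traffic in the period: report zero rather than dividing by zero.
        return {"total": 0.0, "4xx_rate": 0.0, "5xx_rate": 0.0}
    return {
        "total": total,
        "4xx_rate": window.http_4xx / total,
        "5xx_rate": window.http_5xx / total,
    }


if __name__ == "__main__":
    # Example values are made up for illustration only.
    print(error_rates(EgressWindow(http_2xx=90, http_3xx=0, http_4xx=5, http_5xx=5)))
```

Note that if these counters describe the egress leg upstream of Envoy, as inferred above, the resulting ratios describe how a node's dependencies are responding rather than the health of the node itself, which is exactly the kind of distinction a guide from AWS would need to spell out.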
Thanks for reading!
Attachments