
[APM] Runtime metrics for all agents in the APM App


Summary of the problem

Most of the APM agents collect runtime metrics that customers can visualize via the apm-contrib dashboards. The Java agent is currently the only agent that surfaces this runtime performance data in the curated UI, via a JVMs tab that lists each reporting instance of a service. We should have a similar page for all of the other agents, so that the metrics they collect are surfaced in the curated UI as well.

List known (technical) restrictions and requirements

For the JVM page specifically, we chose a tabular layout that lists individual instances rather than a single chart with one line per instance, because the number of reporting instances can be large. That assumption likely holds for the other agents as well. We should be able to surface the runtime performance data captured by the agents in the APM app in a way that fits each language ecosystem.
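As a point of reference, a per-instance table like the JVMs tab maps naturally onto a terms aggregation on `service.node.name`. Here is a minimal sketch, assuming the default `apm-*` metric indices and the Java agent's `jvm.memory.heap.used` field (other agents report different field names):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// One bucket per reporting instance (service.node.name), each with an
// average heap-used value -- the shape a tabular per-instance view needs.
async function perInstanceHeapUsed(serviceName: string) {
  const response = await client.search({
    index: 'apm-*-metric-*', // assumption: default APM metric indices
    size: 0,
    query: {
      bool: {
        filter: [
          { term: { 'service.name': serviceName } },
          { range: { '@timestamp': { gte: 'now-15m' } } },
        ],
      },
    },
    aggs: {
      instances: {
        terms: { field: 'service.node.name', size: 500 },
        aggs: {
          heap_used: { avg: { field: 'jvm.memory.heap.used' } },
        },
      },
    },
  });
  return response.aggregations;
}
```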

References

nehaduggal opened this issue on Mar 12, 2020

Pinging @elastic/observability-design (design)

elasticmachine commented on Mar 12, 2020

Here is a suggestion for how we could design this by leveraging existing observability UI components:

  • Visualize instances using waffle explorer from metrics UI.
  • Allow users to see how the instances are performing by bringing multiple metrics:
    • Transaction metrics (requests/min, response time, error rate).
    • Runtime metrics like GC%, Gen 0 size, etc. (these will differ slightly per runtime).
    • Container metrics (if available).
    • Host metrics (if available).
  • Allow users to group by multiple dimensions (a query sketch follows this list):
    • APM service attributes (service version, runtime version, cloud availability zone, etc.).
    • Container attributes like image name.
    • K8s attributes like availability zone, pod name, etc.
    • Datacenter, whether a cloud datacenter or on-premises (for on-premises it would be nice to derive it from IP masks or hostname naming patterns, but that might have to be manual).
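A hedged sketch of the grouping idea above, using a composite aggregation over two of the suggested dimensions; the field names (`service.version`, `cloud.availability_zone`, `service.node.name`) are assumed from ECS:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Groups instances by two ECS dimensions at once; each bucket counts the
// distinct instances (service.node.name) it contains.
async function instancesByVersionAndZone() {
  const response = await client.search({
    index: 'apm-*-metric-*',
    size: 0,
    aggs: {
      groups: {
        composite: {
          size: 100,
          sources: [
            { version: { terms: { field: 'service.version' } } },
            { zone: { terms: { field: 'cloud.availability_zone' } } },
          ],
        },
        aggs: {
          instance_count: { cardinality: { field: 'service.node.name' } },
        },
      },
    },
  });
  return response.aggregations;
}
```

A composite aggregation pages through all dimension combinations, which suits a large instance count better than a single deeply nested terms aggregation.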

This design would let us reuse familiar patterns where they are relevant (service instances behave much like infrastructure hosts).

Linking from the service view to the infrastructure metrics would help SREs understand service performance across the fleet and how it relates to the performance of the infrastructure hosting it, especially during incidents.

[Attached mockup: Test - Service Infrastructure]

alex-fedotyev commented on May 6, 2020

@sorantis brought up a couple of interesting points about the proposal above:

  • What kind of drill-down would be expected from the list of instances? Would it go to an APM page with instance details? How would it link to infrastructure UI views such as the container or host view?
  • Idea: add an anomaly score/severity to the list of metrics for each instance, similar to duration or error rate.

alex-fedotyev commented on May 8, 2020

cc @lreuven

graphaelli commented on May 11, 2020

Added design issue: https://github.com/elastic/apm/issues/301

alex-fedotyev commented on Jul 21, 2020

I'm bringing this back up as an opportunity to implement an updated metrics experience in the near term, one that adds a service-instance-level breakdown and the additional metrics listed for each agent below. I imagine a few agents are missing from the list, since it dates from when this issue was first created.

With the switch to Elastic Charts, there should be no blockers on the visualization side. From a design perspective, there may need to be some guidance on color palettes and on how the visualizations should be composed and laid out. Additionally, I imagine there should be a suggested layout for the overview/list of instances, similar to the Java JVM metrics experience.
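For illustration, a minimal @elastic/charts sketch of a per-instance line chart; the data shape, accessor names, and MB units here are assumptions for the example, not the actual APM app code:

```tsx
import React from 'react';
import { Chart, Settings, Axis, Position, ScaleType, LineSeries } from '@elastic/charts';

// Hypothetical sample data: one series per instance, split on the "g" key.
const data = [
  { x: 1608112800000, y: 512, g: 'instance-a' },
  { x: 1608112860000, y: 530, g: 'instance-a' },
  { x: 1608112800000, y: 498, g: 'instance-b' },
  { x: 1608112860000, y: 505, g: 'instance-b' },
];

export const HeapUsedChart = () => (
  <Chart size={{ height: 200 }}>
    <Settings showLegend />
    <Axis id="time" position={Position.Bottom} />
    <Axis id="heap" position={Position.Left} title="Heap used (MB)" />
    <LineSeries
      id="heap-used"
      xScaleType={ScaleType.Time}
      yScaleType={ScaleType.Linear}
      xAccessor="x"
      yAccessors={['y']}
      splitSeriesAccessors={['g']}
      data={data}
    />
  </Chart>
);
```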

Overall I think the UI team should be able to pick this up in https://github.com/elastic/kibana/issues/63573 and ask either design or the agents teams for implementation guidance.

The long-term service instance metrics experience will be explored and designed in #301 in partnership with @alex-fedotyev.

Thoughts? @nehaduggal @sqren @alex-fedotyev


Node

  • Memory: RSS
  • Memory: Total Heap Allocated
  • Memory: Heap Used
  • Event loop delay (ms)
  • Active handles
  • Active requests
  • CPU user/system time/utilization
  • Garbage collection (Scavenge, MarkSweepCompact, incremental marking) - stretch goal
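For context, almost everything in the Node list above is available from Node's built-in APIs. A rough sketch follows; note that `_getActiveHandles`/`_getActiveRequests` are undocumented internals, shown only to illustrate where "active handles/requests" would come from:

```ts
import { monitorEventLoopDelay } from 'perf_hooks';

// Event-loop delay is sampled continuously by an interval histogram.
const loopDelay = monitorEventLoopDelay({ resolution: 10 });
loopDelay.enable();

function sampleRuntimeMetrics() {
  const mem = process.memoryUsage();
  const cpu = process.cpuUsage(); // cumulative user/system time in microseconds
  return {
    rss: mem.rss,
    heapTotal: mem.heapTotal,
    heapUsed: mem.heapUsed,
    eventLoopDelayMs: loopDelay.mean / 1e6, // histogram reports nanoseconds
    // Undocumented internals -- an assumption, not a stable API:
    activeHandles: (process as any)._getActiveHandles().length,
    activeRequests: (process as any)._getActiveRequests().length,
    cpuUserMicros: cpu.user,
    cpuSystemMicros: cpu.system,
  };
}

// Emit a sample every 10 seconds.
setInterval(() => console.log(sampleRuntimeMetrics()), 10_000);
```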

Ruby

  • Time in Garbage collection
  • Frequency of GC
  • Memory usage
  • Thread count

Python

  • Garbage collection
  • Memory usage (existing memory usage graph on the apm-contrib dashboard)
  • I/O
  • Thread count (Gauge)
  • Context switches (Counter):
    • Voluntary
    • Involuntary
  • Open file handles (Gauge)

Go

  • Metrics are already captured. TODO: add a custom dashboard in the apm-contrib repo.

PHP

  • No additional metrics defined

.Net

  • No additional metrics defined

formgeist commented on Dec 16, 2020

I would rather have us reconcile the new workflows being designed with the current metrics UI instead of tackling this multiple times. Once we have the UI, we can work on onboarding metrics from all the other agents.

nehaduggal commented on Dec 18, 2020