apm-server

On Elastic Cloud, provide troubleshooting assistance when APM signals stop flowing to the APM GUI

Open cyrille-leclerc opened this issue 2 years ago • 3 comments

Problem Statement

When using Elastic through Elastic Cloud (aka ESS), the APM signals may stop flowing to the APM GUI.

Causes may reside in the Elastic cluster or be external, and the APM agents may not report any error. An upgrade problem is one example.

When this problem happens, I would like to have assistance in the Elastic Cloud GUI to troubleshoot it.

This assistance could be composed of:

  • Status and version check: confirm that APM Server is running, verify that the version of APM Server and of the "APM integration" meet the requirements
  • Dashboard of the key metrics of APM Server
    • Ingest rate broken down by ingestion status
      • Success
      • Failure
        • Rejected by APM Server
        • Accepted by APM Server but failed to be persisted in Elasticsearch
      • Maybe ingest rate broken down by the signal type
      • Indications of connection failures like authentication failures
      • Timestamps of the last processed data point and of the last successfully ingested data point
  • Audit trail of the last log messages produced by APM Server, at least of the last warning & error log messages
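To make the requested breakdown concrete, here is a minimal sketch of how such a dashboard could derive per-status ingest rates and a "stalled" indication from two counter snapshots. Everything in it is an assumption for illustration: `IngestSnapshot`, its field names, and the 300-second silence threshold are hypothetical and do not correspond to actual APM Server metric names or Elastic Cloud behaviour.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class IngestSnapshot:
    """Hypothetical counters sampled at one point in time.

    The field names are illustrative only, not real APM Server metrics.
    """
    taken_at: datetime
    indexed: int             # events accepted and persisted in Elasticsearch
    rejected_by_server: int  # events rejected by APM Server
    failed_to_index: int     # events accepted but not persisted
    last_processed: Optional[datetime] = None
    last_indexed: Optional[datetime] = None


def ingest_rates(prev: IngestSnapshot, curr: IngestSnapshot) -> Dict[str, float]:
    """Return events per second for each ingestion status between two snapshots.

    Counter resets (for example after a restart) are not handled here and
    would show up as negative rates.
    """
    elapsed = (curr.taken_at - prev.taken_at).total_seconds()
    if elapsed <= 0:
        raise ValueError("curr must be taken after prev")
    return {
        "success": (curr.indexed - prev.indexed) / elapsed,
        "rejected_by_server": (curr.rejected_by_server - prev.rejected_by_server) / elapsed,
        "failed_to_index": (curr.failed_to_index - prev.failed_to_index) / elapsed,
    }


def looks_stalled(curr: IngestSnapshot, now: datetime,
                  max_silence_seconds: float = 300.0) -> bool:
    """Flag ingestion as stalled when nothing was indexed recently.

    The threshold is an arbitrary assumption, not a recommended value.
    """
    if curr.last_indexed is None:
        return True
    return (now - curr.last_indexed).total_seconds() > max_silence_seconds
```

Comparing `last_processed` with `last_indexed` in the same way would help distinguish "APM Server receives nothing" from "APM Server receives data but cannot persist it".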

Elastic Cloud screens related to APM Server health

image image image

Workflow to get more troubleshooting details by enabling "Logs and metrics"

image

image

image

cyrille-leclerc Apr 07 '22 20:04

In general, the Cloud UI aims to cover the general setup, sizing, and health of a deployment, while the solutions are responsible for providing more detailed monitoring.

Dashboard of the key metrics of APM Server

  • Ingest rate broken down by ingestion status
    • Success
    • Failure
      • Rejected by APM Server
      • Accepted by APM Server but failed to be persisted in Elasticsearch
    • Maybe ingest rate broken down by the signal type
    • Indications of connection failures like authentication failures
    • Timestamps of the last processed data point and of the last successfully ingested data point

Users can turn on metrics collection in Cloud. With that enabled, most if not all of this information is available in the Stack Monitoring UI: requests, responses, number of processed events, and success and failure rates.

Audit trail of the last log messages produced by APM Server, at least of the last warning & error log messages

Similarly, users can turn on logs collection and then view the corresponding logs in the Kibana UI.

Status and version check: confirm that APM Server is running, verify that the version of APM Server and of the "APM integration" meet the requirements

@ph is currently reworking the healthcheck API and UI, which should also give better insight into deployment health on Cloud. Since APM Server is now managed by Elastic Agent, this is a topic that Elastic Agent needs to solve.

simitt May 06 '22 13:05

@cyrille-leclerc please review my comments above, and let me know if you agree to close this issue. I don't believe there is any work required from the APM Server team for this at the moment.

simitt May 06 '22 13:05

@simitt sorry for the delay. I have captured usability notes on this ticket about the end of the troubleshooting workflow. @ph are you the right person to talk to about improving the workflow to access troubleshooting data on "APM Server"?

cyrille-leclerc May 16 '22 15:05