apm-server
On Elastic Cloud, would like troubleshooting assistance when APM signals stop flowing to the APM GUI
Problem Statement
When using Elastic through Elastic Cloud (aka ESS), APM signals may stop flowing to the APM GUI.
The cause may reside in the Elastic cluster or be external, and the APM agents may not report any error. A failed upgrade is one example of such a problem.
When this problem happens, I would like to have assistance in the Elastic Cloud GUI to troubleshoot it.
This assistance could be composed of:
- Status and version check: confirm that APM Server is running, and verify that the versions of APM Server and of the "APM integration" meet the requirements
- Dashboard of the key metrics of APM Server
  - Ingest rate broken down by ingestion status
    - Success
    - Failure
      - Rejected by APM Server
      - Accepted by APM Server but failed to be persisted in Elasticsearch
  - Maybe ingest rate broken down by the signal type
  - Indications of connection failures like authentication failures
  - Timestamps of the last processed data point and of the last successfully ingested data point
- Audit trail of the last log messages produced by APM Server, at least of the last warning & error log messages
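To make the requested breakdown concrete, here is a minimal sketch of how ingestion outcomes could be grouped by status and signal type, together with the two "last seen" timestamps. Everything in it is an assumption for illustration only: the `IngestEvent` record, its fields, and the status labels are hypothetical and do not correspond to actual APM Server metrics or APIs.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional

# Hypothetical status labels mirroring the breakdown listed above.
SUCCESS = "success"
REJECTED = "rejected_by_apm_server"
NOT_PERSISTED = "accepted_but_not_persisted"


@dataclass(frozen=True)
class IngestEvent:
    """Hypothetical record of one ingestion attempt (not a real APM Server type)."""
    timestamp: datetime
    signal_type: str   # e.g. "transaction", "span", "metric", "error"
    accepted: bool     # did APM Server accept the event?
    persisted: bool    # was the event stored in Elasticsearch?


def classify(event: IngestEvent) -> str:
    """Map one event to one of the three status labels."""
    if not event.accepted:
        return REJECTED
    if not event.persisted:
        return NOT_PERSISTED
    return SUCCESS


def summarize(events: Iterable[IngestEvent]) -> dict:
    """Count events by status and by (signal type, status), and track timestamps."""
    by_status: Counter = Counter()
    by_signal: Counter = Counter()
    last_processed: Optional[datetime] = None
    last_ingested: Optional[datetime] = None
    for event in events:
        status = classify(event)
        by_status[status] += 1
        by_signal[(event.signal_type, status)] += 1
        # Every event counts as "processed", whatever its outcome.
        if last_processed is None or event.timestamp > last_processed:
            last_processed = event.timestamp
        # Only fully persisted events count as "successfully ingested".
        if status == SUCCESS and (last_ingested is None or event.timestamp > last_ingested):
            last_ingested = event.timestamp
    return {
        "by_status": by_status,
        "by_signal": by_signal,
        "last_processed": last_processed,
        "last_ingested": last_ingested,
    }
```

A growing gap between `last_processed` and `last_ingested` would be the kind of signal that suggests data is reaching APM Server but not landing in Elasticsearch.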
Elastic Cloud screens related to APM Server health



Workflow to get more troubleshooting details by enabling "Logs and metrics"
In general, the Cloud UI aims to cover the general setup, sizing, and health of a deployment, while the solutions are responsible for providing more detailed monitoring.
Dashboard of the key metrics of APM server
- Ingest rate broken down by ingestion status
  - Success
  - Failure
    - Rejected by APM Server
    - Accepted by APM Server but failed to be persisted in Elasticsearch
- Maybe ingest rate broken down by the signal type
- Indications of connection failures like authentication failures
- Timestamps of the last processed data point and of the last successfully ingested data point
Users can turn on metrics collection in Cloud. With that enabled, most or all of this information is available in the Stack Monitoring UI: requests, responses, number of processed events, and success and failure rates.
Audit trail of the last log messages produced by APM Server, at least of the last warning & error log messages
Similarly, users can turn on logs collection and then consume the respective logs in the Kibana UI.
Status and version check: confirm that APM Server is running, verify that the version of APM Server and of the "APM integration" meet the requirements
@ph is currently reworking the health check API and UI, which should also give better insight into deployment health on Cloud. Since APM Server is now managed by Elastic Agent, this is a topic that Elastic Agent needs to solve.
@cyrille-leclerc please review my comments above, and let me know if you agree to close this issue. I don't believe there is any work required from the APM Server team for this at the moment.
@simitt sorry for the delay. I have captured usability notes on the end of the troubleshooting workflow in this ticket. @ph are you the right person to talk to about improving the workflow for accessing troubleshooting data on "APM Server"?