apm-server
Slow response of APM Server not being reflected in Stack Monitoring UI
With the switch to the new ES output handling, APM Server behaves differently when it is overloaded. Instead of returning 503 - Queue is full errors, it responds much more slowly to APM agent requests. This causes APM agents to eventually close their connections and log errors. APM Server itself logs nothing to indicate that it is overloaded and records no error metrics. The Stack Monitoring UI gives no indication that the server is overloaded either, apart from higher memory usage (because the requests are buffered in memory).
Parts that should be improved:
- make the number of max requests configurable (currently hardcoded to `10`) (https://github.com/elastic/apm-server/issues/7719).
- allow customizing the `yaml` box for the Elastic Cloud output via Fleet; since `8.0` a dedicated cloud output is configured that avoids public traffic, and its configuration is frozen
- record metrics indicating that more events are processed than can be ingested to ES; for example, track how many `available` channels are created and when a new channel becomes available for processing events.
- add log warnings when events are queued up
- add information to Stack Monitoring UI or ship with pre-built monitoring visualizations
Scope for 8.3:
- ensure metrics are collected so they can be surfaced in visualizations
- research how to surface this to the customer without changing the existing Stack Monitoring UI