apm-server monitoring: expose storage limit and disk metrics for TBS monitoring

From comment https://github.com/elastic/apm-server/issues/14247#issuecomment-2576116925

8.x and 9.x: As apm-server exposes lsm_size and value_log_size as monitoring metrics, expose the configured storage limit as well. It will then be possible to plot db size vs storage limit, which removes the need to dig into the logs for the configured storage limit.

9.x: Add monitoring metrics to monitor the disk utilization check.

It is up to the implementer to decide what to emit, whether a single combined metric or a few separate metrics. The actual work involves adding the metric in apm-server code and updating the relevant mappings in the ES, integrations and metricbeat repos. See https://github.com/elastic/apm-server/issues/13475 for an example.

Feb 03 '25 18:02 carsonip

This is important, as we would like to add some UI to tell customers how much disk they are using: https://github.com/elastic/kibana/issues/226600

Jul 31 '25 13:07 raultorrecilla

This is less straightforward than before. In 8.x there is one configured storage limit, and we compare db size against it. From 9.0, however, the storage limit defaults to 0, and instead of comparing db size against it, we compare disk_used against disk_total so that TBS doesn't write to the last 20% of the disk. That check involves neither db size nor storage limit. I'm a bit hesitant to plumb these implementation details into monitoring metrics and all the way to the UI. We should rethink what should be exposed.
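
To make the difference concrete, here is a minimal Go sketch of the two checks as described above. It is not apm-server's actual code: the function name, its parameters, and the behaviour when disk information is missing are assumptions for illustration, and it treats the two checks as alternatives, which this thread does not confirm.

```go
package main

import "fmt"

// diskUsageThreshold mirrors the "last 20% of the disk" rule described above:
// writes stop once 80% or more of the disk is in use.
const diskUsageThreshold = 0.80

// writeAllowed is a hypothetical helper, not apm-server's real implementation.
// A non-zero storageLimit selects the 8.x-style check (db size vs limit);
// a zero storageLimit selects the 9.x default (disk used vs disk total).
func writeAllowed(dbSize, storageLimit, diskUsed, diskTotal uint64) bool {
	if storageLimit > 0 {
		return dbSize < storageLimit
	}
	if diskTotal == 0 {
		// Assumption: with no disk information, do not block writes.
		return true
	}
	return float64(diskUsed)/float64(diskTotal) < diskUsageThreshold
}

func main() {
	const gb = uint64(1) << 30
	fmt.Println(writeAllowed(2*gb, 3*gb, 0, 0))       // 8.x style, under the limit
	fmt.Println(writeAllowed(4*gb, 3*gb, 0, 0))       // 8.x style, over the limit
	fmt.Println(writeAllowed(4*gb, 0, 70*gb, 100*gb)) // 9.x default, 70% of disk used
	fmt.Println(writeAllowed(1*gb, 0, 85*gb, 100*gb)) // 9.x default, 85% of disk used
}
```

Note that in the 9.x default path the result does not depend on dbSize at all, which is why a "db size vs storage limit" plot says little there.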

Aug 11 '25 13:08 carsonip

apm-server.sampling.tail.events.failed_writes is a counter metric that records the number of failed writes, regardless of whether the event is then discarded or indexed directly to ES. If this per-apm-server counter increases, apm-server is running into storage issues. I believe it is good enough for https://github.com/elastic/kibana/issues/226600
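
As a rough illustration of "the counter increases", the Go sketch below compares consecutive samples of the counter. The helper name and the sample values are hypothetical, and the restart handling assumes the counter is process-local and starts again from zero when apm-server restarts.

```go
package main

import "fmt"

// failedWritesDelta is a hypothetical helper: it returns how many failed
// writes happened between two samples of the counter.
// Assumption: a current value lower than the previous one means the process
// restarted, so the whole current value counts as new failures.
func failedWritesDelta(prev, curr uint64) uint64 {
	if curr < prev {
		return curr
	}
	return curr - prev
}

func main() {
	// Made-up samples from one apm-server instance; the drop from 10 to 2
	// stands in for a restart.
	samples := []uint64{0, 0, 3, 3, 10, 2}
	for i := 1; i < len(samples); i++ {
		if d := failedWritesDelta(samples[i-1], samples[i]); d > 0 {
			fmt.Printf("sample %d: %d new failed writes\n", i, d)
		}
	}
}
```

Any non-zero delta for an instance would be the signal to surface.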

Aug 12 '25 12:08 carsonip

That said, for https://github.com/elastic/kibana/issues/226600 , I wonder how kibana would have access to monitoring metrics (think about stack monitoring). If we cannot surface this by default on ECH, I wonder how useful this will be.

Aug 12 '25 12:08 carsonip

In case we move ahead with implementing this, the actual work involves adding the metric in apm-server code and updating the relevant mappings in the ES, integrations and metricbeat repos. See https://github.com/elastic/apm-server/issues/13475 for an example.

Aug 28 '25 16:08 carsonip

As discussed during weekly, we would like to expand the scope of this task and ensure that it is relevant and valuable to both 8.x and 9.x. The description has been updated to reflect that. The actual metrics used and the design are up to the implementer.

Sep 02 '25 16:09 carsonip

As mentioned during the weekly, this story is similar to https://github.com/elastic/apm-server/issues/18084. Cross-posting just to make sure everyone is aware. Depending on how the metric for 9.x is implemented, it might cover one of the cases in https://github.com/elastic/apm-server/issues/18084 (I could be wrong, so feel free to disregard).

Sep 22 '25 15:09 isaacaflores2

we need to remember to add docs as done here

Nov 12 '25 15:11 raultorrecilla

I have a set of draft PRs up, tested manually as outlined in each PR description. The final steps involve hashing out the small details to get CI passing for each.

1. APM Server Self-Monitoring

Data Flow: APM Server → Elasticsearch (no Metricbeat)
Data Structure: Legacy index .monitoring-beats-7-* (no datastream)
APM Server collects and sends its own monitoring data directly to Elasticsearch
Requires monitoring.enabled: true and monitoring.elasticsearch in apm-server.yml config
Required PRs:
- Metrics implementation in APM Server, see https://github.com/elastic/apm-server/pull/19568
- Updates to monitoring-beats.json in ES, see https://github.com/elastic/elasticsearch/pull/138131

2. Stack Monitoring (xpack)

Data Flow: APM Server + Metricbeat (xpack) + Elasticsearch
Data Structure: Datastream .monitoring-beats-8-mb
Required PRs:
- Metrics implementation in APM Server, see https://github.com/elastic/apm-server/pull/19568
- Updates to monitoring-beats-mb.json in ES, depends on https://github.com/elastic/elasticsearch/pull/138131

3. Metricbeat (without xpack)

Data Flow: APM Server → Metricbeat (no xpack) → Elasticsearch
Data Structure: Datastream metricbeat-*
APM Server only needs to expose HTTP endpoints (via http.enabled: true) in config
Required PRs:
- Metrics implementation in APM Server, see https://github.com/elastic/apm-server/pull/19568
- Updates to metricbeat template fields, see https://github.com/elastic/beats/pull/47709

4. EA Integration

Data Flow: APM Server + Elastic Agent + Elasticsearch (no Metricbeat)
Data Structure: Datastream metrics-elastic_agent.apm_server.*
Required PRs:
- Metrics implementation in APM Server, see https://github.com/elastic/apm-server/pull/19568
- Update datastream fields, see https://github.com/elastic/integrations/pull/16560
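
For manual testing of any of the modes above, one option is to read a metric out of a monitoring JSON document by its dotted name. The Go sketch below is only an illustration: the helper is hypothetical, and the payload nesting is an assumed shape, not a captured response from apm-server.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// lookupMetric walks a decoded JSON document along a dotted path such as
// "apm-server.sampling.tail.events.failed_writes" and returns the numeric
// value found there. It is a hypothetical helper for manual checks.
func lookupMetric(doc map[string]any, path string) (float64, bool) {
	var cur any = doc
	for _, key := range strings.Split(path, ".") {
		obj, ok := cur.(map[string]any)
		if !ok {
			return 0, false
		}
		cur, ok = obj[key]
		if !ok {
			return 0, false
		}
	}
	// encoding/json decodes JSON numbers into float64 by default.
	n, ok := cur.(float64)
	return n, ok
}

func main() {
	// Assumed payload shape; the real response may be nested differently.
	payload := []byte(`{"apm-server":{"sampling":{"tail":{"events":{"failed_writes":3}}}}}`)

	var doc map[string]any
	if err := json.Unmarshal(payload, &doc); err != nil {
		panic(err)
	}
	v, ok := lookupMetric(doc, "apm-server.sampling.tail.events.failed_writes")
	fmt.Println(v, ok)
}
```

The same lookup can be pointed at a new metric name once it exists, to confirm it shows up before checking the mappings in each data stream.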

Nov 20 '25 00:11 rubvs

@raultorrecilla the link you shared in https://github.com/elastic/apm-server/issues/15533#issuecomment-3522656909 points back to this issue. Can you update it, please?

Nov 20 '25 00:11 rubvs

Do we backport this change?

Dec 01 '25 03:12 ericywl

Do we backport this change?

Quoting @carsonip from https://github.com/elastic/apm-server/issues/15533#issuecomment-3246025109:

As discussed during weekly, we would like to expand the scope of this task and ensure that it is relevant and valuable to both 8.x and 9.x. The description has been updated to reflect that. The actual metrics used and the design are up to the implementer.

These metrics will be useful for troubleshooting in 8.19 and 9.x (backport to currently active release branches).

Dec 01 '25 07:12 simitt