monitoring: expose storage limit and disk metrics for TBS monitoring

Open carsonip opened this issue 10 months ago • 12 comments

From comment https://github.com/elastic/apm-server/issues/14247#issuecomment-2576116925

8.x and 9.x: As apm-server exposes lsm_size and value_log_size as monitoring metrics, expose the configured storage limit as well. It will then be possible to plot db size vs storage limit, which removes the need to dig into the logs for the configured storage limit.

9.x: Add monitoring metrics to monitor the disk utilization check.

It is up to the implementer to decide what metric to emit, whether it is a combined metric or a few separate metrics. The actual work involves adding this metric in apm-server code and updating the relevant mappings in the ES, integrations and metricbeat repos. See https://github.com/elastic/apm-server/issues/13475 for an example.

carsonip avatar Feb 03 '25 18:02 carsonip

this is important as we would like to add some UI to tell the customers how much disk they are using https://github.com/elastic/kibana/issues/226600

raultorrecilla avatar Jul 31 '25 13:07 raultorrecilla

This is less straightforward than before. In 8.x there is one configured storage limit, and we compare db size against that. However, from 9.0, by default the storage limit is 0, and instead of comparing db size against that, we compare the disk_used vs disk_total so that TBS doesn't write to the last 20% of the disk. It does not involve db size and storage limit. I'm a bit hesitant to plumb these implementation details to monitoring metrics and all the way to the UI. We should rethink what should be exposed.
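For readers following along, the two checks described above can be sketched roughly as below. This is only an illustration, not the apm-server implementation: the function names, the strict `<` comparison, and passing the threshold as 0.8 (inferred from "doesn't write to the last 20% of the disk") are all assumptions.

```go
package main

import "fmt"

// underStorageLimit sketches the 8.x behaviour: the db size is compared
// against the single configured storage limit.
// Hypothetical helper, not an apm-server function.
func underStorageLimit(dbSize, storageLimit uint64) bool {
	return dbSize < storageLimit
}

// underDiskThreshold sketches the 9.x default behaviour: db size and
// storage limit are not involved, only disk_used vs disk_total.
// threshold is the usable fraction of the disk (0.8 is assumed here).
// Hypothetical helper, not an apm-server function.
func underDiskThreshold(diskUsed, diskTotal uint64, threshold float64) bool {
	if diskTotal == 0 {
		// Unknown disk size: treat as not writable in this sketch.
		return false
	}
	return float64(diskUsed) < threshold*float64(diskTotal)
}

func main() {
	const gib = 1 << 30
	// 8.x style: 2 GiB of db data against a 3 GiB limit.
	fmt.Println(underStorageLimit(2*gib, 3*gib))
	// 9.x style: 85 GiB used on a 100 GiB disk with a 0.8 threshold.
	fmt.Println(underDiskThreshold(85*gib, 100*gib, 0.8))
}
```

The sketch shows why a single "db size vs storage limit" plot does not describe the 9.x default: the second check takes entirely different inputs.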

carsonip avatar Aug 11 '25 13:08 carsonip

apm-server.sampling.tail.events.failed_writes is a counter metric that records the number of failed writes, regardless of whether the event is then discarded or indexed directly to ES. If this per-apm-server counter increases, it means apm-server is running into storage issues. I believe it is good enough for https://github.com/elastic/kibana/issues/226600
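As a rough sketch of how a consumer could turn this counter into a signal: compare two successive samples and flag any increase. The helper names are hypothetical, and treating a drop in the counter as a reset (for example after an apm-server restart) is an assumption of this sketch.

```go
package main

import "fmt"

// failedWritesDelta returns how many failed writes happened between two
// samples of the counter. If the counter went backwards, it is assumed
// to have been reset, so the current value is taken as the delta.
// Hypothetical helper for illustration only.
func failedWritesDelta(prev, curr uint64) uint64 {
	if curr < prev {
		return curr
	}
	return curr - prev
}

// hasStorageIssue reports whether the counter increased between samples,
// which per the comment above indicates storage issues.
func hasStorageIssue(prev, curr uint64) bool {
	return failedWritesDelta(prev, curr) > 0
}

func main() {
	fmt.Println(hasStorageIssue(10, 10)) // counter unchanged
	fmt.Println(hasStorageIssue(10, 14)) // counter increased
}
```

Since the counter is per apm-server, the comparison would need to be done per instance rather than on a sum across instances.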

carsonip avatar Aug 12 '25 12:08 carsonip

That said, for https://github.com/elastic/kibana/issues/226600 , I wonder how kibana would have access to monitoring metrics (think about stack monitoring). If we cannot surface this by default on ECH, I wonder how useful this will be.

carsonip avatar Aug 12 '25 12:08 carsonip

In case we move ahead with implementing this, the actual work involves adding this metric in apm-server code and updating the relevant mappings in the ES, integrations and metricbeat repos. See https://github.com/elastic/apm-server/issues/13475 for an example.

carsonip avatar Aug 28 '25 16:08 carsonip

As discussed during weekly, we would like to expand the scope of this task and ensure that it is relevant and valuable to both 8.x and 9.x. The description has been updated to reflect that. The actual metrics used and the design are up to the implementer.

carsonip avatar Sep 02 '25 16:09 carsonip

As mentioned during the weekly, this story is similar to https://github.com/elastic/apm-server/issues/18084. Cross-posting just to make sure everyone is aware. Depending on how the metric for 9.x is implemented, it might cover one of the cases in https://github.com/elastic/apm-server/issues/18084 (I could be wrong, so feel free to disregard).

isaacaflores2 avatar Sep 22 '25 15:09 isaacaflores2

we need to remember to add docs as done here

raultorrecilla avatar Nov 12 '25 15:11 raultorrecilla

I have a set of draft PRs up, tested manually as outlined in each PR description. The final steps involve working out the remaining details to get CI passing for each.

1. APM Server Self-Monitoring

  • Data Flow: APM Server → Elasticsearch (no Metricbeat)
  • Data Structure: Legacy index .monitoring-beats-7-* (no datastream)
  • APM Server collects and sends its own monitoring data directly to Elasticsearch
  • Requires monitoring.enabled: true and monitoring.elasticsearch in apm-server.yml config
  • Required PR:
    • Metrics implementation in APM Server, see https://github.com/elastic/apm-server/pull/19568
    • Updates to monitoring-beats.json in ES, see https://github.com/elastic/elasticsearch/pull/138131

2. Stack Monitoring (xpack)

  • Data Flow: APM Server + Metricbeat (xpack) + Elasticsearch
  • Data Structure: Datastream .monitoring-beats-8-mb
  • Required PR:
    • Metrics implementation in APM Server, see https://github.com/elastic/apm-server/pull/19568
    • Updates to monitoring-beats-mb.json in ES, depends on https://github.com/elastic/elasticsearch/pull/138131

3. Metricbeat (without xpack)

  • Data Flow: APM Server → Metricbeat (no xpack) → Elasticsearch
  • Data Structure: Datastream metricbeat-*
  • APM Server only needs to expose HTTP endpoints (via http.enabled: true) in config
  • Required PR:
    • Metrics implementation in APM Server, see https://github.com/elastic/apm-server/pull/19568
    • Updates to metricbeat template fields, see https://github.com/elastic/beats/pull/47709

4. EA Integration

  • Data Flow: APM Server + Elastic Agent + Elasticsearch (no Metricbeat)
  • Data Structure: Datastream metrics-elastic_agent.apm_server.*
  • Required PR:
    • Metrics implementation in APM Server, see https://github.com/elastic/apm-server/pull/19568
    • Update datastream fields, see https://github.com/elastic/integrations/pull/16560

rubvs avatar Nov 20 '25 00:11 rubvs

@raultorrecilla the link you shared in https://github.com/elastic/apm-server/issues/15533#issuecomment-3522656909 references this issue itself. Can you update it, please?

rubvs avatar Nov 20 '25 00:11 rubvs

Do we backport this change?

ericywl avatar Dec 01 '25 03:12 ericywl

Do we backport this change?

Quoting @carsonip from https://github.com/elastic/apm-server/issues/15533#issuecomment-3246025109:

As discussed during weekly, we would like to expand the scope of this task and ensure that it is relevant and valuable to both 8.x and 9.x. The description has been updated to reflect that. The actual metrics used and the design are up to the implementer.

These metrics will be useful for troubleshooting in 8.19 and 9.x (backport to currently active release branches).

simitt avatar Dec 01 '25 07:12 simitt