monitoring: TBS lsm_size and value_log_size are reported less frequently than other TBS metrics

Open carsonip opened this issue 6 months ago • 5 comments

APM Server version (apm-server version): 9.0

Description of the problem including expected versus actual behavior:

apm-server.sampling.tail.storage.lsm_size and apm-server.sampling.tail.storage.value_log_size are reported less frequently (~15mins) than other TBS metrics e.g. apm-server.sampling.tail.dynamic_service_groups (~1min).

Also, apm-server.sampling.tail.storage.lsm_size and apm-server.sampling.tail.dynamic_service_groups are never reported together in 9.0.

Note that lsm_size and value_log_size are Int64ObservableGauge.

Work to do:

  • Investigate if it affects other Observable metrics, and why they are mutually exclusive with other metrics.
  • After understanding the reason, the fastest fix may be to switch to a non-observable metric

Steps to reproduce:

  • In apm-server.yml set http.enabled: true and enable TBS
  • Observe lsm_size and dynamic_service_groups at stats endpoint (default localhost:5066/stats)
  • (if you don't notice anything wrong) try with metricbeat with beats module scraping apm-server
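
To check the stats output from a script rather than by eye, a small sketch like the one below could help. It assumes the stats endpoint returns nested JSON objects keyed by each dot-separated segment of the metric name, which is an assumption about the response layout and not something confirmed in this issue. `hasMetric` is a hypothetical helper, and the sample document is hand-written, not real apm-server output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// hasMetric reports whether a dotted metric name such as
// "apm-server.sampling.tail.storage.lsm_size" is present in a JSON document,
// assuming each dot-separated segment is a nested object key.
func hasMetric(body []byte, name string) (bool, error) {
	var doc map[string]any
	if err := json.Unmarshal(body, &doc); err != nil {
		return false, err
	}
	var cur any = doc
	for _, seg := range strings.Split(name, ".") {
		obj, ok := cur.(map[string]any)
		if !ok {
			return false, nil // reached a leaf before the path ended
		}
		cur, ok = obj[seg]
		if !ok {
			return false, nil // segment missing at this level
		}
	}
	return true, nil
}

func main() {
	// Hand-written sample for illustration only.
	sample := []byte(`{"apm-server":{"sampling":{"tail":{"dynamic_service_groups":1}}}}`)
	names := []string{
		"apm-server.sampling.tail.dynamic_service_groups",
		"apm-server.sampling.tail.storage.lsm_size",
	}
	for _, name := range names {
		ok, err := hasMetric(sample, name)
		fmt.Println(name, "present:", ok, "err:", err)
	}
}
```

Running the same check against successive responses from localhost:5066/stats would show whether the two metrics ever appear in the same response.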

Provide logs (if relevant):

carsonip avatar Jun 05 '25 13:06 carsonip

apm-server.sampling.tail.storage.lsm_size and apm-server.sampling.tail.storage.value_log_size appear to be the only two asynchronous (observable) TBS metrics which explains the different reporting interval.

My understanding is the meterProvider can be configured with a different interval using something like a PeriodicReader but I do not see such configuration. The only meterProvider I find is here which is using ManualReaders. I cannot find any collection interval in the apm-server codebase.

Since we want some more control on when these two metrics are reported I will work on converting to synchronous measurements.
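
To illustrate the difference in general terms, here is a simplified, stdlib-only model of pull-based collection. It is not the OpenTelemetry API: `observableGauge` and `manualReader` are hypothetical stand-ins. The only point is that an asynchronous instrument produces a value when something triggers a collection, and never on a timer of its own.

```go
package main

import "fmt"

// observableGauge is a simplified stand-in for an asynchronous instrument:
// it stores no value itself and only produces one when its callback runs.
type observableGauge struct {
	name     string
	callback func() int64
}

// manualReader is a simplified stand-in for a pull-based reader: nothing is
// collected on a timer, only when Collect is called.
type manualReader struct {
	gauges []observableGauge
}

// register adds a gauge whose value is computed by cb at collection time.
func (r *manualReader) register(name string, cb func() int64) {
	r.gauges = append(r.gauges, observableGauge{name: name, callback: cb})
}

// Collect invokes every registered callback and returns the observed values.
func (r *manualReader) Collect() map[string]int64 {
	out := make(map[string]int64, len(r.gauges))
	for _, g := range r.gauges {
		out[g.name] = g.callback()
	}
	return out
}

func main() {
	calls := 0
	size := int64(100)

	r := &manualReader{}
	r.register("lsm_size", func() int64 { calls++; return size })

	size = 250 // the underlying value changes, but nothing is reported yet
	fmt.Println("callback calls before Collect:", calls)
	fmt.Println("collected:", r.Collect())
	fmt.Println("callback calls after Collect:", calls)
}
```

In this model the reporting cadence of the gauge is whatever cadence the caller of `Collect` has, which is why the collection trigger is the thing to look for.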

isaacaflores2 avatar Jun 09 '25 21:06 isaacaflores2

which explains the different reporting interval.

My experience is that lsm_size and value_log_size just won't show up in localhost:5066/stats.

Since we want some more control on when these two metrics are reported I will work on converting to synchronous measurements.

++ on converting to synchronous, as long as you confirm that the synchronous measurement is regularly updated by a background goroutine, which should be the case.

the only two asynchronous (observable) TBS metrics

A quick search seems to show there are some other observable non-TBS metrics. If they exhibit the same issue, which I suppose they would, we can tackle them together now or in a follow up PR, or create a separate issue to tackle them with a lower urgency.

carsonip avatar Jun 09 '25 22:06 carsonip

Sounds good. I think we can review the other observable metrics and create another issue. I still need to understand how the current metric interval is determined for apm-server.

A PR is open which takes advantage of an existing storage manager background goroutine.

isaacaflores2 avatar Jun 10 '25 00:06 isaacaflores2

9.0 and main should be fixed but reopening for 8.19

carsonip avatar Jun 11 '25 08:06 carsonip

@carsonip I looked into the other Observable (async) metrics for apm-server. I only see two metrics that do the same thing for the endpoint handlers:

  1. grpc.go: https://github.com/elastic/apm-server/blob/main/internal/beater/otlp/grpc.go#L62-L81
  2. http.go: https://github.com/elastic/apm-server/blob/main/internal/beater/otlp/http.go#L59-L78

The counters are for dropped metrics that are tracked by the apm-data consumer ConsumeMetricsWithResult method: https://github.com/elastic/apm-data/blob/main/input/otlp/metrics.go#L77-L80.

I see two options to convert these two to synchronous metrics:

  1. There is an existing comment that already mentions this (so update the apm-data lib):
// TODO we should add an otel counter metric directly in the
// apm-data consumer, then we could get rid of the callback.
  2. We can have each handler/export report the metric data for each http/rpc request. Both call ConsumeMetricsWithResult, which is the consumer method that updates the unsupportedMetricsDropped count:
  • https://github.com/elastic/apm-server/blob/main/internal/beater/otlp/grpc.go#L118-L121
  • https://github.com/elastic/apm-server/blob/main/internal/beater/otlp/http.go#L123-L125
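
As a rough illustration of the first option, the sketch below increments a counter at the point where a metric is dropped, so no callback has to read the consumer's state later. Everything in it is hypothetical: `counter` stands in for a synchronous counter instrument, `consumer` and `consumeMetrics` are not the apm-data types, and the rule for what counts as unsupported is arbitrary.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter is a hypothetical stand-in for a synchronous counter instrument.
type counter struct{ n atomic.Int64 }

func (c *counter) Add(delta int64) { c.n.Add(delta) }
func (c *counter) Value() int64    { return c.n.Load() }

// consumer is a heavily simplified, hypothetical consumer that owns the
// counter directly instead of exposing a stat for a callback to read.
type consumer struct {
	dropped *counter
}

// consumeMetrics pretends to process a batch and returns how many metrics
// were rejected as unsupported. The counter is incremented right where the
// drop is detected.
func (c *consumer) consumeMetrics(kinds []string) (rejected int64) {
	for _, k := range kinds {
		if k != "gauge" && k != "sum" { // arbitrary rule for the sketch
			rejected++
		}
	}
	if rejected > 0 {
		c.dropped.Add(rejected)
	}
	return rejected
}

func main() {
	c := &consumer{dropped: &counter{}}
	fmt.Println("rejected in batch 1:", c.consumeMetrics([]string{"gauge", "unknown"}))
	fmt.Println("rejected in batch 2:", c.consumeMetrics([]string{"sum", "other", "other"}))
	fmt.Println("total dropped:", c.dropped.Value())
}
```

The per-request return value is still available to the caller for the response, while the running total lives in the counter.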

Let me know what you think

isaacaflores2 avatar Jun 17 '25 23:06 isaacaflores2

  1. There is an existing comment that already mentions this (so update the apm-data lib):

This sounds like the right way to go.

  2. We can have each handler/export report the metric data for each http/rpc request.

It was done only to support the partial success interface defined by spec. Let's keep it clean and specific to this purpose.

Seems that any fixes to this otlp metric will be too late to catch the 9.0.3 train. Do you mind opening another issue to track this, targeting 9.0.4, 9.1.0 and 8.19.0, and closing this issue, which is specific to TBS metrics? Thanks

carsonip avatar Jun 19 '25 17:06 carsonip

The fix for the TBS metrics apm-server.sampling.tail.storage.lsm_size and apm-server.sampling.tail.storage.value_log_size has been merged.

I will open a new issue to track updates for the otlp metrics listed above.

isaacaflores2 avatar Jun 20 '25 22:06 isaacaflores2