monitoring: TBS lsm_size and value_log_size are reported less frequently than other TBS metrics
APM Server version (apm-server version): 9.0
Description of the problem including expected versus actual behavior:
apm-server.sampling.tail.storage.lsm_size and apm-server.sampling.tail.storage.value_log_size are reported less frequently (~15 minutes) than other TBS metrics, e.g. apm-server.sampling.tail.dynamic_service_groups (~1 minute).
Also, apm-server.sampling.tail.storage.lsm_size and apm-server.sampling.tail.dynamic_service_groups are never reported together in 9.0.
Note that lsm_size and value_log_size are Int64ObservableGauge instruments.
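For context, a minimal sketch (not apm-server's actual code) of how observable gauges like these are typically registered with the OpenTelemetry Go metric API; `storageSizes` is a hypothetical stand-in for the storage manager's size accessor, and the callback only runs when a reader collects:

```go
// Illustrative only: how observable gauges like these are typically registered.
// The callback is invoked when a reader collects, not on a schedule of its own.
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// storageSizes is a hypothetical stand-in for the storage manager's
// accessor returning the LSM size and value log size in bytes.
func storageSizes() (lsm, vlog int64) { return 0, 0 }

func registerStorageGauges() error {
	meter := otel.GetMeterProvider().Meter("apm-server.sampling.tail")

	lsmSize, err := meter.Int64ObservableGauge("storage.lsm_size")
	if err != nil {
		return err
	}
	vlogSize, err := meter.Int64ObservableGauge("storage.value_log_size")
	if err != nil {
		return err
	}
	// The observations happen inside this callback, so their cadence is
	// entirely decided by whoever drives collection on the reader.
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		lsm, vlog := storageSizes()
		o.ObserveInt64(lsmSize, lsm)
		o.ObserveInt64(vlogSize, vlog)
		return nil
	}, lsmSize, vlogSize)
	return err
}

func main() { _ = registerStorageGauges() }
```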
Work to do:
- Investigate whether this affects other observable metrics, and why they are never reported together with other metrics.
- Once the cause is understood, the fastest fix may be to switch to a non-observable (synchronous) metric.
Steps to reproduce:
- In apm-server.yml, set `http.enabled: true` and enable TBS
- Observe lsm_size and dynamic_service_groups at the stats endpoint (default localhost:5066/stats)
- (If you don't notice anything wrong) try scraping apm-server with Metricbeat and its beats module
Provide logs (if relevant):
apm-server.sampling.tail.storage.lsm_size and apm-server.sampling.tail.storage.value_log_size appear to be the only two asynchronous (observable) TBS metrics, which explains the different reporting interval.
My understanding is that the MeterProvider can be configured with a different interval using something like a PeriodicReader, but I do not see such configuration. The only MeterProvider I can find is here, and it uses ManualReaders. I cannot find any collection interval in the apm-server codebase.
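For reference, a hedged sketch of the two reader setups being contrasted here (names and the stdout exporter are illustrative, not apm-server's actual wiring): with a ManualReader, observable callbacks only run when something explicitly calls Collect, whereas a PeriodicReader drives collection on its own interval:

```go
// Illustrative only: where a collection interval would (or would not) come from.
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func main() {
	ctx := context.Background()

	// ManualReader: observable callbacks fire only when Collect is called,
	// so the effective interval is whatever cadence the caller happens to use.
	manual := sdkmetric.NewManualReader()
	_ = sdkmetric.NewMeterProvider(sdkmetric.WithReader(manual))
	var rm metricdata.ResourceMetrics
	_ = manual.Collect(ctx, &rm)

	// PeriodicReader: the SDK itself drives collection on a fixed interval.
	exporter, err := stdoutmetric.New()
	if err != nil {
		panic(err)
	}
	periodic := sdkmetric.NewPeriodicReader(exporter, sdkmetric.WithInterval(time.Minute))
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(periodic))
	defer func() { _ = provider.Shutdown(ctx) }()
}
```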
Since we want more control over when these two metrics are reported, I will work on converting them to synchronous measurements.
> which explains the different reporting interval.
My experience is that lsm_size and value_log_size just won't show up in localhost:5066/stats.
> Since we want more control over when these two metrics are reported, I will work on converting them to synchronous measurements.
++ on converting to synchronous, as long as you confirm that the synchronous measurement is regularly updated by a background goroutine, which should be the case.
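Not the actual change, but a minimal sketch of that shape, assuming a recent otel-go metric API with the synchronous Int64Gauge instrument; `storageSizes` and the one-minute ticker are illustrative assumptions:

```go
// Illustrative only: synchronous gauges kept fresh by a background goroutine.
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
)

// storageSizes is a hypothetical accessor for the LSM and value log sizes.
func storageSizes() (lsm, vlog int64) { return 0, 0 }

func runStorageMetricsLoop(ctx context.Context) error {
	meter := otel.GetMeterProvider().Meter("apm-server.sampling.tail")

	lsmSize, err := meter.Int64Gauge("storage.lsm_size")
	if err != nil {
		return err
	}
	vlogSize, err := meter.Int64Gauge("storage.value_log_size")
	if err != nil {
		return err
	}

	ticker := time.NewTicker(time.Minute) // illustrative interval
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// Values are recorded here, on our own schedule, rather than
			// inside a callback whose timing is owned by the metrics reader.
			lsm, vlog := storageSizes()
			lsmSize.Record(ctx, lsm)
			vlogSize.Record(ctx, vlog)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = runStorageMetricsLoop(ctx)
}
```

The trade-off is that freshness is now bounded by the ticker interval rather than by when a reader happens to collect, which is the control over reporting mentioned above.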
> the only two asynchronous (observable) TBS metrics
A quick search seems to show there are some other observable non-TBS metrics. If they exhibit the same issue, which I suppose they would, we can tackle them together now or in a follow-up PR, or create a separate issue to tackle them with lower urgency.
Sounds good. I think we can review the other observable metrics and create another issue. I still need to understand how the current metric interval is determined for apm-server.
A PR is open which takes advantage of an existing storage manager background goroutine.
9.0 and main should be fixed, but reopening for 8.19.
@carsonip I looked into the other Observable (async) metrics for apm-server. I only see two other metrics that follow the same pattern, in the endpoint handlers:
- grpc.go: https://github.com/elastic/apm-server/blob/main/internal/beater/otlp/grpc.go#L62-L81
- http.go: https://github.com/elastic/apm-server/blob/main/internal/beater/otlp/http.go#L59-L78
The counters are for dropped metrics that are tracked by the apm-data consumer's `ConsumeMetricsWithResult` method: https://github.com/elastic/apm-data/blob/main/input/otlp/metrics.go#L77-L80.
I see two options to convert these to synchronous metrics:
- There is an existing comment that already mentions this (so update the apm-data lib):
  `// TODO we should add an otel counter metric directly in the apm-data consumer, then we could get rid of the callback.`
- We can have each handler/exporter report the metric data for each HTTP/gRPC request (see the sketch after this list). Both call `ConsumeMetricsWithResult`, which is the consumer method that updates the `unsupportedMetricsDropped` count:
- https://github.com/elastic/apm-server/blob/main/internal/beater/otlp/grpc.go#L118-L121
- https://github.com/elastic/apm-server/blob/main/internal/beater/otlp/http.go#L123-L125
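For the second option, a hedged sketch of what the per-request accounting could look like on the handler side; the `RejectedDataPoints` field and the counter name are assumptions about the apm-data result shape rather than its confirmed API:

```go
// Illustrative only: per-request accounting of dropped data points on the
// handler side, instead of reading the count back via an observable callback.
package otlpexample

import (
	"context"

	"go.opentelemetry.io/collector/pdata/pmetric"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"

	"github.com/elastic/apm-data/input/otlp"
)

type metricsHandler struct {
	consumer *otlp.Consumer
	dropped  metric.Int64Counter // synchronous counter; the name below is illustrative
}

func newMetricsHandler(consumer *otlp.Consumer) (*metricsHandler, error) {
	counter, err := otel.GetMeterProvider().
		Meter("apm-server.otlp").
		Int64Counter("consumer.unsupported_dropped")
	if err != nil {
		return nil, err
	}
	return &metricsHandler{consumer: consumer, dropped: counter}, nil
}

func (h *metricsHandler) handle(ctx context.Context, m pmetric.Metrics) error {
	result, err := h.consumer.ConsumeMetricsWithResult(ctx, m)
	if err != nil {
		return err
	}
	// Assumed field: the partial-success result is expected to carry a
	// rejected/dropped data point count; adjust to the real apm-data type.
	h.dropped.Add(ctx, result.RejectedDataPoints)
	return nil
}
```

This keeps the accounting synchronous at the cost of touching every handler, which is why the first option (a counter inside the apm-data consumer) may be the cleaner fix.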
Let me know what you think.
> There is an existing comment that already mentions this (so update the apm-data lib):
This sounds like the right way to go.
> We can have each handler/exporter report the metric data for each HTTP/gRPC request.
It was done only to support the partial success interface defined by the spec. Let's keep it clean and specific to that purpose.
It seems that any fix to these OTLP metrics will be too late to catch the 9.0.3 train. Do you mind opening another issue to track this, targeting 9.0.4, 9.1.0 and 8.19.0, and closing this issue, which is specific to the TBS metrics? Thanks
The fix for the TBS metrics (apm-server.sampling.tail.storage.lsm_size and apm-server.sampling.tail.storage.value_log_size) has been merged.
I will open a new issue to track updates for the OTLP metrics listed above.