monitoring: expose storage limit and disk metrics for TBS monitoring
From comment https://github.com/elastic/apm-server/issues/14247#issuecomment-2576116925
8.x and 9.x: As apm-server already exposes lsm_size and value_log_size as monitoring metrics, expose the configured storage limit as well. It will then be possible to plot db size vs storage limit, which removes the need to dig into the logs for the configured storage limit.
9.x: Add monitoring metrics to monitor the disk utilization check.
It is up to the implementer to decide what metric to emit, whether it is a combined metric or a few separate metrics. The actual work involves adding this metric in apm-server code, and updating the relevant mappings in the ES, integrations and metricbeat repos. See https://github.com/elastic/apm-server/issues/13475 for an example.
This is important as we would like to add some UI to tell customers how much disk they are using: https://github.com/elastic/kibana/issues/226600
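To make the "db size vs storage limit" comparison concrete, here is a minimal Go sketch. It assumes db size is the sum of lsm_size and value_log_size, and the `StorageLimit` field is a hypothetical name for the proposed metric; none of these type or function names come from apm-server itself.

```go
package main

import "fmt"

// tbsStorageMetrics is a hypothetical container for the values discussed.
// LSMSize and ValueLogSize mirror the existing lsm_size and value_log_size
// metrics; StorageLimit is an assumed name for the proposed new metric.
type tbsStorageMetrics struct {
	LSMSize      uint64 // bytes
	ValueLogSize uint64 // bytes
	StorageLimit uint64 // bytes; 0 means no limit is configured
}

// dbSize returns the total database size in bytes, assuming it is the sum
// of the two existing size metrics.
func (m tbsStorageMetrics) dbSize() uint64 {
	return m.LSMSize + m.ValueLogSize
}

// utilization returns db size as a fraction of the storage limit.
// ok is false when no limit is configured, because the ratio is undefined.
func (m tbsStorageMetrics) utilization() (ratio float64, ok bool) {
	if m.StorageLimit == 0 {
		return 0, false
	}
	return float64(m.dbSize()) / float64(m.StorageLimit), true
}

func main() {
	m := tbsStorageMetrics{LSMSize: 1 << 30, ValueLogSize: 2 << 30, StorageLimit: 4 << 30}
	if r, ok := m.utilization(); ok {
		fmt.Printf("db size %d bytes, %.0f%% of storage limit\n", m.dbSize(), r*100)
	} else {
		fmt.Println("no storage limit configured")
	}
}
```

Returning `ok == false` for a zero limit keeps a dashboard from plotting a meaningless ratio when no limit is set.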
This is less straightforward than before. In 8.x there is one configured storage limit, and we compare db size against that. However, from 9.0, by default the storage limit is 0, and instead of comparing db size against that, we compare the disk_used vs disk_total so that TBS doesn't write to the last 20% of the disk. It does not involve db size and storage limit. I'm a bit hesitant to plumb these implementation details to monitoring metrics and all the way to the UI. We should rethink what should be exposed.
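The difference between the two checks can be sketched in Go as follows. This is an illustration of the behaviour described above, not the apm-server implementation: the function name, the parameter names, and the exact boundary comparisons are assumptions.

```go
package main

import "fmt"

// diskReserveFraction is the share of the disk that TBS leaves untouched
// (the "last 20%" mentioned above).
const diskReserveFraction = 0.20

// canWrite is a hypothetical helper that mirrors the two checks:
//   - a non-zero storage limit: compare db size against the limit (8.x style)
//   - a zero storage limit (the 9.x default): compare disk usage against
//     the total disk size, ignoring db size entirely
func canWrite(dbSize, storageLimit, diskUsed, diskTotal uint64) bool {
	if storageLimit > 0 {
		return dbSize < storageLimit
	}
	return float64(diskUsed) < (1-diskReserveFraction)*float64(diskTotal)
}

func main() {
	// Explicit limit: only db size and the limit matter.
	fmt.Println(canWrite(3<<30, 4<<30, 0, 0))
	// Default limit of 0: only disk usage matters.
	fmt.Println(canWrite(3<<30, 0, 90<<30, 100<<30))
}
```

The sketch shows why a single "db size vs storage limit" metric does not describe the default 9.x behaviour: with a limit of 0, neither input to that comparison takes part in the decision.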
apm-server.sampling.tail.events.failed_writes is a counter metric that records the number of failed writes, regardless of whether the event is then discarded or indexed directly to ES. If this per-apm-server counter increases, it means apm-server is running into storage issues. I believe it is good enough for https://github.com/elastic/kibana/issues/226600
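As a rough illustration of how a consumer could act on that counter, the Go sketch below compares two successive samples. The function names are hypothetical, and treating a drop in the counter as a reset (for example after a restart) is an assumption rather than documented behaviour.

```go
package main

import "fmt"

// failedWritesDelta returns the number of failed writes between two samples
// of the failed_writes counter. If the counter went down, it is assumed to
// have been reset, and the current value is taken as the delta.
func failedWritesDelta(prev, curr uint64) uint64 {
	if curr < prev {
		return curr
	}
	return curr - prev
}

// hasStorageIssue reports whether any writes failed between the two samples.
func hasStorageIssue(prev, curr uint64) bool {
	return failedWritesDelta(prev, curr) > 0
}

func main() {
	fmt.Println(hasStorageIssue(10, 10)) // counter unchanged
	fmt.Println(hasStorageIssue(10, 25)) // counter increased
}
```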
That said, for https://github.com/elastic/kibana/issues/226600 , I wonder how kibana would have access to monitoring metrics (think about stack monitoring). If we cannot surface this by default on ECH, I wonder how useful this will be.
In case we move ahead with implementing this, the actual work involves adding this metric in apm-server code, and updating the relevant mappings in the ES, integrations and metricbeat repos. See https://github.com/elastic/apm-server/issues/13475 for an example.
As discussed during weekly, we would like to expand the scope of this task and ensure that it is relevant and valuable to both 8.x and 9.x. The description has been updated to reflect that. The actual metrics used and the design are up to the implementer.
As mentioned during the weekly, this story is similar to https://github.com/elastic/apm-server/issues/18084. Cross-posting just to make sure everyone is aware. Depending on how the metric for 9.x is implemented, it might cover one of the cases in https://github.com/elastic/apm-server/issues/18084 (I could be wrong, so feel free to disregard).
We need to remember to add docs as done here
I have a set of draft PRs up, tested manually as outlined in each PR description. The final steps involve hashing out the small details to get CI passing for each.
1. APM Server Self-Monitoring
- Data Flow: APM Server → Elasticsearch (no Metricbeat)
- Data Structure: Legacy index .monitoring-beats-7-* (no datastream)
- APM Server collects and sends its own monitoring data directly to Elasticsearch
- Requires monitoring.enabled: true and monitoring.elasticsearch in apm-server.yml config
- Required PR:
  - Metrics implementation in APM Server, see https://github.com/elastic/apm-server/pull/19568
  - Updates to monitoring-beats.json in ES, see https://github.com/elastic/elasticsearch/pull/138131
2. Stack Monitoring (xpack)
- Data Flow: APM Server + Metricbeat (xpack) + Elasticsearch
- Data Structure: Datastream .monitoring-beats-8-mb
- Required PR:
  - Metrics implementation in APM Server, see https://github.com/elastic/apm-server/pull/19568
  - Updates to monitoring-beats-mb.json in ES, depends on https://github.com/elastic/elasticsearch/pull/138131
3. Metricbeat (without xpack)
- Data Flow: APM Server → Metricbeat (no xpack) → Elasticsearch
- Data Structure: Datastream metricbeat-*
- APM Server only needs to expose HTTP endpoints (via http.enabled: true) in config
- Required PR:
  - Metrics implementation in APM Server, see https://github.com/elastic/apm-server/pull/19568
  - Updates to metricbeat template fields, see https://github.com/elastic/beats/pull/47709
4. EA Integration
- Data Flow: APM Server + Elastic Agent + Elasticsearch (no Metricbeat)
- Data Structure: Datastream metrics-elastic_agent.apm_server.*
- Required PR:
  - Metrics implementation in APM Server, see https://github.com/elastic/apm-server/pull/19568
  - Update datastream fields, see https://github.com/elastic/integrations/pull/16560
@raultorrecilla the link you shared in https://github.com/elastic/apm-server/issues/15533#issuecomment-3522656909 references this issue itself. Can you update it, please?
Do we backport this change?
Quoting @carsonip from https://github.com/elastic/apm-server/issues/15533#issuecomment-3246025109:
As discussed during weekly, we would like to expand the scope of this task and ensure that it is relevant and valuable to both 8.x and 9.x. The description has been updated to reflect that. The actual metrics used and the design are up to the implementer.
These metrics will be useful for troubleshooting in 8.19 and 9.x (backport to currently active release branches).