elasticsearch_exporter
Index lifecycle management execution metrics
I already suggested this in #306, but I closed it before the repo was transferred to prometheus-community. It may be worth opening another pull request, since I believe I'm not the only one who would find these metrics useful.
In my daily routine I'd like to monitor ILM execution stats: how many indices are covered by ILM policies, how many errors there are, etc. This is a simple implementation of that goal which I have used since 7.3.2; I'm now on 7.11.x and it still works. I haven't seen any changes to ILM recently, so I assume it's compatible with any 7.* release and maybe even earlier ones.
Example of metrics available:
elasticsearch_ilm_index_status{action="rollover",index="foo_2",phase="hot",step="check-rollover-ready"} 1
elasticsearch_ilm_index_status{action="shrink",index="foo_3",phase="warm",step="shrunk-shards-allocated"} 1
elasticsearch_ilm_index_status{action="complete",index="foo_4",phase="warm",step="complete"} 1
elasticsearch_ilm_index_status{action="complete",index="foo_5",phase="hot",step="complete"} 1
elasticsearch_ilm_index_status{action="complete",index="foo_6",phase="new",step="complete"} 1
elasticsearch_ilm_index_status{action="",index="foo_7",phase="",step=""} 0
The numeric value indicates whether the index is covered by an ILM policy at all (in the example above, index foo_7 has no policy attached; the others have one). The labels are simply taken from the _all/_ilm/explain API result.
Hi, may we ask when this will be merged, please? It's not a lot of code, so it could be reviewed quite quickly, and these metrics will be very useful. Thank you, and kudos @mokrinsky :1st_place_medal:
Is this not missing labels such as cluster?
@tgrondier yes, it is actually missing them. A cluster label is added by default in my Prometheus environment, so I didn't notice its absence in the exporter. Gonna fix soon.
Hi 👋
Any news on this PR?
@mokrinsky will you finish the work needed to get it ready for merging?
Also interested if this is going to be picked up & finished off
We've pulled this change into our fork and it does do the job. I think there's room for improvement when it comes to using these metrics for alerting, specifically around actions that can be retried:
- The metric does not tell you whether an action can be retried, and if so how many retries have been attempted
- When an action is retrying, the Error metric disappears while the action is retried
Both of these factors make it a bit more difficult to alert on. For our case we'd want to alert on:
- A failed action that is not retriable
- A failed action that is retriable but has already failed n retries
Not sure exactly what the metrics would look like for this. It's difficult as the ILM explain API itself hides the error state when the action is retrying.
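One possible starting point, as a rough Python sketch rather than anything this PR implements: classify each entry of the explain response into a coarse state that a metric could then expose. It assumes the entry carries the "is_auto_retryable_error" and "failed_step_retry_count" fields alongside "step"; whether those are present (and stay populated while a retry is in flight) depends on the Elasticsearch version and would need checking. The function name, state names and the max_retries threshold are all hypothetical.

```python
def classify_ilm_failure(info: dict, max_retries: int = 3) -> str:
    """Map one index entry from the ILM explain response to an alerting state.

    Returns one of: "unmanaged", "ok", "retrying",
    "retries_exhausted", "failed_not_retryable".
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    if not info.get("managed", False):
        return "unmanaged"

    in_error = info.get("step") == "ERROR"
    # Assumed fields: missing values are treated as "not retryable" / zero retries.
    retryable = bool(info.get("is_auto_retryable_error", False))
    retry_count = int(info.get("failed_step_retry_count") or 0)

    if in_error and not retryable:
        return "failed_not_retryable"
    if retry_count >= max_retries:
        return "retries_exhausted"
    if in_error or retry_count > 0:
        return "retrying"
    return "ok"


if __name__ == "__main__":
    entry = {
        "managed": True,
        "step": "ERROR",
        "is_auto_retryable_error": True,
        "failed_step_retry_count": 5,
    }
    print(classify_ilm_failure(entry, max_retries=3))
```

The two states we'd alert on are "failed_not_retryable" and "retries_exhausted". Checking the retry count independently of the ERROR step is a deliberate choice, so that an index is still flagged if the count remains visible while the error state is hidden during a retry.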