elasticsearch_exporter Index lifecycle management execution metrics

I already suggested this in #306, but I closed that PR before the repo was transferred to prometheus-community. It may be worth opening another pull request, because I believe I'm not the only one who will find these metrics useful.

In my daily routine I'd like to monitor ILM execution stats: how many indices are covered by ILM policies, how many errors have occurred, etc. This is a simple implementation of that goal which I have used since 7.3.2; I'm now on 7.11.x and it still works. I haven't seen any changes to ILM recently, so I assume it's compatible with any 7.* release and maybe even earlier ones.

Example of metrics available:

elasticsearch_ilm_index_status{action="rollover",index="foo_2",phase="hot",step="check-rollover-ready"} 1
elasticsearch_ilm_index_status{action="shrink",index="foo_3",phase="warm",step="shrunk-shards-allocated"} 1
elasticsearch_ilm_index_status{action="complete",index="foo_4",phase="warm",step="complete"} 1
elasticsearch_ilm_index_status{action="complete",index="foo_5",phase="hot",step="complete"} 1
elasticsearch_ilm_index_status{action="complete",index="foo_6",phase="new",step="complete"} 1
elasticsearch_ilm_index_status{action="",index="foo_7",phase="",step=""} 0

The numeric value indicates whether the index is covered by an ILM policy at all (in the example above, index foo_7 has no policy attached; the others have one). The labels are taken directly from the _all/_ilm/explain API response.
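For readers who want to see the mapping spelled out, here is a minimal sketch in Python (not the exporter's actual Go code) of how an explain response could be turned into the sample lines above. The function name `ilm_status_lines` is hypothetical, and the sketch assumes the response has the usual `indices` object whose entries carry `managed`, `phase`, `action` and `step` fields. Label values are not escaped here, which a real exporter would need to handle.

```python
def ilm_status_lines(explain):
    """Render one elasticsearch_ilm_index_status line per index.

    `explain` is the parsed JSON body of GET _all/_ilm/explain
    (assumed shape: {"indices": {name: {"managed": bool, ...}}}).
    """
    lines = []
    indices = explain.get("indices", {})
    for name in sorted(indices):
        entry = indices[name]
        managed = bool(entry.get("managed", False))
        # Unmanaged indices get empty labels and a value of 0.
        labels = {
            "action": entry.get("action", "") if managed else "",
            "index": name,
            "phase": entry.get("phase", "") if managed else "",
            "step": entry.get("step", "") if managed else "",
        }
        label_str = ",".join(f'{key}="{value}"' for key, value in labels.items())
        lines.append(f"elasticsearch_ilm_index_status{{{label_str}}} {int(managed)}")
    return lines


if __name__ == "__main__":
    sample = {
        "indices": {
            "foo_2": {"managed": True, "phase": "hot",
                      "action": "rollover", "step": "check-rollover-ready"},
            "foo_7": {"managed": False},
        }
    }
    print("\n".join(ilm_status_lines(sample)))
```

Because the value is 1 for every managed index, counting covered indices in PromQL would be something like `sum(elasticsearch_ilm_index_status)`.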

Jul 18 '21 21:07 mokrinsky

Hi, can we ask when this will be merged, please? It's not a lot of code, so it could be reviewed quite quickly, and these metrics would be very useful. Thank you, and kudos @mokrinsky 🥇

Aug 26 '21 12:08 wojtas911

Isn't this missing labels such as cluster?

Oct 05 '21 08:10 tgrondier

@tgrondier yes, it is missing them. My Prometheus environment adds a cluster label by default, so I didn't notice its absence in the exporter. I'll fix it soon.

Oct 05 '21 11:10 mokrinsky

Hi 👋

Any news on this PR?

@mokrinsky will you finish the work needed to get it ready to merge?

Feb 14 '22 21:02 paulojmdias

Also interested if this is going to be picked up & finished off

We've pulled this change into our fork and it does do the job. I think there's room for improvement when it comes to using these metrics for alerting, specifically around actions that can be retried:

- The metric does not tell you whether an action can be retried and, if so, how many retries have been attempted
- While an action is being retried, the Error metric disappears

Both of these factors make it a bit more difficult to alert on. For our case we'd want to alert on:

- A failed action that is not retriable
- A failed action that is retriable but has failed n retries

Not sure exactly what the metrics would look like for this. It's difficult as the ILM explain API itself hides the error state when the action is retrying.
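One possible shape for the decision, as a rough Python sketch rather than a proposal for the exporter itself. It assumes the explain response carries `is_auto_retryable_error` and `failed_step_retry_count` next to `step` and `failed_step`, and that the retry count stays visible while the step is being retried; both assumptions should be checked against the Elasticsearch version in use. `should_alert` and `max_retries` are hypothetical names.

```python
def should_alert(entry, max_retries=3):
    """Decide whether one index's ILM explain entry warrants an alert.

    Assumed fields (verify for your Elasticsearch version):
      step                     -- "ERROR" when the index is in the error step
      is_auto_retryable_error  -- True if ILM will retry the failed step
      failed_step_retry_count  -- number of retries attempted so far
    """
    if not entry.get("managed", False):
        return False  # no policy attached, nothing to alert on

    in_error = entry.get("step") == "ERROR"
    retryable = bool(entry.get("is_auto_retryable_error", False))
    retries = int(entry.get("failed_step_retry_count", 0))

    # Case 1: a failed action that is not retriable.
    if in_error and not retryable:
        return True
    # Case 2: a retriable action that has already failed max_retries times,
    # even if the step is currently being retried and is no longer "ERROR".
    if retries >= max_retries:
        return True
    return False
```

If this holds up, exposing the retry count as its own gauge would let the threshold n live in the alerting rule instead of in the exporter.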

May 12 '22 14:05 Evesy
