elasticsearch_exporter

Index lifecycle management execution metrics

Open mokrinsky opened this issue 3 years ago • 5 comments

I already suggested this in #306, but I closed it before the repo was transferred to prometheus-community. It may be worth opening another pull request, because I believe I'm not the only one who would find these metrics useful.

Basically, in my daily routine I'd like to monitor ILM execution stats: how many indexes are covered by ILM policies, how many errors I've got, etc. This is a simple implementation of that goal which I have been using since 7.3.2; I'm now on 7.11.x and it still works. I haven't seen any changes to ILM recently, so I assume it's compatible with any 7.* release and maybe even earlier ones.

Example of metrics available:

elasticsearch_ilm_index_status{action="rollover",index="foo_2",phase="hot",step="check-rollover-ready"} 1
elasticsearch_ilm_index_status{action="shrink",index="foo_3",phase="warm",step="shrunk-shards-allocated"} 1
elasticsearch_ilm_index_status{action="complete",index="foo_4",phase="warm",step="complete"} 1
elasticsearch_ilm_index_status{action="complete",index="foo_5",phase="hot",step="complete"} 1
elasticsearch_ilm_index_status{action="complete",index="foo_6",phase="new",step="complete"} 1
elasticsearch_ilm_index_status{action="",index="foo_7",phase="",step=""} 0

The numeric value indicates whether the given index is covered by an ILM policy at all (in the example above, index foo_7 has no policy attached; the others have one). Everything else in the labels is taken directly from the _all/_ilm/explain API result.
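To make the mapping concrete, here is a minimal Python sketch of how an _all/_ilm/explain response could be turned into the sample lines shown above. It is not the exporter's actual implementation (the exporter itself is written in Go); the function name `ilm_status_lines` and the hard-coded `sample` response are hypothetical, and the sketch assumes the response has an `indices` object whose entries carry `managed`, `phase`, `action` and `step`.

```python
def ilm_status_lines(explain):
    """Render one elasticsearch_ilm_index_status sample per index.

    `explain` is assumed to be the parsed JSON body of GET _all/_ilm/explain.
    """
    lines = []
    for name, info in sorted(explain.get("indices", {}).items()):
        managed = bool(info.get("managed", False))
        # Unmanaged indexes get empty labels and a value of 0.
        labels = {
            "action": info.get("action", "") if managed else "",
            "index": name,
            "phase": info.get("phase", "") if managed else "",
            "step": info.get("step", "") if managed else "",
        }
        rendered = ",".join(f'{key}="{value}"' for key, value in labels.items())
        lines.append(f"elasticsearch_ilm_index_status{{{rendered}}} {int(managed)}")
    return lines


# Hypothetical, trimmed-down explain response used only for illustration.
sample = {
    "indices": {
        "foo_2": {
            "index": "foo_2",
            "managed": True,
            "phase": "hot",
            "action": "rollover",
            "step": "check-rollover-ready",
        },
        "foo_7": {"index": "foo_7", "managed": False},
    }
}

for line in ilm_status_lines(sample):
    print(line)
```

A real exporter would also need to escape label values and, as raised later in this thread, add a cluster label.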

mokrinsky Jul 18 '21 21:07

Hi, can we ask when this will be merged, please? It's not a lot of code, so it could be reviewed quite quickly, and these metrics would be very useful. Thank you, and kudos @mokrinsky 🥇

wojtas911 Aug 26 '21 12:08

Isn't this missing labels such as cluster?

tgrondier Oct 05 '21 08:10

@tgrondier yes, they are indeed missing. My Prometheus environment adds a cluster label by default, so I didn't notice its absence in the exporter. I'll fix it soon.

mokrinsky Oct 05 '21 11:10

Hi 👋

Any news on this PR?

@mokrinsky will you finish the work needed to get it ready for merging?

paulojmdias Feb 14 '22 21:02

I'm also interested in whether this is going to be picked up and finished off.

We've pulled this change into our fork and it does do the job. I think there's room for improvement when it comes to using these metrics for alerting, specifically around actions that can be retried:

  • The metric does not tell you whether an action can be retried, and if so how many retries have been attempted
  • When an action is retrying, the Error metric disappears while the action is retried

Both of these factors make it a bit more difficult to alert on. For our case we'd want to alert on:

  • A failed action that is not retriable
  • A failed action that is retriable but has already failed n retries

Not sure exactly what the metrics would look like for this. It's difficult as the ILM explain API itself hides the error state when the action is retrying.
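One possible shape for the two alert conditions above, as a hedged Python sketch rather than a proposal for the exporter's metrics. It assumes that an explain entry for a failed index reports `step` as `ERROR` and carries the fields `is_auto_retryable_error` and `failed_step_retry_count`; whether those fields are present depends on the Elasticsearch version, so check the explain output of your own cluster first. The function name `should_alert` and the `max_retries` threshold are hypothetical.

```python
def should_alert(info, max_retries=3):
    """Decide whether one ILM explain entry warrants an alert.

    Assumed fields (verify against your cluster's explain output):
    managed, step, is_auto_retryable_error, failed_step_retry_count.
    """
    if not info.get("managed", False):
        return False  # no policy attached, nothing can fail
    if info.get("step") != "ERROR":
        return False  # not currently reported as failed
    if not info.get("is_auto_retryable_error", False):
        return True  # failed and will not be retried automatically
    # Retriable: alert only once the retry budget is used up.
    return info.get("failed_step_retry_count", 0) >= max_retries


# Hypothetical entries used only for illustration.
not_retriable = {"managed": True, "step": "ERROR", "is_auto_retryable_error": False}
still_retrying = {
    "managed": True,
    "step": "ERROR",
    "is_auto_retryable_error": True,
    "failed_step_retry_count": 1,
}
print(should_alert(not_retriable), should_alert(still_retrying))
```

This only covers entries that the API actually reports as failed. Since the error state is hidden while an action is being retried, an index in the middle of a retry would not trigger this check, so the retry count would have to be exported as its own metric to close that gap.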

Evesy May 12 '22 14:05