Expose effective watermark thresholds via APIs
The effective low/high/flood watermarks that apply to the cluster depend on both the threshold settings and the max_headroom settings. To make it easier to see what the effective values are, we could calculate and expose the thresholds under the `_nodes/stats` and/or `_cat/allocation` APIs.
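For context, the relationship is roughly the following (a minimal sketch assuming the documented max_headroom semantics; names here are illustrative, not the actual Elasticsearch internals):

```java
// Minimal sketch of how an effective threshold follows from a percentage
// watermark plus a max_headroom setting such as
// cluster.routing.allocation.disk.watermark.high.max_headroom.
static long effectiveThresholdBytes(long totalBytes, double watermarkPercent, long maxHeadroomBytes) {
    // the percentage watermark asks for (100 - pct)% of the disk to stay free...
    long freeRequiredByPercent = (long) (totalBytes * (1.0 - watermarkPercent / 100.0));
    // ...but max_headroom caps how much free space is required on large disks
    long freeRequired = Math.min(freeRequiredByPercent, maxHeadroomBytes);
    // effective "used bytes" threshold at which the watermark trips
    return totalBytes - freeRequired;
}
```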
Pinging @elastic/es-distributed (Team:Distributed)
Hi @pxsalehi, do we need to add extra disk watermark thresholds to the `_cat/allocation` API? Since the thresholds would be the same for each node, is it necessary to add them to the following default columns? `_cat/allocation`:

```
shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
16     5.7tb        5.9tb     958.9gb    6.9tb      86           127.0.0.1 127.0.0.1 node-1
```
I think the idea is, for example, to have a new column for each watermark (also available in `_nodes/stats`) that would provide the effective watermark in %. E.g. if the default watermark is 90% but the max_headroom is what decides the watermark, calculate the effective value based on that; for a large disk it might be 99%.
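As a worked example of that calculation (assuming the documented defaults of a 95% flood-stage watermark with a 100gb max_headroom; variable names are illustrative):

```java
// 10tb disk with flood_stage watermark 95% and flood_stage.max_headroom 100gb
long total = 10_240L << 30;                                    // 10tb in bytes
long freeByPercent = (long) (total * 0.05);                    // 512gb must stay free per the percentage
long freeRequired = Math.min(freeByPercent, 100L << 30);       // headroom caps it at 100gb
double effectivePct = 100.0 * (total - freeRequired) / total;  // ≈ 99.0%, not 95%
```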
I expect this'll take several PRs to completely address. I'd suggest exposing the raw watermark numbers in `GET _nodes/stats` first, and then we can think about adding columns to `GET _cat/allocation` in a follow-up PR.
It will certainly be useful to display the actual watermarks as percentage values, but in many cases they're all going to come out as 99%, which isn't really very helpful. We can add some decimal places, but IME folks really want to know the size of the gap (as a bytes value) between the actual disk usage and each of the three watermarks. This is going to be a little tricky since today `ByteSizeValue` doesn't support negative sizes, and yet here we need some way to represent being both under and over each watermark. Not impossible at all, just a little more complex than it might first appear, bearing in mind that we must integrate properly with the `?s=` and `?bytes=` query parameters. I might suggest adding the percentages in one PR and then thinking about these more useful columns in another.
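To illustrate the representation problem, here's a rough sketch of the kind of signed rendering such a column would need (a hypothetical helper working on raw longs, precisely because `ByteSizeValue` rejects negative sizes today):

```java
// Hypothetical helper: format the gap between current usage and a watermark
// as a signed human-readable value. Negative means the watermark is exceeded.
static String formatWatermarkGap(long usedBytes, long thresholdBytes) {
    long gap = thresholdBytes - usedBytes; // positive: headroom left; negative: over
    String sign = gap < 0 ? "-" : "";
    double v = Math.abs(gap);
    // naive human-readable rendering; a real column would also need to sort
    // correctly under ?s= and honour the ?bytes= unit parameter
    String[] units = {"b", "kb", "mb", "gb", "tb"};
    int u = 0;
    while (v >= 1024 && u < units.length - 1) {
        v /= 1024;
        u++;
    }
    return String.format("%s%.1f%s", sign, v, units[u]);
}
```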
Appreciate the help in https://github.com/elastic/elasticsearch/pull/107244, @DaveCTurner. So the next step is to add extra threshold columns to `_cat/allocation`?
Yep that's right
Hi @DaveCTurner, if a node has multiple disk paths, then `_cat/allocation` only shows the max low/high/flood watermark byte size value, right?
I don't think there's a good way to represent these values on a node-by-node basis if multiple data paths are in play. I think it would be best to just display something like `<multiple>` in that case.
However, I would like us to have a column indicating the watermark status of each node, `NONE`/`LOW`/`HIGH`/`FLOOD`, and also columns showing how far below each watermark each node is (taking the minimum value if there are multiple data paths). That should make sense. But it's a little tricky to do this because a node which is above a watermark would need to represent this as a negative number, and `ByteSizeValue` doesn't support negative values, so some extra work is needed to get this right.
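A sketch of the aggregation being described, with hypothetical types (not actual Elasticsearch code):

```java
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical per-data-path stats for one node; the thresholds are the
// effective per-path watermark values in bytes.
record PathStats(long usedBytes, long lowBytes, long highBytes, long floodBytes) {}

// For each watermark, take the minimum gap across the node's data paths.
// A negative result means that watermark is exceeded on at least one path.
static long minGap(List<PathStats> paths, ToLongFunction<PathStats> threshold) {
    return paths.stream()
        .mapToLong(p -> threshold.applyAsLong(p) - p.usedBytes())
        .min()
        .orElseThrow();
}
// e.g. minGap(paths, PathStats::highBytes) < 0  =>  high watermark exceeded
```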
I think it's better to show all three thresholds; this would give a global perspective. I created an example table for the `_cat/allocation` result:
shards | disk.indices | disk.used | disk.avail | disk.total | disk.percent | low_wm | high_wm | flood_wm | host | ip | node |
---|---|---|---|---|---|---|---|---|---|---|---|
117 | 2.7tb | 6.7tb | 0.2tb | 6.9tb | 81 | 340.5gb | 560.5gb | 750.7gb | 127.0.0.1 | 127.0.0.1 | data-node1 |
117 | 2.7tb | 6.8tb | 0.1tb | 6.9tb | 87 | -38.7gb (above low) | 230.8gb | 450.7gb | 127.0.0.1 | 127.0.0.1 | data-node2 |
117 | 2.7tb | 3.2gb | 3.5gb | 7.0gb | 51 | - | - | - | 127.0.0.2 | 127.0.0.2 | master-node |
Yes, sorry, to clarify I think we should add seven new columns (names TBD but something like this):

- `disk.watermark.low.threshold`, `disk.watermark.high.threshold`, `disk.watermark.flood_stage.threshold`: the raw values as calculated in #107244, as a regular `ByteSizeValue` column, except if the node has multiple data paths then show a placeholder like `<multiple>`.
- `disk.watermark.low.avail`, `disk.watermark.high.avail`, `disk.watermark.flood_stage.avail`: the available space between the current disk usage and the relevant watermark, as a signed `ByteSizeValue` column (negative meaning the watermark has been exceeded), taking the minimum across data paths if the node has multiple.
- `disk.watermark.exceeded`: a column containing a string `NONE`, `LOW`, `HIGH` or `FLOOD_STAGE` indicating at a glance which of the watermarks have been exceeded on each node (where sorting on this column should order the `NONE` rows before the `LOW` rows, then the `HIGH` rows, and finally the `FLOOD_STAGE` rows).
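A hypothetical sketch of how the last column and the `<multiple>` placeholder might be derived (all names illustrative, nothing here is final):

```java
import java.util.List;

// Declaring the constants in severity order means a natural (ordinal) sort
// yields exactly the ordering described above: NONE, then LOW, HIGH, FLOOD_STAGE.
enum WatermarkExceeded { NONE, LOW, HIGH, FLOOD_STAGE }

// gaps are signed byte counts; negative means that watermark is exceeded
static WatermarkExceeded exceeded(long lowGap, long highGap, long floodGap) {
    if (floodGap < 0) return WatermarkExceeded.FLOOD_STAGE;
    if (highGap < 0) return WatermarkExceeded.HIGH;
    if (lowGap < 0) return WatermarkExceeded.LOW;
    return WatermarkExceeded.NONE;
}

// the raw threshold rendered as a single cell, or a placeholder when the node
// has more than one data path
static String thresholdColumn(List<String> perPathThresholds) {
    return perPathThresholds.size() == 1 ? perPathThresholds.get(0) : "<multiple>";
}
```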
Support negative size in `ByteSizeValue`: https://github.com/elastic/elasticsearch/pull/107988.