elasticsearch
Disk Usage health indicator
Create a disk usage indicator that reports to users when their cluster is running out of space and what impact this has on its function. We propose the following health statuses and their interpretations:
Status | Meaning | Implementation |
---|---|---|
RED | The disk is running out of space on at least one node, or writes are blocked because of limited disk space. | At least one node is above the flood-stage watermark, or at least one index is blocked by READ_ONLY_ALLOW_DELETE_BLOCK. |
YELLOW | There is increased disk usage on at least one node. | At least one data node is above the high watermark with no relocating shards, or a non-data node is above the high watermark.* |
GREEN | All good, nothing Elasticsearch cannot handle. :) | If none of the above apply. |
Implementation details
The collection of the data should be done using the persistent tasks framework.
Nodes will listen to cluster state changes for the allocation of the "health persistent task" and push their initial status. After initialization, nodes will only push changes to their state (i.e. when they change from RED to YELLOW).
The allocated persistent task should be prepared to delay a potential initial health request if the request arrives before the task has had a chance to receive the statuses from the nodes.
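The push-on-change behavior described above can be sketched as follows. This is a minimal illustration, not the actual Elasticsearch implementation; the class and method names are hypothetical, and the transport mechanism is abstracted into a callback.

```python
# Hypothetical sketch: a node remembers the last disk-health status it
# reported and only pushes an update to the health node when it changes.

class DiskHealthReporter:
    def __init__(self, send_to_health_node):
        # Callback standing in for the real transport action (assumption).
        self._send = send_to_health_node
        self._last_status = None

    def on_health_task_allocated(self, current_status):
        # Initial push when the health persistent task is (re)allocated.
        self._last_status = current_status
        self._send(current_status)

    def on_disk_check(self, new_status):
        # After initialization, push only deltas (e.g. RED -> YELLOW).
        if new_status != self._last_status:
            self._last_status = new_status
            self._send(new_status)
```

With this shape, a node that stays GREEN between checks generates no traffic at all, which is the point of pushing deltas instead of polling every node on each health request.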
- [x] Introduce the persistent task (https://github.com/elastic/elasticsearch/pull/86131)
- [x] Propagate disk usage thresholds and watermarks to all nodes (https://github.com/elastic/elasticsearch/pull/88175)
- [x] Introduce thresholds for non-data nodes (parked for now; we want to see if reusing the flood-stage and high watermarks is good enough)
- [x] Monitor a node's disk usage health (https://github.com/elastic/elasticsearch/pull/88390)
- [x] [Health node] Cache each node's disk usage health (https://github.com/elastic/elasticsearch/pull/89275)
- [x] The coordinating node retrieves the health info from the health node (https://github.com/elastic/elasticsearch/pull/89820, https://github.com/elastic/elasticsearch/pull/89947)
- [x] Use the retrieved disk usage health info and the blocked indices from the cluster state (if they exist) to compute the indicator (https://github.com/elastic/elasticsearch/pull/90041)
- [ ] Remove the feature flag (https://github.com/elastic/elasticsearch/pull/90085)
- [ ] Write troubleshooting doc & document the new settings
Pinging @elastic/es-data-management (Team:Data Management)
We already collect disk information from every node in the cluster using the InternalClusterInfoService, which updates every 30 seconds. Is there a reason why we need an additional task framework for watching the usages, rather than gathering them on demand from the ClusterInfo object?
@dakrone this is just the initial metric we'll be exposing (and we're building the infrastructure as part of it). Other metrics we'd report on in the future include network disruptions, CPU usage thresholds being tripped, JVM-level information, etc.
We don't want to burden the master node further with collecting this information.
I think we might need more nuance here.
The goal of the disk-based shard allocator is really to keep disk usage below the high watermark on each node, so I don't think we should be reporting ill health (i.e. requiring action) if a single node exceeds the low watermark. Action starts to become important if too many nodes are above the low watermark (resulting in unassigned replicas or shards that cannot be migrated between tiers). We can perhaps look ahead a bit, e.g. warning if there are <(#replicas+1) nodes below the low watermark in any given tier since that will cause problems at the next rollover/migration.
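The look-ahead check suggested above can be made concrete with a small sketch. Everything here is an assumption for illustration (the function name, input shapes, and the 85% low-watermark default, which would really come from the master's settings): a tier is flagged when fewer than (#replicas + 1) of its nodes are below the low watermark, since the next rollover/migration could not then place all shard copies.

```python
# Illustrative sketch (hypothetical names): flag tiers where a future
# rollover/migration would be unable to place every shard copy.

def tiers_at_risk(nodes_by_tier, replicas_by_tier):
    """nodes_by_tier: tier -> list of disk-usage fractions, one per node.
    replicas_by_tier: tier -> configured replica count for indices in it."""
    LOW_WATERMARK = 0.85  # assumed default; normally from cluster settings

    at_risk = []
    for tier, usages in nodes_by_tier.items():
        below_low = sum(1 for usage in usages if usage < LOW_WATERMARK)
        # Need one node per copy: primary plus each replica.
        if below_low < replicas_by_tier.get(tier, 0) + 1:
            at_risk.append(tier)
    return at_risk
```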
Similarly the disk-based shard allocator doesn't always achieve its goal so occasional breaches of the high watermark are expected. Breaching the high watermark is necessary to trigger shard movements to address the disk usage. Again, no action is needed from the user unless the ongoing shard movements won't fix the problem. See e.g. DiskThresholdMonitor#nodesOverHighThresholdAndRelocating
and the associated logging.
(Also I assume the watermark numbers here won't be these literal numbers and will instead come from the master's config)
@DaveCTurner I do agree that we need something a bit more complex than simply alerting on the watermarks. I would say that the golden rule of alerting is that it needs to be actionable. Otherwise the alerts will soon be ignored by the users.
I will try to learn a bit more about what disk usage entails for Elasticsearch and in which cases the user can do something about it, such as scaling or deleting indices, and I will report back here.
Requirement update
We took all the concerns above into consideration and came up with the following requirements, which we believe strike a good balance between reduced complexity and usefulness.
Status | Meaning | Implementation |
---|---|---|
RED for data nodes | The disk is running out of space on at least one data node and writes are blocked because of this. | At least one data node is above the flood-stage watermark, or at least one index is blocked by READ_ONLY_ALLOW_DELETE_BLOCK. |
RED for non-data nodes | The disk is running out of space and that might cause issues. | At least one non-data node is above the red threshold.* |
YELLOW | There is increased disk usage on at least one node. | At least one data node is above the high watermark with no relocating shards, or at least one non-data node is above the yellow threshold.* |
GREEN | All good, nothing Elasticsearch cannot handle. :) | If none of the above apply. |
\* The red and yellow thresholds are introduced as part of this proposal.
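The decision table above can be sketched as a single function. This is a minimal sketch, not the actual implementation: the function name and input shapes are invented, and the numeric thresholds are placeholder defaults that would really come from cluster settings.

```python
# Hypothetical sketch of the decision table: inputs and thresholds are
# assumptions, not the real Elasticsearch code.

def disk_indicator(data_nodes, non_data_nodes, has_blocked_index):
    """data_nodes: list of dicts with 'usage' (fraction) and 'relocating' (bool).
    non_data_nodes: list of disk-usage fractions.
    has_blocked_index: whether any index carries READ_ONLY_ALLOW_DELETE_BLOCK."""
    FLOOD, HIGH = 0.95, 0.90        # data-node watermarks (assumed defaults)
    RED_T, YELLOW_T = 0.95, 0.90    # non-data-node thresholds (this proposal)

    if has_blocked_index or any(n["usage"] >= FLOOD for n in data_nodes):
        return "RED"    # writes blocked or a data node above flood stage
    if any(usage >= RED_T for usage in non_data_nodes):
        return "RED"    # a non-data node above the red threshold
    if any(n["usage"] >= HIGH and not n["relocating"] for n in data_nodes):
        return "YELLOW"  # above high watermark with no relocation underway
    if any(usage >= YELLOW_T for usage in non_data_nodes):
        return "YELLOW"
    return "GREEN"
```

Note how the relocating-shards exemption encodes the earlier point that breaching the high watermark is expected while the allocator is actively moving shards away.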
I will update the description of the issue soon to reflect the above.