elasticsearch
elasticsearch copied to clipboard
Don't stop checking if the `HealthNode` persistent task is present
We assumed that once the HealthNode
persistent task is registered, we won't need to register it again. However, when, for instance, we restore from a snapshot (including cluster state) that was created in version <= 8.4.3, that task doesn't exist yet, which will result in the task being removed after the restore. By keeping the listener active, we will re-add the task after such a restore (or any other potential situation where the task might get deleted).
Fixes #98926
Pinging @elastic/es-data-management (Team:Data Management)
Hi @nielsbauman, I've created a changelog YAML for you.
@gmarouli what kind of test did you have in mind? I thought about testing as well but couldn't really come up with a good way to test this (but forgot to write that in the PR). I think that
- unit testing is going to be a bit hard here, as there wasn't really anything wrong with the "unit" (i.e. the
HealthNodeTaskExecutor
class/runTask
method), and testing thatremoveListener
is not called doesn't make a lot of sense to me. - adding a test here (David showed me how) would allow us to test this, but it feels a bit off testing this way, as the problem isn't technically related to the snapshot restore process; it's related to us removing the listener.
- the ideal way would be to have an integration test where we 1. assess health node is running, 2. somehow remove the health task, and 3. verify that the health task got started again. Do you know if step 2 here is (easily) possible?
Or perhaps you had another way of testing in mind?
@gmarouli what kind of test did you have in mind? I thought about testing as well but couldn't really come up with a good way to test this (but forgot to write that in the PR). I think that
- unit testing is going to be a bit hard here, as there wasn't really anything wrong with the "unit" (i.e. the
HealthNodeTaskExecutor
class/runTask
method), and testing thatremoveListener
is not called doesn't make a lot of sense to me.- adding a test here (David showed me how) would allow us to test this, but it feels a bit off testing this way, as the problem isn't technically related to the snapshot restore process; it's related to us removing the listener.
- the ideal way would be to have an integration test where we 1. assess health node is running, 2. somehow remove the health task, and 3. verify that the health task got started again. Do you know if step 2 here is (easily) possible?
Or perhaps you had another way of testing in mind?
Hm, fair point. I agree with your analysis above. The last one sounds great but I am not aware of how to delete the task. What about changing the current unit test: https://github.com/elastic/elasticsearch/blob/2cec43d0a8e49ca3debde64084803799cab78313/server/src/test/java/org/elasticsearch/health/node/selection/HealthNodeExecutorTests.java#L94-L109
to this:
public void testTaskCreation() throws Exception {
HealthNodeTaskExecutor.create(clusterService, persistentTasksService, featureService, settings, clusterSettings);
clusterService.getClusterApplierService().onNewClusterState("initialization", this::initialState, ActionListener.noop());
// Ensure that if the task is gone, it will be recreated.
clusterService.getClusterApplierService().onNewClusterState("initialization", this::initialState, ActionListener.noop());
assertBusy(
() -> verify(persistentTasksService, times(2)).sendStartRequest(
eq("health-node"),
eq("health-node"),
eq(new HealthNodeTaskParams()),
any()
)
);
}
PS: If it's not much trouble can you rename HealthNodeExecutorTests
to HealthNodeTaskExecutorTests
. This is probably my mistake and it's causing IDE navigation to not detect the test.
@gmarouli I quite like that solution, thanks a lot for the suggestion! And will rename the test class 👍🏻 .