elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

Don't stop checking if the `HealthNode` persistent task is present

Open nielsbauman opened this issue 1 year ago • 5 comments

We assumed that once the HealthNode persistent task is registered, we won't need to register it again. However, when, for instance, we restore from a snapshot (including cluster state) that was created in version <= 8.4.3, that task doesn't exist yet, which will result in the task being removed after the restore. By keeping the listener active, we will re-add the task after such a restore (or any other potential situation where the task might get deleted).

Fixes #98926

nielsbauman avatar Feb 13 '24 11:02 nielsbauman

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine avatar Feb 13 '24 11:02 elasticsearchmachine

Hi @nielsbauman, I've created a changelog YAML for you.

elasticsearchmachine avatar Feb 13 '24 11:02 elasticsearchmachine

@gmarouli what kind of test did you have in mind? I thought about testing as well but couldn't really come up with a good way to test this (but forgot to write that in the PR). I think that

  1. unit testing is going to be a bit hard here, as there wasn't really anything wrong with the "unit" (i.e. the HealthNodeTaskExecutor class/runTask method), and testing that removeListener is not called doesn't make a lot of sense to me.
  2. adding a test here (David showed me how) would allow us to test this, but it feels a bit off testing this way, as the problem isn't technically related to the snapshot restore process; it's related to us removing the listener.
  3. the ideal way would be to have an integration test where we 1. assess health node is running, 2. somehow remove the health task, and 3. verify that the health task got started again. Do you know if step 2 here is (easily) possible?

Or perhaps you had another way of testing in mind?

nielsbauman avatar Feb 14 '24 16:02 nielsbauman

@gmarouli what kind of test did you have in mind? I thought about testing as well but couldn't really come up with a good way to test this (but forgot to write that in the PR). I think that

  1. unit testing is going to be a bit hard here, as there wasn't really anything wrong with the "unit" (i.e. the HealthNodeTaskExecutor class/runTask method), and testing that removeListener is not called doesn't make a lot of sense to me.
  2. adding a test here (David showed me how) would allow us to test this, but it feels a bit off testing this way, as the problem isn't technically related to the snapshot restore process; it's related to us removing the listener.
  3. the ideal way would be to have an integration test where we 1. assess health node is running, 2. somehow remove the health task, and 3. verify that the health task got started again. Do you know if step 2 here is (easily) possible?

Or perhaps you had another way of testing in mind?

Hm, fair point. I agree with your analysis above. The last one sounds great but I am not aware of how to delete the task. What about changing the current unit test: https://github.com/elastic/elasticsearch/blob/2cec43d0a8e49ca3debde64084803799cab78313/server/src/test/java/org/elasticsearch/health/node/selection/HealthNodeExecutorTests.java#L94-L109

to this:

public void testTaskCreation() throws Exception {
        HealthNodeTaskExecutor.create(clusterService, persistentTasksService, featureService, settings, clusterSettings);
        clusterService.getClusterApplierService().onNewClusterState("initialization", this::initialState, ActionListener.noop());
        // Ensure that if the task is gone, it will be recreated.
        clusterService.getClusterApplierService().onNewClusterState("initialization", this::initialState, ActionListener.noop());
        assertBusy(
            () -> verify(persistentTasksService, times(2)).sendStartRequest(
                eq("health-node"),
                eq("health-node"),
                eq(new HealthNodeTaskParams()),
                any()
            )
        );
    }

PS: If it's not much trouble can you rename HealthNodeExecutorTests to HealthNodeTaskExecutorTests. This is probably my mistake and it's causing IDE navigation to not detect the test.

gmarouli avatar Feb 15 '24 07:02 gmarouli

@gmarouli I quite like that solution, thanks a lot for the suggestion! And will rename the test class 👍🏻 .

nielsbauman avatar Feb 15 '24 13:02 nielsbauman