elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

Fix TasksIT#testGetTaskWaitForCompletionWithoutStoringResult

Open arteam opened this issue 9 months ago • 9 comments

Make sure the .tasks index is created before we starting testing task completion without storing its result. To achieve that, we store a fake task before we start waitForCompletionTestCase.

It seems that the failure (the missed index) has always existed in the test scenario and it's supposed to be handled by TransportGetTaskAction.java. We catch IndexNotFoundException here and convert it to ResourceNotFoundException. Then we catch ResourceNotFoundException here and return a snapshot of a task as a response.

So, creating the index explicitly seems to be the right solution, so we can keep the test focused on handling waiting for completion, rather than handling of the missed index/task for which there are other tests.

Resolves #107823

arteam avatar Apr 30 '24 14:04 arteam

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine avatar Apr 30 '24 14:04 elasticsearchmachine

@elasticmachine update branch

arteam avatar May 02 '24 06:05 arteam

@elasticmachine update branch

arteam avatar May 07 '24 14:05 arteam

The linked issue says that the tasks index got deleted, but that does not seem to match the resolution here? Can we find out why the tasks index was deleted too soon instead?

henningandersen avatar May 10 '24 14:05 henningandersen

@henningandersen I believe the comment in the linked issue is wrong. The index was never deleted, because the test doesn't create the index. The test waits for the a completion of a task and the tasks only completes, because we have special error handling for the case where the index doesn't exist. I guess in some cases the error handling doesn't can't figure out that the root cause was IndexNotFoundException which should be converted to ResourceNotFoundException which is silently ignored.

I believe we shoud just explicitly create the index, because testGetTaskWaitForCompletionWithoutStoringResult is supposed to test task completion, not the error handling for missed indexes which is done in testGetTaskNotFound and testTasksGetWaitForNoTask.

arteam avatar May 13 '24 07:05 arteam

@arteam it still smells like we might be covering up for a bug here. AFAICS, we expect the logic to work regardless of whether the index exists or not. Can you elaborate on how the test differentiates between whether the task exists or not? Since it if it is within the actual tasks code, we may want to target that instead (as well as add a dedicated test for it).

henningandersen avatar May 15 '24 06:05 henningandersen

On Wed, May 15, 2024 at 3:02 PM Artem Prigoda @.***> wrote:

Started digging more deeply and the test stopped failing after #108052 https://github.com/elastic/elasticsearch/pull/108052 got merged

I'm pretty sure #108052 had no effect here, it was a pure refactoring.

Message ID: @.***>

DaveCTurner avatar May 15 '24 14:05 DaveCTurner

I'm pretty sure #108052 had no effect here, it was a pure refactoring.

Sorry about that! I deleted my comment right I realized that #108052 indeed just removed dead code, I was confused by the line numbers in the stack trace.

arteam avatar May 15 '24 15:05 arteam

Still, the only way can I see the test failing is ExceptionsHelper.unwrap(e, ResourceNotFoundException.class) returning null. In fact, if I replace it with if (false) the error stack trace seems exactly like the one in the issue. Not sure how it is possible, though.

arteam avatar May 15 '24 15:05 arteam

@elasticmachine update branch

arteam avatar May 18 '24 01:05 arteam

The main problem seems to be that the test case does not find the task running, see this part of the stack trace:

      at org.elasticsearch.action.admin.cluster.node.tasks.get.TransportGetTaskAction.getRunningTaskFromNode(TransportGetTaskAction.java:140)

which is this line.

This is where the focus should go I think. The test ran in less than 50ms, so it is not something timeout related, rather likely some race. I did a bit of digging but did not find it.

I do notice that the test case is a suite case, which are sometimes disturbed by prior test. I did not find any such evidence though, so might be a red herring.

I notice that the test writes Test task finished on the node, so the test task was not cancelled either, since then I believe it would not output that.

henningandersen avatar May 19 '24 19:05 henningandersen

@elasticmachine update branch

arteam avatar May 21 '24 07:05 arteam

@henningandersen That was a very good catch! getFinishedTaskFromIndex was called from getRunningTaskFromNode, not from waitedForCompletion. There indeed seems to be a race between creating a get request and unblocking request which are sent asynchronously. I've changed waitForCompletionTestCase to unblock the task only after the request started waiting for the task completion by registering a removal listener. By doing so, we make sure we test the "wait for completion" branch when task is running.

The part about the missed index seems to irrelevant, since waitedForCompletion is able to suppress the error and return a snapshot of running task which is not possible if getFinishedTaskFromIndex gets called directly from getRunningTaskFromNode.

arteam avatar May 21 '24 07:05 arteam

@henningandersen Any chance you would be able to get a look at the changes in the PR?

arteam avatar May 24 '24 14:05 arteam

There indeed seems to be a race between creating a get request and unblocking request which are sent asynchronously

Did you manage to reproduce this by putting in a sleep somewhere? I'd like to fully understand the situation.

henningandersen avatar May 27 '24 09:05 henningandersen

@henningandersen Yes, the error is reproduced trivially if you unblock the request first and add a small delay before calling the waitForCompletion request.

 // Unblock the request so the wait for completion request can finish
client().execute(UNBLOCK_TASK_ACTION, new TestTaskPlugin.UnblockTestTasksRequest()).get();
Thread.sleep(1000);
// Spin up a request to wait for the test task to finish
waitResponseFuture = wait.apply(taskId);

arteam avatar May 27 '24 12:05 arteam

trivially if you unblock the request first and add a small delay before calling the waitForCompletion request.

I am not sure I understand why it would be an ok reproduction to swap the order of unblock and wait here, can you elaborate? Is it possible to just add a sleep somewhere else to see it fail?

henningandersen avatar May 27 '24 12:05 henningandersen

I am not sure I understand why it would be an ok reproduction to swap the order of unblock and wait here, can you elaborate? Is it possible to just add a sleep somewhere else to see it fail?

@henningandersen I believe the issue is that order is undefined since both operations are run asynchronously. We do not check that the request clusterAdmin().prepareGetTask(id).setWaitForCompletion(true) is finished before unblocking the task. We just get a ActionFuture and immediately call client().execute(UNBLOCK_TASK_ACTION, new TestTaskPlugin.UnblockTestTasksRequest()).get().

So, depending on a race which of one these requests will be processed first, we will get a different result. That's why taskManager.getTask(request.getTaskId().getId()) in TransportGetTaskAction can return null which happens if the unlock request manages to win the race.

arteam avatar May 27 '24 13:05 arteam

Thanks, that makes sense. I would have hoped we could put in a simple sleep somewhere to provoke it but I were not successful on that yet.

henningandersen avatar May 27 '24 13:05 henningandersen

I can reproduce this using -Dtests.seed=F52B12BE60A068C8 and a sleep at the beginning of getRunningTaskFromNode.

henningandersen avatar May 28 '24 06:05 henningandersen

@elasticmachine update branch

arteam avatar May 28 '24 07:05 arteam

Thank you!

arteam avatar May 29 '24 07:05 arteam