elasticsearch
elasticsearch copied to clipboard
Fix TasksIT#testGetTaskWaitForCompletionWithoutStoringResult
Make sure the .tasks
index is created before we starting testing task completion without storing its result. To achieve that, we store a fake task before we start waitForCompletionTestCase
.
It seems that the failure (the missed index) has always existed in the test scenario and it's supposed to be handled by TransportGetTaskAction.java. We catch IndexNotFoundException
here and convert it to ResourceNotFoundException
. Then we catch ResourceNotFoundException
here and return a snapshot of a task as a response.
So, creating the index explicitly seems to be the right solution, so we can keep the test focused on handling waiting for completion, rather than handling of the missed index/task for which there are other tests.
Resolves #107823
Pinging @elastic/es-distributed (Team:Distributed)
@elasticmachine update branch
@elasticmachine update branch
The linked issue says that the tasks index got deleted, but that does not seem to match the resolution here? Can we find out why the tasks index was deleted too soon instead?
@henningandersen I believe the comment in the linked issue is wrong. The index was never deleted, because the test doesn't create the index. The test waits for the a completion of a task and the tasks only completes, because we have special error handling for the case where the index doesn't exist. I guess in some cases the error handling doesn't can't figure out that the root cause was IndexNotFoundException
which should be converted to ResourceNotFoundException
which is silently ignored.
I believe we shoud just explicitly create the index, because testGetTaskWaitForCompletionWithoutStoringResult
is supposed to test task completion, not the error handling for missed indexes which is done in testGetTaskNotFound
and testTasksGetWaitForNoTask
.
@arteam it still smells like we might be covering up for a bug here. AFAICS, we expect the logic to work regardless of whether the index exists or not. Can you elaborate on how the test differentiates between whether the task exists or not? Since it if it is within the actual tasks code, we may want to target that instead (as well as add a dedicated test for it).
On Wed, May 15, 2024 at 3:02 PM Artem Prigoda @.***> wrote:
Started digging more deeply and the test stopped failing after #108052 https://github.com/elastic/elasticsearch/pull/108052 got merged
I'm pretty sure #108052 had no effect here, it was a pure refactoring.
Message ID: @.***>
I'm pretty sure #108052 had no effect here, it was a pure refactoring.
Sorry about that! I deleted my comment right I realized that #108052 indeed just removed dead code, I was confused by the line numbers in the stack trace.
Still, the only way can I see the test failing is ExceptionsHelper.unwrap(e, ResourceNotFoundException.class)
returning null. In fact, if I replace it with if (false)
the error stack trace seems exactly like the one in the issue. Not sure how it is possible, though.
@elasticmachine update branch
The main problem seems to be that the test case does not find the task running, see this part of the stack trace:
at org.elasticsearch.action.admin.cluster.node.tasks.get.TransportGetTaskAction.getRunningTaskFromNode(TransportGetTaskAction.java:140)
which is this line.
This is where the focus should go I think. The test ran in less than 50ms, so it is not something timeout related, rather likely some race. I did a bit of digging but did not find it.
I do notice that the test case is a suite case, which are sometimes disturbed by prior test. I did not find any such evidence though, so might be a red herring.
I notice that the test writes Test task finished on the node
, so the test task was not cancelled either, since then I believe it would not output that.
@elasticmachine update branch
@henningandersen That was a very good catch! getFinishedTaskFromIndex
was called from getRunningTaskFromNode
, not from waitedForCompletion
. There indeed seems to be a race between creating a get request and unblocking request which are sent asynchronously. I've changed waitForCompletionTestCase
to unblock the task only after the request started waiting for the task completion by registering a removal listener. By doing so, we make sure we test the "wait for completion" branch when task is running.
The part about the missed index seems to irrelevant, since waitedForCompletion
is able to suppress the error and return a snapshot of running task which is not possible if getFinishedTaskFromIndex
gets called directly from getRunningTaskFromNode
.
@henningandersen Any chance you would be able to get a look at the changes in the PR?
There indeed seems to be a race between creating a get request and unblocking request which are sent asynchronously
Did you manage to reproduce this by putting in a sleep somewhere? I'd like to fully understand the situation.
@henningandersen Yes, the error is reproduced trivially if you unblock the request first and add a small delay before calling the waitForCompletion
request.
// Unblock the request so the wait for completion request can finish
client().execute(UNBLOCK_TASK_ACTION, new TestTaskPlugin.UnblockTestTasksRequest()).get();
Thread.sleep(1000);
// Spin up a request to wait for the test task to finish
waitResponseFuture = wait.apply(taskId);
trivially if you unblock the request first and add a small delay before calling the waitForCompletion request.
I am not sure I understand why it would be an ok reproduction to swap the order of unblock and wait here, can you elaborate? Is it possible to just add a sleep somewhere else to see it fail?
I am not sure I understand why it would be an ok reproduction to swap the order of unblock and wait here, can you elaborate? Is it possible to just add a sleep somewhere else to see it fail?
@henningandersen I believe the issue is that order is undefined since both operations are run asynchronously. We do not check that the request clusterAdmin().prepareGetTask(id).setWaitForCompletion(true)
is finished before unblocking the task. We just get a ActionFuture
and immediately call client().execute(UNBLOCK_TASK_ACTION, new TestTaskPlugin.UnblockTestTasksRequest()).get()
.
So, depending on a race which of one these requests will be processed first, we will get a different result. That's why taskManager.getTask(request.getTaskId().getId())
in TransportGetTaskAction
can return null which happens if the unlock request manages to win the race.
Thanks, that makes sense. I would have hoped we could put in a simple sleep somewhere to provoke it but I were not successful on that yet.
I can reproduce this using -Dtests.seed=F52B12BE60A068C8
and a sleep at the beginning of getRunningTaskFromNode
.
@elasticmachine update branch
Thank you!