hdfs-deprecated

Reconciliation should handle updates from unknown tasks

Open adam-mesos opened this issue 10 years ago • 4 comments

There's a chance that task reconciliation could return a status update for a task that the scheduler (and its persistentState) either does not know about, or no longer considers to be running. We should ensure that these untracked tasks are not silently ignored by the scheduler, since we may want to shut down an unknown task, or at least note it as currently running.

adam-mesos avatar Mar 17 '15 00:03 adam-mesos

This should be fixed with the latest PR for reconciliation that was merged. Feel free to reopen if there are remaining issues (cc @gabrielhartmann).

elingg avatar Sep 04 '15 22:09 elingg

@gabrielhartmann Did you write a test that proves that the above condition (implicit reconciliation returns status for an unknown/terminated task) is handled correctly? We wouldn't want unknown tasks to continue running without the scheduler ever realizing it. Reopening until I see proof of a test that validates this use case. Ideally a commit link, but I would also accept confirmation that it was tested/considered somehow.

adam-mesos avatar Sep 09 '15 03:09 adam-mesos

@adam-mesos: Here's the line where the case of a notification for an unknown Task is noted during reconciliation. Since we now do both Implicit and Explicit reconciliation we receive duplicate status updates for all Tasks. This is handled in the sense that duplicate or unexpected Task updates are ignored as part of reconciliation. They are not ignored from the state machine's perspective. The LiveState still tracks all status updates and will make appropriate state transition decisions if the timing is reasonable.

However, there is definitely a possible race condition that could leave an orphaned Task. For example, if a status update arrives for an unknown (not duplicate) Journal Node Task after the state machine has exited its Journal Node creation phase, then a warning will be logged, but the extra Task won't be killed.

I'm not sure where this leaves this issue. You write, "We should ensure that these untracked tasks are not silently ignored by the scheduler, since we may want to shut down an unknown task, or at least note it as currently running." We certainly note all cases where an unknown task is encountered; however, I think the better course of action would be to correct the state of the system so that it agrees with the desired state. That is, we should handle unexpected status updates either by ignoring duplicates when appropriate or by killing tasks when they exceed the desired task/node count.
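To make that proposal concrete, here is a minimal sketch of the decision, not the scheduler's actual code. The `Reconciler` class, the `Action` values, and the `"journalnode"` key are all hypothetical names chosen for illustration, and the `TRACK` branch (adopting an unknown task while the count is still below the desired one) is an added assumption rather than something settled in this thread.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"  # duplicate update for a task we already track
    TRACK = "track"    # unknown task, still within the desired count (assumption)
    KILL = "kill"      # unknown task that would exceed the desired count


@dataclass
class Reconciler:
    # Hypothetical desired task count per node type, e.g. {"journalnode": 3}.
    desired: dict
    # Task IDs the scheduler currently tracks, grouped by node type.
    known: dict = field(default_factory=dict)

    def on_running_update(self, node_type, task_id):
        """Decide what to do with a running-task update seen during reconciliation."""
        tracked = self.known.setdefault(node_type, set())
        if task_id in tracked:
            # Implicit plus explicit reconciliation reports each task twice.
            return Action.IGNORE
        if len(tracked) < self.desired.get(node_type, 0):
            # Room left under the desired count: start tracking the task.
            tracked.add(task_id)
            return Action.TRACK
        # Tracking this task would exceed the desired count.
        return Action.KILL


reconciler = Reconciler(
    desired={"journalnode": 3},
    known={"journalnode": {"jn-1", "jn-2", "jn-3"}},
)
duplicate = reconciler.on_running_update("journalnode", "jn-1")
orphan = reconciler.on_running_update("journalnode", "jn-4")
```

With three Journal Nodes already tracked, `duplicate` is `Action.IGNORE` and `orphan` is `Action.KILL`. In a real scheduler the `KILL` result would presumably be turned into a kill request through the Mesos scheduler driver, and the decision would have to be made regardless of which phase the state machine is in, so that the race described above cannot leave an orphan behind.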

gabrielhartmann avatar Sep 09 '15 03:09 gabrielhartmann

Okay, I'm glad we're at least logging it, but what do we really want to do if we see an update for an unknown running task? As you suggested, we could end up with more JNs/NNs than desired, and it might be appropriate to kill one.

adam-mesos avatar Sep 12 '15 06:09 adam-mesos