
Handle failure of scheduler over long period of time or repeated failure of scheduler

Open elingg opened this issue 9 years ago • 14 comments

If the scheduler fails at the same time that a node task dies, stays down for a significant period of time (more than 5-10 minutes), and then starts up again, it will try to recover the dead node tasks even though they may have been dead for quite a while. The same problem can occur if the scheduler restarts over and over.

elingg avatar Apr 30 '15 23:04 elingg

ZooKeeper storage with an in-memory cache may be a good solution for this. In fact, refactoring the persistent state to use an in-memory cache would be ideal.
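As a very rough sketch of what that could look like, assuming a hypothetical BackingStore interface for the ZooKeeper reads and writes (none of these names are the actual hdfs-mesos classes):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: an in-memory cache in front of the ZooKeeper-backed
// persistent state, so repeated reads do not hit ZooKeeper every time.
interface BackingStore {
  String read(String key);               // e.g. a ZooKeeper read
  void write(String key, String value);  // e.g. a ZooKeeper write
}

class CachedPersistentState {
  private final BackingStore zk;
  private final Map<String, String> cache = new ConcurrentHashMap<>();

  CachedPersistentState(BackingStore zk) {
    this.zk = zk;
  }

  String get(String key) {
    // Serve from the cache when possible; fall back to ZooKeeper once.
    return cache.computeIfAbsent(key, zk::read);
  }

  void put(String key, String value) {
    // Write through: ZooKeeper stays the source of truth, the cache stays warm.
    zk.write(key, value);
    cache.put(key, value);
  }
}
```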

elingg avatar Apr 30 '15 23:04 elingg

Hi @elingg, I don't understand "it will try to recover dead node tasks even though they may have been dead for some time". The persistent state (the ZooKeeper state) may not be consistent with the actual state in this situation, but each time we restart the scheduler it clears the ZooKeeper state based on the live state, right? And if a node has died, Mesos will not offer that node's resources to the hdfs-mesos scheduler, so why would the scheduler try to recover the task on the dead node once the ReconcileStateTask has cleared the ZooKeeper state?

tangzhankun avatar May 04 '15 08:05 tangzhankun

Hi @tangzhankun, with the current implementation, if a node dies it has 1.5 minutes to recover (this time is configurable). If the scheduler fails over and a node has died while the scheduler is down, then when the scheduler starts up, the node is again given 1.5 minutes to recover. This works well, except in cases where the scheduler has been down for a long period of time or continually restarts.
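Roughly, the recovery window behaves like this sketch; the field and method names here are made up, only the ~1.5-minute default comes from the behaviour described above:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the recovery window described above.
class DeadNodeTracker {
  private static final long RECOVERY_WINDOW_MS = 90_000; // 1.5 minutes, configurable

  // hostname -> time the node was recorded as dead
  private final Map<String, Long> deadJournalNodes = new ConcurrentHashMap<>();

  void markDead(String hostname) {
    deadJournalNodes.put(hostname, System.currentTimeMillis());
  }

  // While the window is open, only the original host may relaunch the task.
  boolean shouldRelaunchOn(String offerHostname) {
    Long diedAt = deadJournalNodes.get(offerHostname);
    return diedAt != null && System.currentTimeMillis() - diedAt < RECOVERY_WINDOW_MS;
  }

  // Once the window elapses, the entry is dropped and any offer may be used.
  void expireOldEntries() {
    long now = System.currentTimeMillis();
    deadJournalNodes.values().removeIf(diedAt -> now - diedAt >= RECOVERY_WINDOW_MS);
  }
}
```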

elingg avatar May 04 '15 20:05 elingg

Hi @elingg, thanks a lot for the explanation! I saw the implementation of this 1.5-minute timeout. The scheduler does not remove the host record from "journalNodes" until the timeout expires. If the host comes back during those 1.5 minutes, the scheduler relaunches the task on the same host when offer.getHostname() is in the dead-node hashmap. That makes sense. But I am still not sure about the workflow when the scheduler process goes down and restarts. Here is my understanding:

  1. The scheduler's run() executes and then the registered() callback is called.
  2. reconcileTasks() is called.
  3. In reconcileTasks's run() method, persistentState gets all task IDs from ZooKeeper, but liveState.getRunningTasks returns an empty hashmap, so persistentState.removeTaskId(taskId) is executed. Then reconciliation finishes.
  4. Once reconciliation finishes, the resourceOffers callback moves to the JOURNAL_NODES phase and then calls tryToLaunchJournalNode.
  5. If DeadJournalNode is not empty and contains offer.getHostname, it calls launchNode to launch a new journalnode daemon. If DeadJournalNode is empty, the scheduler realizes that 3 journalnodes are already running, so it declines the offer.

Based on the steps above, I am confused about reconcileTasks(). I thought the scheduler should first make the liveState object consistent with ZooKeeper, e.g. its "runningTasks" member, and then call reconcileTasks(). If liveState is not in sync with ZooKeeper and "runningTasks" is empty, how can correctCurrentPhase work correctly? So my questions are (there are a few :( ):

  1. How is liveState synced with the actual situation when the scheduler restarts? If it is not synced, is everything still fine?
  2. Is my understanding of the call trace above (steps 1 to 5) right?

tangzhankun avatar May 05 '15 05:05 tangzhankun

Corrected Steps:

  1. The scheduler's run() executes and then the registered() or reregistered() callback is called.
  2. driver.reconcileTasks() is called, which asks the master to send status updates for all running tasks (sketched below the list).
  3. In reconcileTasks's run() method, persistentState gets all task IDs from ZooKeeper. Also, liveState.getRunningTasks returns a map with all running tasks, since the status updates will have arrived as a result of calling driver.reconcileTasks(). If a task that was persisted in ZooKeeper has died, we remove it.
  4. Once reconciliation finishes, the resourceOffers callback moves to the JOURNAL_NODES phase and then calls tryToLaunchJournalNode.
  5. If DeadJournalNode is not empty and contains offer.getHostname, it calls launchNode to relaunch the journalnode on the same host. If DeadJournalNode is empty (i.e. the timeout has elapsed or there are no dead journalnodes), the scheduler tries to launch a journalnode wherever it finds an offer.
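
A rough sketch of the explicit reconciliation request in step 2, using the Mesos Java API; driver.reconcileTasks() is the real API call, while the surrounding class and the way the persisted task IDs are gathered are only illustrative:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

// Sketch of step 2: ask the master to re-send the latest status of every task
// the scheduler has persisted.
class ReconciliationSketch {
  void requestReconciliation(SchedulerDriver driver, List<String> persistedTaskIds) {
    Collection<TaskStatus> statuses = new ArrayList<>();
    for (String taskId : persistedTaskIds) {
      statuses.add(TaskStatus.newBuilder()
          .setTaskId(TaskID.newBuilder().setValue(taskId))
          .setState(TaskState.TASK_STAGING) // placeholder; the master reports the real state
          .build());
    }
    // The master answers with one status update per task; statusUpdate() then
    // repopulates liveState, and tasks that come back as lost can be removed
    // from the ZooKeeper-backed persistent state.
    driver.reconcileTasks(statuses);
  }
}
```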

LiveState should be reconciled correctly with PersistentState. I corrected the steps above. Please let me know if you need further clarification.

elingg avatar May 05 '15 18:05 elingg

Hi @elingg, thanks very much. I missed the step 2 that you mentioned; I remembered it being called inside reconcileTasks(), but didn't know what driver.reconcileTasks() meant. However, the issue is still not clear to me: if a host (e.g. a journalnode) dies after the scheduler dies, and 5-10 minutes pass, then when the scheduler starts again it will think the host has only just died (because the deadJournalNodeTimeStamp is newly created in persistentState.removeTaskId(taskId)) and return it as a DeadJournalNode.

  1. DeadJournalNode is then not empty, so for the next 1.5 minutes this journalnode will not be launched on any host except the one that died before (see the sketch below).
  2. But after the 1.5 minutes expire, the scheduler removes this host from the dead journalnode records and launches a journalnode wherever it finds an offer.
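
To make point 1 concrete, here is a minimal sketch of the behaviour I mean (all names are illustrative, not the actual hdfs-mesos code):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of point 1: when the restarted scheduler removes a persisted task
// that is no longer running, it stamps the node as dead *now*, re-opening the
// full 1.5-minute window even if the node has already been down for 5-10
// minutes while the scheduler was offline.
class RestartTimestampSketch {
  private final Set<String> persistedTaskIds = new HashSet<>();
  private final Map<String, Long> deadJournalNodeTimeStamps = new ConcurrentHashMap<>();

  void removeTaskId(String taskId, String hostname) {
    persistedTaskIds.remove(taskId);
    // Timestamp taken at scheduler restart time, not when the node actually died.
    deadJournalNodeTimeStamps.put(hostname, System.currentTimeMillis());
  }
}
```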

Do you think situation 1 above is the issue?

tangzhankun avatar May 07 '15 00:05 tangzhankun

Hi @tangzhankun, yes, your description of the issue is correct.

elingg avatar May 07 '15 23:05 elingg

Hi @elingg, thanks for the quick reply. But this is not a serious bug, right? :) After the 1.5-minute timeout, the scheduler removes the host and accepts an offer to launch the journalnode.

tangzhankun avatar May 08 '15 01:05 tangzhankun

Correct, not serious, but we still want to get rid of these bugs as well!

elingg avatar May 08 '15 02:05 elingg

Hi @elingg, I agree. And I see your "Fix reconciliation features" change has been merged; it clears the dead timestamp on startup. Thanks for your reply. I have one more question: where can I find the API documentation for Mesos? I know the callbacks one needs to implement, but I also saw some APIs used beyond the callbacks, like the method "ExecutorInfo.newBuilder().setName()" and so on.

tangzhankun avatar May 10 '15 12:05 tangzhankun

Hi @tangzhankun, I find the protobufs for Mesos very useful in this regard: https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto.
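For example, every message in mesos.proto generates a Java class with a newBuilder() factory, so ExecutorInfo.newBuilder().setName() is just the generated setter for the name field. A small sketch with placeholder values:

```java
import org.apache.mesos.Protos.CommandInfo;
import org.apache.mesos.Protos.ExecutorID;
import org.apache.mesos.Protos.ExecutorInfo;

// Illustrative use of the generated protobuf builders; the executor id, name,
// and command value below are placeholders, not what hdfs-mesos actually uses.
class ProtoBuilderExample {
  static ExecutorInfo buildExecutor() {
    return ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue("example-executor"))
        .setName("example-executor")
        .setCommand(CommandInfo.newBuilder()
            .setValue("java -jar executor.jar")) // hypothetical launch command
        .build();
  }
}
```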

elingg avatar May 11 '15 21:05 elingg

Hi @elingg, thanks for the pointer to the protocol buffers. The APIs also include MesosSchedulerDriver's "reconcileTasks()", "sendFrameworkMessage()", and so on. Although we can understand them from the Mesos source code and its examples, there is no detailed documentation for how to use these APIs, right?

tangzhankun avatar May 12 '15 04:05 tangzhankun

You might also find this link useful for the Java APIs: http://mesos.apache.org/api/latest/java/org/apache/mesos/MesosSchedulerDriver.html.

elingg avatar May 12 '15 18:05 elingg

Hi @elingg, yes. Thanks very much, elingg. :)

tangzhankun avatar May 13 '15 02:05 tangzhankun