hdfs-deprecated
Error bootstrapping 2nd NN when starting up
I noticed this in the log. The framework stopped attempting to bootstrap the 2nd namenode. It looks like when a task is lost and a replacement is launched, the LiveState isn't updated, or perhaps just not fast enough. I should note that I don't know why that namenode task became lost.
12:46:39.263 [Thread-50] INFO org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.namenode.namenode.NameNodeExecutor.1426268773920 state=TASK_RUNNING message='-i' stagingTasks.size=0
12:46:39.264 [Thread-50] INFO org.apache.mesos.hdfs.Scheduler - Current Acquisition Phase: FORMAT_NAME_NODES
12:46:39.264 [Thread-50] INFO org.apache.mesos.hdfs.Scheduler - Sending message '-b' to taskId=task.namenode.namenode.NameNodeExecutor.1426268774910, slaveId=20150311-133327-169978048-5050-2699-S2
12:46:39.949 [Thread-51] INFO org.apache.mesos.hdfs.Scheduler - Received 3 offers
12:46:40.950 [Thread-52] INFO org.apache.mesos.hdfs.Scheduler - Received 1 offers
12:46:44.961 [Thread-53] INFO org.apache.mesos.hdfs.Scheduler - Received 3 offers
12:46:45.963 [Thread-54] INFO org.apache.mesos.hdfs.Scheduler - Received 1 offers
12:46:49.979 [Thread-55] INFO org.apache.mesos.hdfs.Scheduler - Received 3 offers
12:46:50.980 [Thread-56] INFO org.apache.mesos.hdfs.Scheduler - Received 1 offers
12:46:53.454 [Thread-57] INFO org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.namenode.namenode.NameNodeExecutor.1426268774910 state=TASK_LOST message='Executor terminated' stagingTasks.size=0
12:46:53.468 [Thread-58] INFO org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.zkfc.namenode.NameNodeExecutor.1426268774910 state=TASK_LOST message='Executor terminated' stagingTasks.size=0
12:46:54.987 [Thread-59] INFO org.apache.mesos.hdfs.Scheduler - Received 4 offers
12:46:54.989 [Thread-59] INFO org.apache.mesos.hdfs.Scheduler - Launching node of type namenode with tasks [namenode, zkfc]
Saving the name node mesos-slave3 task.namenode.namenode.NameNodeExecutor.1426268814989
12:46:58.329 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.namenode.namenode.NameNodeExecutor.1426268814989 state=TASK_RUNNING message='' stagingTasks.size=2
12:46:58.330 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Current Acquisition Phase: START_NAME_NODES
12:46:58.330 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.journalnode.journalnode.NodeExecutor.1426268761923, slaveId=20150311-133327-169978048-5050-2699-S3
12:46:58.330 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.journalnode.journalnode.NodeExecutor.1426268762929, slaveId=20150311-133327-169978048-5050-2699-S2
12:46:58.330 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.journalnode.journalnode.NodeExecutor.1426268767883, slaveId=20150311-133327-169978048-5050-2699-S1
12:46:58.330 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.namenode.namenode.NameNodeExecutor.1426268773920, slaveId=20150311-133327-169978048-5050-2699-S3
12:46:58.331 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.zkfc.namenode.NameNodeExecutor.1426268773920, slaveId=20150311-133327-169978048-5050-2699-S3
12:46:58.331 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.namenode.namenode.NameNodeExecutor.1426268814989, slaveId=20150311-133327-169978048-5050-2699-S2
12:46:58.333 [Thread-61] INFO org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.zkfc.namenode.NameNodeExecutor.1426268814989 state=TASK_RUNNING message='' stagingTasks.size=1
12:46:58.335 [Thread-61] INFO org.apache.mesos.hdfs.Scheduler - Current Acquisition Phase: FORMAT_NAME_NODES
12:46:58.336 [Thread-61] INFO org.apache.mesos.hdfs.Scheduler - Sending message '-b' to taskId=task.namenode.namenode.NameNodeExecutor.1426268774910, slaveId=20150311-133327-169978048-5050-2699-S2
Hi @nicgrayson, yes, figuring out why the NN task was lost is an important detail that I would like to know. Do you have access to those logs? That said, even if the first task was lost, it should relaunch on another node and still bootstrap the second namenode successfully. I will see if I can reproduce this as well.
You can see that it relaunched a new NN, but it sent the bootstrap message ('-b') to the old taskId.
Ah, yes, it seems the scheduler needs to update the LiveState appropriately and at the right time. I will see if I can reproduce this.
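A minimal sketch of the kind of bookkeeping this points to, assuming a hypothetical NameNodeLiveStateSketch class (its name and methods are illustrative, not the project's actual LiveState API): evict a namenode task from the live set as soon as a terminal status update such as TASK_LOST arrives, so that when the FORMAT_NAME_NODES phase sends '-b' it resolves the target from the current live set rather than from a stale taskId.

```java
// Hypothetical sketch only: this class and its methods are illustrative and are
// not the project's actual LiveState API.
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.mesos.Protos.SlaveID;
import org.apache.mesos.Protos.TaskStatus;

public class NameNodeLiveStateSketch {

  // Running namenode tasks in launch order, keyed by taskId value; the slaveId is
  // kept so a framework message could later be addressed to the right slave.
  private final Map<String, SlaveID> runningNameNodes = new LinkedHashMap<>();

  /** Would be called from the scheduler's statusUpdate for namenode task updates. */
  public void update(TaskStatus status) {
    String taskId = status.getTaskId().getValue();
    switch (status.getState()) {
      case TASK_RUNNING:
        runningNameNodes.put(taskId, status.getSlaveId());
        break;
      case TASK_LOST:
      case TASK_FAILED:
      case TASK_FINISHED:
      case TASK_KILLED:
        // Evict dead tasks immediately so later phases never address them.
        runningNameNodes.remove(taskId);
        break;
      default:
        break;
    }
  }

  /**
   * The second live namenode is the bootstrap ('-b') target. With the eviction above,
   * the relaunched task ending in 1426268814989 would be returned here instead of the
   * lost task ending in 1426268774910.
   */
  public String getBootstrapTargetTaskId() {
    return runningNameNodes.keySet().stream().skip(1).findFirst().orElse(null);
  }
}
```

The key point in this sketch is that eviction happens in the status-update path itself, before any phase logic runs, so a relaunch between two updates cannot leave a dead taskId as the bootstrap target.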