hdfs-deprecated
Error bootstrapping 2nd NN when starting up
I noticed this in the log. The framework stopped attempting to bootstrap the 2nd namenode. It looks like when a task is lost and a replacement is launched, the LiveState isn't updated, or perhaps just not fast enough. I should note that I don't know why that namenode task became lost.
12:46:39.263 [Thread-50] INFO org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.namenode.namenode.NameNodeExecutor.1426268773920 state=TASK_RUNNING message='-i' stagingTasks.size=0
12:46:39.264 [Thread-50] INFO org.apache.mesos.hdfs.Scheduler - Current Acquisition Phase: FORMAT_NAME_NODES
12:46:39.264 [Thread-50] INFO org.apache.mesos.hdfs.Scheduler - Sending message '-b' to taskId=task.namenode.namenode.NameNodeExecutor.1426268774910, slaveId=20150311-133327-169978048-5050-2699-S2
12:46:39.949 [Thread-51] INFO org.apache.mesos.hdfs.Scheduler - Received 3 offers
12:46:40.950 [Thread-52] INFO org.apache.mesos.hdfs.Scheduler - Received 1 offers
12:46:44.961 [Thread-53] INFO org.apache.mesos.hdfs.Scheduler - Received 3 offers
12:46:45.963 [Thread-54] INFO org.apache.mesos.hdfs.Scheduler - Received 1 offers
12:46:49.979 [Thread-55] INFO org.apache.mesos.hdfs.Scheduler - Received 3 offers
12:46:50.980 [Thread-56] INFO org.apache.mesos.hdfs.Scheduler - Received 1 offers
12:46:53.454 [Thread-57] INFO org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.namenode.namenode.NameNodeExecutor.1426268774910 state=TASK_LOST message='Executor terminated' stagingTasks.size=0
12:46:53.468 [Thread-58] INFO org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.zkfc.namenode.NameNodeExecutor.1426268774910 state=TASK_LOST message='Executor terminated' stagingTasks.size=0
12:46:54.987 [Thread-59] INFO org.apache.mesos.hdfs.Scheduler - Received 4 offers
12:46:54.989 [Thread-59] INFO org.apache.mesos.hdfs.Scheduler - Launching node of type namenode with tasks [namenode, zkfc]
Saving the name node mesos-slave3 task.namenode.namenode.NameNodeExecutor.1426268814989
12:46:58.329 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.namenode.namenode.NameNodeExecutor.1426268814989 state=TASK_RUNNING message='' stagingTasks.size=2
12:46:58.330 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Current Acquisition Phase: START_NAME_NODES
12:46:58.330 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.journalnode.journalnode.NodeExecutor.1426268761923, slaveId=20150311-133327-169978048-5050-2699-S3
12:46:58.330 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.journalnode.journalnode.NodeExecutor.1426268762929, slaveId=20150311-133327-169978048-5050-2699-S2
12:46:58.330 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.journalnode.journalnode.NodeExecutor.1426268767883, slaveId=20150311-133327-169978048-5050-2699-S1
12:46:58.330 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.namenode.namenode.NameNodeExecutor.1426268773920, slaveId=20150311-133327-169978048-5050-2699-S3
12:46:58.331 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.zkfc.namenode.NameNodeExecutor.1426268773920, slaveId=20150311-133327-169978048-5050-2699-S3
12:46:58.331 [Thread-60] INFO org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.namenode.namenode.NameNodeExecutor.1426268814989, slaveId=20150311-133327-169978048-5050-2699-S2
12:46:58.333 [Thread-61] INFO org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.zkfc.namenode.NameNodeExecutor.1426268814989 state=TASK_RUNNING message='' stagingTasks.size=1
12:46:58.335 [Thread-61] INFO org.apache.mesos.hdfs.Scheduler - Current Acquisition Phase: FORMAT_NAME_NODES
12:46:58.336 [Thread-61] INFO org.apache.mesos.hdfs.Scheduler - Sending message '-b' to taskId=task.namenode.namenode.NameNodeExecutor.1426268774910, slaveId=20150311-133327-169978048-5050-2699-S2
Hi @nicgrayson, yes, figuring out why the NN task was lost is an important detail that I would like to know. Do you have access to those logs? That said, even if the first task was lost, it should relaunch on another node and still bootstrap the second namenode successfully. I will see if I can reproduce this as well.
You can see that it relaunched a new NN, but it sent the bootstrap message ('-b') to the old taskId.
Ah, yes, it seems the scheduler needs to update the LiveState appropriately and at the right time. I will see if I can reproduce this.
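A minimal sketch of the kind of bookkeeping this points to, assuming a hypothetical NameNodeLiveStateSketch class (its name and methods are illustrative, not the project's actual LiveState API): evict a namenode task from the live set as soon as a terminal status update such as TASK_LOST arrives, so that when the FORMAT_NAME_NODES phase sends '-b' it resolves the target from the current live set rather than from a stale taskId.

```java
// Hypothetical sketch only: this class and its methods are illustrative and are
// not the project's actual LiveState API.
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.mesos.Protos.SlaveID;
import org.apache.mesos.Protos.TaskStatus;

public class NameNodeLiveStateSketch {

  // Running namenode tasks in launch order, keyed by taskId value; the slaveId is
  // kept so a framework message could later be addressed to the right slave.
  private final Map<String, SlaveID> runningNameNodes = new LinkedHashMap<>();

  /** Would be called from the scheduler's statusUpdate for namenode task updates. */
  public void update(TaskStatus status) {
    String taskId = status.getTaskId().getValue();
    switch (status.getState()) {
      case TASK_RUNNING:
        runningNameNodes.put(taskId, status.getSlaveId());
        break;
      case TASK_LOST:
      case TASK_FAILED:
      case TASK_FINISHED:
      case TASK_KILLED:
        // Evict dead tasks immediately so later phases never address them.
        runningNameNodes.remove(taskId);
        break;
      default:
        break;
    }
  }

  /**
   * The second live namenode is the bootstrap ('-b') target. With the eviction above,
   * the relaunched task ending in 1426268814989 would be returned here instead of the
   * lost task ending in 1426268774910.
   */
  public String getBootstrapTargetTaskId() {
    return runningNameNodes.keySet().stream().skip(1).findFirst().orElse(null);
  }
}
```

The key point in this sketch is that eviction happens in the status-update path itself, before any phase logic runs, so a relaunch between two updates cannot leave a dead taskId as the bootstrap target.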