
Error bootstrapping 2nd NN when starting up

Open nicgrayson opened this issue 9 years ago • 3 comments

I noticed this in the log: the framework stopped attempting to bootstrap the 2nd namenode. It looks like when a task is lost, relaunching it doesn't update the LiveState, or at least not fast enough. I should note I don't know why that namenode task became lost in the first place.

12:46:39.263 [Thread-50] INFO  org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.namenode.namenode.NameNodeExecutor.1426268773920 state=TASK_RUNNING message='-i' stagingTasks.size=0
12:46:39.264 [Thread-50] INFO  org.apache.mesos.hdfs.Scheduler - Current Acquisition Phase: FORMAT_NAME_NODES
12:46:39.264 [Thread-50] INFO  org.apache.mesos.hdfs.Scheduler - Sending message '-b' to taskId=task.namenode.namenode.NameNodeExecutor.1426268774910, slaveId=20150311-133327-169978048-5050-2699-S2
12:46:39.949 [Thread-51] INFO  org.apache.mesos.hdfs.Scheduler - Received 3 offers
12:46:40.950 [Thread-52] INFO  org.apache.mesos.hdfs.Scheduler - Received 1 offers
12:46:44.961 [Thread-53] INFO  org.apache.mesos.hdfs.Scheduler - Received 3 offers
12:46:45.963 [Thread-54] INFO  org.apache.mesos.hdfs.Scheduler - Received 1 offers
12:46:49.979 [Thread-55] INFO  org.apache.mesos.hdfs.Scheduler - Received 3 offers
12:46:50.980 [Thread-56] INFO  org.apache.mesos.hdfs.Scheduler - Received 1 offers
12:46:53.454 [Thread-57] INFO  org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.namenode.namenode.NameNodeExecutor.1426268774910 state=TASK_LOST message='Executor terminated' stagingTasks.size=0
12:46:53.468 [Thread-58] INFO  org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.zkfc.namenode.NameNodeExecutor.1426268774910 state=TASK_LOST message='Executor terminated' stagingTasks.size=0
12:46:54.987 [Thread-59] INFO  org.apache.mesos.hdfs.Scheduler - Received 4 offers
12:46:54.989 [Thread-59] INFO  org.apache.mesos.hdfs.Scheduler - Launching node of type namenode with tasks [namenode, zkfc]
Saving the name node mesos-slave3 task.namenode.namenode.NameNodeExecutor.1426268814989
12:46:58.329 [Thread-60] INFO  org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.namenode.namenode.NameNodeExecutor.1426268814989 state=TASK_RUNNING message='' stagingTasks.size=2
12:46:58.330 [Thread-60] INFO  org.apache.mesos.hdfs.Scheduler - Current Acquisition Phase: START_NAME_NODES
12:46:58.330 [Thread-60] INFO  org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.journalnode.journalnode.NodeExecutor.1426268761923, slaveId=20150311-133327-169978048-5050-2699-S3
12:46:58.330 [Thread-60] INFO  org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.journalnode.journalnode.NodeExecutor.1426268762929, slaveId=20150311-133327-169978048-5050-2699-S2
12:46:58.330 [Thread-60] INFO  org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.journalnode.journalnode.NodeExecutor.1426268767883, slaveId=20150311-133327-169978048-5050-2699-S1
12:46:58.330 [Thread-60] INFO  org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.namenode.namenode.NameNodeExecutor.1426268773920, slaveId=20150311-133327-169978048-5050-2699-S3
12:46:58.331 [Thread-60] INFO  org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.zkfc.namenode.NameNodeExecutor.1426268773920, slaveId=20150311-133327-169978048-5050-2699-S3
12:46:58.331 [Thread-60] INFO  org.apache.mesos.hdfs.Scheduler - Sending message 'reload config' to taskId=task.namenode.namenode.NameNodeExecutor.1426268814989, slaveId=20150311-133327-169978048-5050-2699-S2
12:46:58.333 [Thread-61] INFO  org.apache.mesos.hdfs.Scheduler - Received status update for taskId=task.zkfc.namenode.NameNodeExecutor.1426268814989 state=TASK_RUNNING message='' stagingTasks.size=1
12:46:58.335 [Thread-61] INFO  org.apache.mesos.hdfs.Scheduler - Current Acquisition Phase: FORMAT_NAME_NODES
12:46:58.336 [Thread-61] INFO  org.apache.mesos.hdfs.Scheduler - Sending message '-b' to taskId=task.namenode.namenode.NameNodeExecutor.1426268774910, slaveId=20150311-133327-169978048-5050-2699-S2

nicgrayson avatar Mar 13 '15 17:03 nicgrayson

Hi @nicgrayson, yes, figuring out why the NN task was lost is an important detail that I would like to know. Do you have access to those logs? That said, even if the first task was lost, the framework should relaunch it on another node and bootstrap the second namenode successfully. I will see if I can reproduce this as well.

elingg avatar Mar 13 '15 17:03 elingg

You can see it relaunched a new NN (task ...1426268814989) but then sent the bootstrap message ('-b') to the old taskId (...1426268774910).

nicgrayson avatar Mar 13 '15 17:03 nicgrayson

Ah, yes, it seems it needs to update the LiveState appropriately and at the right time. I will see if I can reproduce.
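The race described above can be sketched roughly as follows. This is an illustrative Java toy, not the actual mesos/hdfs code: the class and method names (`LiveState`, `removeTask`, `pickBootstrapTarget`) are assumptions standing in for whatever the scheduler really uses. It shows how the FORMAT_NAME_NODES phase ends up targeting the dead taskId if the TASK_LOST update is not applied to the live state before the phase logic picks its target.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LiveStateSketch {

    // Hypothetical stand-in for the scheduler's LiveState:
    // tracks running name node tasks as taskId -> slaveId,
    // in launch order.
    static class LiveState {
        private final Map<String, String> nameNodeTasks = new LinkedHashMap<>();

        void addTask(String taskId, String slaveId) {
            nameNodeTasks.put(taskId, slaveId);
        }

        // Must run for every TASK_LOST status update *before* any
        // bootstrap message is sent; otherwise pickBootstrapTarget()
        // still returns the dead task.
        void removeTask(String taskId) {
            nameNodeTasks.remove(taskId);
        }

        // Picks the most recently launched name node task as the
        // target for the '-b' (bootstrap) message.
        String pickBootstrapTarget() {
            String last = null;
            for (String id : nameNodeTasks.keySet()) {
                last = id;
            }
            return last;
        }
    }

    public static void main(String[] args) {
        LiveState state = new LiveState();
        state.addTask("task.namenode.namenode.NameNodeExecutor.1426268774910", "S2");

        // TASK_LOST arrives but is NOT applied to the live state,
        // so the phase logic still sees the lost task as the target.
        System.out.println("stale target: " + state.pickBootstrapTarget());

        // Applying the TASK_LOST update and registering the relaunched
        // task before picking a target gives the correct taskId.
        state.removeTask("task.namenode.namenode.NameNodeExecutor.1426268774910");
        state.addTask("task.namenode.namenode.NameNodeExecutor.1426268814989", "S2");
        System.out.println("fresh target: " + state.pickBootstrapTarget());
    }
}
```

The point is simply ordering: the status-update handler has to mutate the live state before the acquisition-phase logic reads it, or the '-b' message goes to a taskId whose executor has already terminated, exactly as in the log above.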

elingg avatar Mar 13 '15 18:03 elingg