hdfs-deprecated

journalnode - fails to start / is not tried on another node

Open samek opened this issue 9 years ago • 3 comments

When deploying the framework, if a journal node fails to start, it'll keep trying to start on the same node.

So if a node has a problem of some kind, the journal node will never start. I think we should try to start it somewhere else if it fails X times.
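To make the proposal concrete, here is a minimal sketch of "move elsewhere after X failures". It is not code from the framework: `FailureTracker`, `pick_host`, and the host names are hypothetical, and the threshold of 3 is just a placeholder for X.

```python
class FailureTracker:
    """Counts launch failures per (task, host) pair.

    Hypothetical helper for illustration; not part of the HDFS framework.
    """

    def __init__(self, max_failures=3):
        # max_failures plays the role of "X" in the proposal above.
        self.max_failures = max_failures
        self._failures = {}

    def record_failure(self, task, host):
        key = (task, host)
        self._failures[key] = self._failures.get(key, 0) + 1

    def is_banned(self, task, host):
        # A host is skipped for this task once it has failed X times.
        return self._failures.get((task, host), 0) >= self.max_failures

    def pick_host(self, task, preferred, candidates):
        """Return the preferred host unless it is banned, else the first
        candidate that is not banned, else None."""
        if not self.is_banned(task, preferred):
            return preferred
        for host in candidates:
            if not self.is_banned(task, host):
                return host
        return None


tracker = FailureTracker(max_failures=3)
for _ in range(3):
    tracker.record_failure("journalnode1", "host-a")

# host-a has failed 3 times, so the next attempt goes to another host.
print(tracker.pick_host("journalnode1", "host-a", ["host-a", "host-b"]))
```

Counting per (task, host) pair rather than per host means a host that cannot run one journal node is not automatically ruled out for other tasks. Whether that is the right granularity is an open design question.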

samek avatar Jun 18 '15 08:06 samek

Hi samek, I don't quite understand your issue. Can you clarify? My understanding is that if the scheduler is working but a slave running a journal node is lost, the scheduler will try to re-launch the journal node on that lost slave within a timeout (taken from "mesos.hdfs.deadnode.timeout.seconds", default 90s). If the timeout expires, the scheduler will use another slave to launch the journal node. Is that right, @elingg?

tangzhankun avatar Jun 18 '15 15:06 tangzhankun

When I started the HDFS framework, it created 3 journal nodes. In my case, one of those nodes could not download the tgz package (network issue), and it kept restarting indefinitely on the same host and failing.

samek avatar Jun 18 '15 15:06 samek

Thanks for reporting this, @samek! Yes, that's correct, @tangzhankun: if the scheduler is working but a slave running a journal node is lost, the scheduler will try to re-launch the journal node on that lost slave within a timeout (taken from "mesos.hdfs.deadnode.timeout.seconds", default 90s). If the timeout expires, the scheduler will use another slave to launch the journal node. In the case of repeated failure (restarting over and over and failing each time), this will certainly cause a problem.
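The timeout behavior described here can be sketched roughly as follows. This is only an illustration of the decision as stated in this thread, not the scheduler's actual code: `choose_relaunch_target` and its parameters are hypothetical names, the 90-second default is the one quoted for "mesos.hdfs.deadnode.timeout.seconds", and the strict `<` at the boundary is an assumption.

```python
def choose_relaunch_target(lost_at, now, original_host, other_hosts,
                           timeout_seconds=90):
    """Pick where to re-launch a journal node whose slave was lost.

    Hypothetical sketch: times are in seconds, and timeout_seconds stands in
    for the value of mesos.hdfs.deadnode.timeout.seconds (default 90).
    """
    if now - lost_at < timeout_seconds:
        # Still inside the dead-node timeout: wait for the original slave.
        return original_host
    # Timeout expired: fall back to another slave, if one is available.
    return other_hosts[0] if other_hosts else None


# 30 seconds after the slave was lost, the original host is still preferred.
print(choose_relaunch_target(0, 30, "host-a", ["host-b"]))
# 120 seconds after the loss, the journal node moves to another host.
print(choose_relaunch_target(0, 120, "host-a", ["host-b"]))
```

Note that this decision depends only on the slave being lost. In the sketch, a task that keeps failing on a slave that is still reachable never reaches the fallback branch, which matches the repeated-failure problem reported above.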

elingg avatar Jun 18 '15 16:06 elingg