
Stuck in namenode or journalnode phase

Open mheppner opened this issue 9 years ago • 12 comments

On a fresh install of DC/OS v1.7-open, HDFS 2.5.2-0.1.9 does not launch with the default configuration. We have a cluster of 10 machines, each running CentOS 7 with 4 cores and 16 GB of RAM. HDFS will launch but will only get to either a namenode phase or a journalnode phase and does not proceed any further. SELinux has been disabled and all firewalls turned off. mesos-dns and zookeeper appear to be running. All prereqs are met and nothing obvious seems to be happening in the error log.

Possible duplicate of #225 or #242.

mheppner avatar May 11 '16 14:05 mheppner

it would be helpful to see the stdout log

kensipe avatar May 11 '16 15:05 kensipe

Sorry, nothing appears in stdout for quite some time. Here's the updated stdout log.

mheppner avatar May 11 '16 15:05 mheppner

also curl <ip_of_scheduler>:<port_of_scheduler>/v1/plan/ and provide output. the default port is 8765 but is likely remapped by marathon.
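For example, something like this (the host and port are placeholders; read them off the hdfs scheduler task in Marathon, and the pretty-printing pipe is optional):

# look up the scheduler's host:port in Marathon, then:
curl http://<ip_of_scheduler>:<port_of_scheduler>/v1/plan/ | python -m json.tool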

also... can you tell if it is launching NN2 tasks that are dying or if it never launches another task.

kensipe avatar May 11 '16 15:05 kensipe

> also curl <ip_of_scheduler>:<port_of_scheduler>/v1/plan/ and provide output.

Is this part of Marathon or Mesos? Using this on Mesos at /mesos/api/v1/plan gives no output and on Marathon at /service/marathon/v1/plan returns a 404.

> also... can you tell if it is launching NN2 tasks that are dying or if it never launches another task.

From previous logs, it looked like it kept failing to launch the namenodes, but it wasn't clear to me why. As of a few minutes ago, the health check deemed it dead and bounced it over to another machine, and it is now stuck in the journalnode phase. The error logs are just a bunch of connections to zookeeper, followed by the connections being closed.

12:07:25.508 [Thread-154] INFO  o.apache.mesos.offer.OfferEvaluator - EnoughCPU: false EnoughMem: false EnoughDisk: true EnoughPorts: true HasExpectedVolumes: false
12:07:25.508 [Thread-154] WARN  o.apache.mesos.offer.OfferEvaluator - No Offers found meeting Resource constraints.
12:07:25.508 [Thread-154] WARN  o.apache.mesos.offer.OfferEvaluator - No acceptable offers due to insufficient resources.

Is it something to do with this?

mheppner avatar May 11 '16 16:05 mheppner

If you are using DC/OS, the URL is <dcos_url>/service/hdfs/v1/plan/

Otherwise, I specifically mean the location of the HDFS scheduler, which can be seen in Marathon.
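As a concrete sketch (this assumes the service was installed under the default name hdfs):

# via the DC/OS admin router:
curl http://<dcos_url>/service/hdfs/v1/plan/ | python -m json.tool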

kensipe avatar May 11 '16 16:05 kensipe

It's now in the "running" phase, which I was never able to get to in the past week or so. I'm not sure what changed. Health checks are still failing and Marathon still reports it as "deploying."

Here's the output of /v1/plan

mheppner avatar May 11 '16 16:05 mheppner

This is very strange output... I would love to understand more. The namenode 2 is InProgress, but the full status is Complete. It isn't possible for the DN to be complete without NN2 completing at some point and being DNS resolvable. So at some point that was true.

It is more interesting that Marathon is "deploying"; that is likely true because the health checks are failing. After 3 failures it will assume the task is bad, kill the current one, and start a new one. You say this is repeatable? What are the steps? The only difference I noticed is that you are on CentOS, and all my testing was on our EC2 CoreOS instances. It's hard to imagine that causing the problem... any other differences that we should be aware of?
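One way to confirm that from the Marathon side (a sketch assuming the stock Marathon /v2 API and an app id of hdfs; adjust the id to whatever Marathon actually shows):

# tasksHealthy/tasksUnhealthy and healthCheckResults appear in the app detail
curl http://<dcos_url>/service/marathon/v2/apps/hdfs | python -m json.tool | grep -i health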

kensipe avatar May 11 '16 16:05 kensipe

is it possible to gain access to your cluster?

kensipe avatar May 11 '16 16:05 kensipe

My coworker also confirmed this on EC2 (I think they're using CoreOS) with the default hdfs installation. The steps are just to hit the install button for hdfs in the DC/OS Universe tab. I've followed the uninstall steps and retried the default installation multiple times, always resulting in the same situation. I've also tried doing an "advanced installation" in DC/OS by upping the CPU and RAM for the journal, name, and data nodes, but nothing behaves differently. I've also tried installing hdfs using the DC/OS CLI, as per the Mesosphere docs, roughly as sketched below.
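Roughly what I ran (the option keys below are illustrative placeholders, not the package's exact schema; `dcos package describe hdfs --config` should print the real one):

# write an options file that ups per-node resources
# (key names here are illustrative placeholders)
cat > options.json <<'EOF'
{
  "hdfs": {
    "name-node-cpus": 1.0,
    "name-node-heap-mb": 4096
  }
}
EOF
dcos package install hdfs --options=options.json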

The cluster isn't publicly accessible right now but I can look into what it would take to pass it through the firewall.

mheppner avatar May 11 '16 18:05 mheppner

Looks like this might be a duplicate of #258, but it will require more research.

jgarcia-mesosphere avatar May 20 '16 00:05 jgarcia-mesosphere

I think you meant #257, it looks pretty similar.

mheppner avatar May 20 '16 18:05 mheppner

@malnick We may have the same problem in #262. Your log says, for example:

11:50:37.840 [Thread-223] INFO  o.apache.mesos.offer.OfferEvaluator - EnoughCPU: false EnoughMem: true EnoughDisk: true EnoughPorts: true HasExpectedVolumes: false
11:50:37.840 [Thread-223] INFO  o.apache.mesos.offer.OfferEvaluator - EnoughCPU: false EnoughMem: false EnoughDisk: true EnoughPorts: true HasExpectedVolumes: true
11:50:37.840 [Thread-223] INFO  o.apache.mesos.offer.OfferEvaluator - EnoughCPU: false EnoughMem: false EnoughDisk: true EnoughPorts: true HasExpectedVolumes: false

So that means the offers have insufficient resources.
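A quick way to see what the agents actually have left to offer (assuming direct access to the Mesos master on its default port 5050; in DC/OS the same data should be behind <dcos_url>/mesos/state.json):

# each agent entry lists total and used resources; compare
# against what the journal/name/data nodes are requesting
curl http://<mesos_master>:5050/state.json | python -m json.tool | grep -A4 '"resources"'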

todormazgalov avatar Aug 17 '16 11:08 todormazgalov