
RDMA Hadoop/Spark not working with Slurm submission scripts

Open casty8 opened this issue 6 years ago • 3 comments

I have configured RDMA Hadoop and Spark myself on an InfiniBand cluster and they work, but when I try the submission script magpie.sbatch-srun-spark-with-yarn-and-hdfs (just testing Hadoop for now), Slurm allocates the nodes fine but the job doesn't work properly. The ResourceManager appears in the jps output but never actually starts, showing an InfiniBand error in resourcemanager.out while showing no errors in the .log file, and the NodeManager .log files show a connection problem to the ResourceManager node.

It seems these scripts are not ready for this RDMA version of Hadoop and Spark, because I can make everything work fine by myself with the conf files provided in the Hadoop guide I followed. Any suggestions?

I would really appreciate any help you can provide.

casty8 avatar Mar 16 '18 09:03 casty8

I have never tested with RDMA Hadoop, so I don't know if Magpie works with it. Any number of changes in RDMA Hadoop could make it incompatible with Magpie, since Magpie assumes the Hadoop scripts behave in a certain way, that its patches apply cleanly, that the same configuration and tool options exist, etc. etc.

Without any knowledge of your situation, here's a guess on the problem.

Magpie assumes the node's hostname as configured in Slurm, such as "foo[1-10]", is the hostname to use for network communication. i.e. the Hadoop NodeManager works off the host and port foo1:1234 and connects to the DataNode at foo2:5678.

If your cluster is not like this, then perhaps the InfiniBand portion of RDMA Hadoop is confused, because it's trying to connect to the host/IP that Magpie configured for it, which is not the host/IP it wants.
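One way to check that assumption is to confirm that each node name Slurm reports actually resolves to the address you expect the RDMA/InfiniBand side to use. Here is a rough sketch (not part of Magpie itself); `localhost` is a stand-in for a real node name like foo1:

```shell
# Sanity check: does the Slurm-reported hostname resolve, and to which IP?
# Replace "localhost" with an actual node name from your allocation.
node=localhost
addr=$(getent hosts "$node" | awk '{print $1; exit}')
echo "$node resolves to ${addr:-nothing}"
```

If the address printed is not on the interface RDMA Hadoop expects, that would explain the connection failures.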

Other than that, I think good old-fashioned log/conf file debugging is the way to go. I'm glad to help. The script magpie-gather-config-files-and-logs-script.sh is a good way to gather the conf/log files to begin debugging with.

chu11 avatar Mar 16 '18 17:03 chu11

I'm trying to launch the job with the script you recommended, but I'm getting errors like these in the slurm-jobid.out file:

    magpie-output-config-files-script.sh: 10: [: 0: unexpected operator
    magpie-gather-config-files-and-logs-script.sh: 29: [: y: unexpected operator

It also shows this one:

    Magpie Internal Error: Magpie_get_networkedhdfspath called without HDFS networked path set used

I have tried modifying some of the config files provided by Magpie, adding options used in my own files, but it still doesn't work.

casty8 avatar Mar 21 '18 10:03 casty8

I'm unsure of your setup, but something core/basic seems to be wrong. It's unclear what it could be.

For

 magpie-output-config-files-script.sh: 10: [: 0: unexpected operator

the error is this line

if [ "${MAGPIE_CLUSTER_NODERANK}" == "0" ]                                                               

The environment variable MAGPIE_CLUSTER_NODERANK isn't defined, leading to the script error. It is normally defined by Magpie in magpie/exports/magpie-exports-submission-type, so some earlier error is preventing that environment variable from being generated. This is a pretty core part of Magpie, and it probably means your setup is unique in some way that prevents Magpie from calculating your job's node rank.
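One other possibility worth ruling out (an assumption on my part, since I can't see how the scripts are being invoked on your cluster): `[: 0: unexpected operator` is also the exact message dash (often `/bin/sh` on Debian/Ubuntu systems) prints when a script using the bash-only `==` operator is run with `sh` instead of `bash`, even when the variable is defined. The POSIX-portable spelling uses a single `=`:

```shell
# dash rejects '==' inside [ ] with "unexpected operator"; POSIX sh only
# understands a single '='.  The variable is set here purely for illustration.
MAGPIE_CLUSTER_NODERANK=0
if [ "${MAGPIE_CLUSTER_NODERANK}" = "0" ]; then
    echo "running on the rank-0 (master) node"
fi
```

If your scripts are launched via `sh` rather than `bash`, that alone would reproduce both `unexpected operator` errors you pasted.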

Perhaps you can try a simple test. If you run a job, can you output the environment variables SLURM_NODEID, SLURM_NNODES, SLURM_JOB_NODELIST, SLURM_JOB_NAME, and SLURM_JOB_ID on each node of your allocation? Magpie needs these, and I believe at the moment it simply assumes Slurm always provides them.
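A sketch of such a per-node probe (my own snippet, not a Magpie script) that prints those variables and substitutes UNSET for any that are missing, so gaps stand out; run it under srun so it executes once per allocated node:

```shell
# Print the Slurm variables Magpie depends on, flagging any that are unset.
report="host=$(hostname)"
for v in SLURM_NODEID SLURM_NNODES SLURM_JOB_NODELIST SLURM_JOB_NAME SLURM_JOB_ID; do
    eval "val=\${$v:-UNSET}"
    report="$report $v=$val"
done
echo "$report"
```

If any node reports UNSET for one of these, that would explain why MAGPIE_CLUSTER_NODERANK never gets generated.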

chu11 avatar Mar 22 '18 00:03 chu11