
HPC Cluster Problems

Open Pas0691 opened this issue 4 years ago • 1 comments

Hey guys very very cool job so far.

I'm not quite sure if that's a huge issue, but I wasn't able to find a solution by myself.

Goal: I want to set up a Python cluster on a Windows HPC cluster.

Installed SW: Windows Server 2012 on the head node, HPC Pack 2016 for cluster management, and Anaconda for managing Python.

What I have done so far: installed all ipcluster dependencies and got a local cluster (ipcluster start -n 2) working without issues. I have not connected to any engines yet; I thought that would reduce the number of possible failure points.

However, when I try to use the WindowsHPC controller, the cluster does not start up; it fails with:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\ipclusterapp.py", line 543, in start_controller
    self.controller_launcher.start()
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\launcher.py", line 973, in start
    return super(WindowsHPCControllerLauncher, self).start(1)
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\launcher.py", line 914, in start
    output = check_output([self.job_cmd] + args,
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\Program Files\Microsoft HPC Pack 2016\Bin\job.EXE', 'submit', '/jobfile:C:\Users\xxx\.ipython\profile_default\ipcontroller_job.xml', '/scheduler:']' returned non-zero exit status 1.

ERROR:tornado.application:Exception in callback functools.partial(<function IPClusterStart.start.<locals>.start at 0x000000B31324A670>)
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\tornado\ioloop.py", line 743, in _run_callback
    ret = callback()
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\ipclusterapp.py", line 588, in start
    self.start_controller()
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\ipclusterapp.py", line 543, in start_controller
    self.controller_launcher.start()
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\launcher.py", line 973, in start
    return super(WindowsHPCControllerLauncher, self).start(1)
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\site-packages\ipyparallel\apps\launcher.py", line 914, in start
    output = check_output([self.job_cmd] + args,
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\ProgramData\Anaconda3\envs\pythoncluster\lib\subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\Program Files\Microsoft HPC Pack 2016\Bin\job.EXE', 'submit', '/jobfile:C:\Users\xxx\.ipython\profile_default\ipcontroller_job.xml', '/scheduler:']' returned non-zero exit status 1.

I suspected wrong paths, but unfortunately that wasn't the problem. I guess the problem isn't that big, but I couldn't track down the source. I tried to highlight the most interesting part of the message.

Pas0691 avatar Nov 12 '20 12:11 Pas0691

Hi! I’m going through and cleaning up old/stale issues on this repo. Sorry for not responding in a reasonable amount of time!

Can you run the job submit command yourself (outside ipcluster) and maybe get better feedback from there? IPCluster has a habit of hiding the useful errors from the underlying system, but the generated ipcontroller_job.xml should still exist after the submission failed.
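
If it is easier to do from Python than from a command prompt, here is a minimal sketch of running the same command by hand and keeping its output. The helper name run_and_report is hypothetical (it is not part of ipyparallel); it uses only the standard-library subprocess module and does not raise on a non-zero exit, so whatever job.EXE prints stays visible. The command in the commented example is copied from the traceback above and will need adjusting for your machine.

```python
import subprocess


def run_and_report(cmd):
    """Run cmd and return (exit_code, stdout, stderr) without raising on failure.

    Hypothetical helper for debugging; not part of ipyparallel.
    """
    # capture_output=True collects both streams; text=True decodes them to str.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Example usage (paths taken from the traceback; adjust as needed):
# cmd = [
#     r"C:\Program Files\Microsoft HPC Pack 2016\Bin\job.EXE",
#     "submit",
#     r"/jobfile:C:\Users\xxx\.ipython\profile_default\ipcontroller_job.xml",
#     "/scheduler:",
# ]
# code, out, err = run_and_report(cmd)
# print("exit code:", code)
# print("stdout:", out)
# print("stderr:", err)
```

Note that the traceback shows an empty value after /scheduler:, so trying the command with the head node's name filled in there may be worth a shot. That is a guess from the log, not a confirmed diagnosis.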

minrk avatar Jun 04 '21 13:06 minrk