drmaa-python
MultiThreading and MUNGE
Hi,
I have happily used drmaa-python for many years with our SGE cluster. Just recently a new cluster was installed, and this time it is configured to use MUNGE security.
If I create and submit a simple job, everything works fine, but if I run the job submission as part of a thread pool I get an error about MUNGE security.
For example:
import drmaa
from multiprocessing.pool import ThreadPool
import tempfile
import os
import stat

pool = ThreadPool(2)
session = drmaa.Session()
session.initialize()

def pTask(n):
    smt = "ls . > test.out"
    script_file = tempfile.NamedTemporaryFile(mode="w", dir=os.getcwd(), delete=False)
    script_file.write(smt)
    script_file.close()
    print "Job is in file %s" % script_file.name
    os.chmod(script_file.name, stat.S_IRWXG | stat.S_IRWXU)
    jt = session.createJobTemplate()
    print "jt created"
    jt.jobEnvironment = {'BASH_ENV': '~/.bashrc'}
    print "environment set"
    jt.remoteCommand = os.path.join(os.getcwd(), script_file.name)
    print "remote command set"
    jobid = session.runJob(jt)
    print "Job submitted with id: %s, waiting ..." % jobid
    retval = session.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)

pool.map(pTask, (1,))
This produces the following output:
Job is in file /home/userid/tmpbRa0IO
jt created
environment set
error: getting configuration: MUNGE authentication failed: Invalid credential format
remote command set
Traceback (most recent call last):
  File "test_threads.py", line 31, in <module>
    pool.map(pTask, (1,))
  File "/home/mb1ims/.conda/envs/sharc/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/home/mb1ims/.conda/envs/sharc/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
drmaa.errors.DeniedByDrmException: code 17: MUNGE authentication failed: Invalid credential format
So the first sign of trouble appears when jt.remoteCommand is set, but the script continues and fails with an unhandled Python exception when session.runJob is executed.
I'm not familiar with MUNGE, but I'm guessing only one thread can submit a job at a time because of how MUNGE does validation. Have you tried using a lock around job submission?
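For reference, a lock around submission could look roughly like the sketch below. The helper names (submit_lock, submit_job, submit_all) are made up for illustration; only createJobTemplate, runJob and deleteJobTemplate are drmaa-python session methods. Whether serialising the calls actually avoids the MUNGE error is an untested assumption.

```python
import threading
from multiprocessing.pool import ThreadPool

# Hypothetical lock shared by all worker threads; it serialises the
# template setup and the runJob call so only one thread is inside the
# DRMAA library during submission.
submit_lock = threading.Lock()


def submit_job(session, command):
    """Create a job template and submit it while holding submit_lock."""
    with submit_lock:
        jt = session.createJobTemplate()
        try:
            jt.remoteCommand = command
            return session.runJob(jt)
        finally:
            # Release the template whether or not the submission worked.
            session.deleteJobTemplate(jt)


def submit_all(session, commands, workers=2):
    """Submit every command from a thread pool, one submission at a time."""
    pool = ThreadPool(workers)
    try:
        return pool.map(lambda cmd: submit_job(session, cmd), commands)
    finally:
        pool.close()
        pool.join()
```

With a real cluster, session would be an initialized drmaa.Session() and each command the path to an executable script. The sketch only guards submission; calls to session.wait would still run concurrently from the worker threads, and it is not known from this thread whether those also need the lock.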
Alternatively, have you tried using a job array for submission? That would be a single request that submits multiple jobs.
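A job array submission with drmaa-python could be sketched as follows. submit_array and main are hypothetical names and the script path is a placeholder; runBulkJobs and synchronize are the session methods that do the work. Note that an array runs the same command for every task, so the tasks have to tell themselves apart (under SGE, via the SGE_TASK_ID environment variable).

```python
def submit_array(session, command, n_tasks):
    """Submit n_tasks copies of command as one array job.

    Returns the list of job ids reported by the session.
    """
    jt = session.createJobTemplate()
    try:
        jt.remoteCommand = command
        # runBulkJobs(template, beginIndex, endIndex, step)
        return session.runBulkJobs(jt, 1, n_tasks, 1)
    finally:
        session.deleteJobTemplate(jt)


def main():
    # Imported here so the helper above can be read and tested without
    # a DRMAA library installed.
    import drmaa

    session = drmaa.Session()
    session.initialize()
    try:
        # "/path/to/script.sh" is a placeholder, not a real path.
        job_ids = submit_array(session, "/path/to/script.sh", 10)
        # Block until every task in the array has finished.
        session.synchronize(job_ids, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
    finally:
        session.exit()
```

Since everything happens in one request from one thread, this sidesteps the thread pool entirely, but it only fits workloads where all jobs share one script.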
At the end of the day, I'm guessing this will require a conversation with your cluster's admins to understand how they have configured this security protocol and what qualifies as acceptable usage. I would be interested to hear the results of that conversation and whether there are things DRMAA could do to make it easier to use for this case.
Has there been any more discussion on this issue? I have an application that uses drmaa to submit jobs to an SGE cluster and I am getting the "Invalid credential format" error from MUNGE. I have isolated the issue to drmaa's job submission function, but I don't know what to do from here. The MUNGE developer claimed that this error indicates that the MUNGE credential is getting truncated.
Not here. The furthest we got was confirming that this isn't a problem with the Java DRMAA library: we can submit jobs from multiple threads in Java with no problem.
Since all our application does is submit DRMAA jobs, we've moved away from a multi-threading paradigm and now use a different structure to manage the different job streams to get around the problem.
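The thread doesn't say what that structure looks like. Purely as an illustration, and not the poster's actual code, one single-threaded way to drive several job streams is to submit the first job of each stream and then wait for any job to finish before submitting that stream's next one. run_streams and its arguments are hypothetical; with a real session, any_job would be drmaa.Session.JOB_IDS_SESSION_ANY and timeout would be drmaa.Session.TIMEOUT_WAIT_FOREVER.

```python
def run_streams(session, streams, any_job, timeout):
    """Run each stream's commands in order, using only the calling thread.

    streams maps a stream name to a list of commands. Returns a list of
    (stream name, job id) pairs in completion order.
    """
    pending = {name: list(cmds) for name, cmds in streams.items()}
    running = {}   # job id -> name of the stream that owns it
    finished = []

    def submit_next(name):
        # Submit the stream's next command, if it has one left.
        if pending[name]:
            jt = session.createJobTemplate()
            try:
                jt.remoteCommand = pending[name].pop(0)
                running[session.runJob(jt)] = name
            finally:
                session.deleteJobTemplate(jt)

    for name in list(pending):
        submit_next(name)

    while running:
        # Blocks until any job from this session finishes.
        info = session.wait(any_job, timeout)
        name = running.pop(info.jobId)
        finished.append((name, info.jobId))
        submit_next(name)

    return finished
```

The sketch ignores exit statuses; a real version would probably inspect the job info returned by session.wait before moving a stream forward.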