TORQUE trouble

Open wlandau-lilly opened this issue 8 years ago • 4 comments

I am having trouble running batchtools jobs on a local installation of TORQUE on Ubuntu 16.04. I think TORQUE is working because the following test.pbs produces the expected output.

#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00

cd $PBS_O_WORKDIR
touch done.txt
echo "done"

However, all my jobs hang in the E state. For example, the following R script waits indefinitely.

library("batchtools")
cf <- makeClusterFunctionsTORQUE("torque.tmpl") 
reg <- makeRegistry(NA)
reg$cluster.functions <- cf
batchMap(fun = identity, x = 1:4)
submitJobs()
waitForJobs() # waits here indefinitely
reduceResultsList() # not reached

In my case, the console message from waitForJobs()

Waiting (S:4 R:4 D:0 E:0) [-------------------]   0% eta:  ?s

does not match qstat, which shows jobs hanging in the E state.

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
98.localhost              ...8d7bd98804b04 wlandau         00:00:00 E batch          
99.localhost              ...fcce12fedcace wlandau         00:00:00 E batch          
100.localhost             ...dc63017b37ac6 wlandau         00:00:00 E batch          
101.localhost             ...b060e52879b8e wlandau         00:00:00 E batch 
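
To compare the scheduler's view with the batchtools progress line, the state column of the qstat listing can be tallied. The following is a minimal sketch, assuming the default qstat layout shown above (state in the second-to-last column, queue in the last). The function name count_job_states is hypothetical and not part of TORQUE or batchtools.

```shell
#!/bin/sh
# count_job_states: hypothetical helper, not part of TORQUE or batchtools.
# Reads default `qstat` output on stdin and prints one "STATE COUNT" line
# per job state, sorted by state letter.
# Assumption: the state letter is the second-to-last whitespace-separated
# field of every job row, as in the listing above.
count_job_states() {
  awk '
    /^-+/ { body = 1; next }             # rows start after the dashed rule
    body && NF >= 6 { n[$(NF - 1)]++ }   # tally the state column
    END { for (s in n) print s, n[s] }
  ' | sort
}

# Usage on a live system (requires qstat on PATH):
#   qstat | count_job_states
```

Fed the four-job listing above, this prints `E 4`, whereas the batchtools progress line reports the same four jobs as running.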

I am using @HenrikBengtsson's torque.tmpl from future.batchtools.

Related: see my Stack Overflow post here and HenrikBengtsson/future.batchtools#12.

wlandau-lilly • Nov 01 '17 11:11

Looks like the system is not set up properly. Can you submit and run jobs manually?

mllg • Nov 03 '17 08:11

Pretty much. For jobs that do not depend on other jobs (as opposed to drake with the future-powered parallel backend), the following test.pbs script generates the correct output.

#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00

cd $PBS_O_WORKDIR
touch done.txt
echo "done"

Then the job hangs in the E state indefinitely.

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
46.localhost              test             wlandau         00:00:00 E batch   

I was just using a simple qsub test.pbs.
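
The manual check described here (submit, confirm that done.txt appears, then look at the job state) can be scripted so that it does not wait forever. This is a rough sketch under stated assumptions: wait_for_file is a hypothetical helper, and the 120-second limit in the usage comment is an arbitrary choice.

```shell
#!/bin/sh
# wait_for_file: hypothetical helper. Polls once per second until the file
# exists or the timeout (in seconds) has elapsed.
# Returns 0 if the file was found, 1 if the timeout expired first.
wait_for_file() {
  file=$1
  timeout=$2
  waited=0
  while [ ! -e "$file" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Usage on a machine with TORQUE (not executed here):
#   job_id=$(qsub test.pbs)                  # qsub prints the new job id
#   wait_for_file done.txt 120 && qstat "$job_id"
```

If done.txt shows up but qstat still lists the job in the E state, the script body itself ran and the problem lies in what happens after it finishes.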

wlandau-lilly • Nov 03 '17 14:11

So the manual job also gets stuck in the E state (E for exiting)? Then this is a configuration issue.

mllg • Nov 06 '17 08:11

Seems about right. I just wish I knew what the right configuration was.

wlandau-lilly • Nov 06 '17 13:11
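
The thread does not identify the misconfiguration, so the following is only a starting point for collecting evidence, not a fix. It is a sketch that runs a few standard TORQUE tools (qstat, pbsnodes, tracejob, momctl) against one stuck job. The helper names run_if_present and torque_diag are hypothetical, and which output matters depends on the installation.

```shell
#!/bin/sh
# run_if_present: hypothetical helper. Runs a command only if its
# executable is on PATH, and reports a non-zero exit instead of aborting.
run_if_present() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "== $*"
    "$@" || echo "== $1 exited with status $?"
  else
    echo "== $1: not found on PATH, skipped"
  fi
}

# torque_diag: hypothetical helper that gathers scheduler state for one job.
# tracejob and momctl typically need root, since they read the server and
# MOM logs and query the local pbs_mom daemon.
torque_diag() {
  job_id=$1                            # e.g. 46.localhost from qstat
  run_if_present qstat -f "$job_id"    # full job record
  run_if_present pbsnodes -a           # node states as seen by pbs_server
  run_if_present tracejob "$job_id"    # log entries that mention the job
  run_if_present momctl -d 3           # diagnostics from the local pbs_mom
  return 0
}

# Usage:
#   torque_diag 46.localhost > torque_diag.log 2>&1
```

Saving the output to a file makes it easier to attach to a follow-up comment or a question on the TORQUE users list.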