
HELP WANTED: availableWorkers()

Open HenrikBengtsson opened this issue 8 years ago • 1 comments

Background

When submitting a job to TORQUE / PBS using something like:

qsub -l nodes=3:ppn=2 myjob.sh

the scheduler will allocate 3 nodes with 2 cores each (= 6 cores total) for myjob.sh when launched. Exactly which 3 nodes is only known to myjob.sh at run time. This information is available in a file $PBS_NODEFILE written by TORQUE / PBS, e.g.

$ cat $PBS_NODEFILE
n1
n1
n8
n8
n9
n9

Other HPC job schedulers use other files / environment variables for this.

Actions

Add an availableNodes() function that searches for common environment variables and returns a vector of node names, e.g.

> availableNodes()
[1] "n1" "n1" "n8" "n8" "n9" "n9"

If no known environment variables are found, the default fallback could be to return rep("localhost", times = availableCores()).
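The lookup and fallback described above could be sketched roughly as follows. This is only an illustration: the name available_nodes_sketch() is hypothetical, only PBS_NODEFILE is handled, and parallel::detectCores() stands in for availableCores() so that the snippet runs without extra packages.

```r
# Hypothetical sketch, not the package's implementation.
# Assumption: parallel::detectCores() is used in place of availableCores().
available_nodes_sketch <- function(ncores = parallel::detectCores()) {
  # TORQUE / PBS: a file with one node name per line,
  # where each node is repeated "ppn" times
  pathname <- Sys.getenv("PBS_NODEFILE", unset = "")
  if (nzchar(pathname) && file.exists(pathname)) {
    nodes <- readLines(pathname, warn = FALSE)
    nodes <- nodes[nzchar(nodes)]  # drop empty lines
    if (length(nodes) > 0L) return(nodes)
  }

  # Fallback: no known scheduler detected, so use the local machine
  if (is.na(ncores)) ncores <- 1L
  rep("localhost", times = ncores)
}
```

Inside the qsub example above this would return the six entries listed in $PBS_NODEFILE; outside a scheduler it returns one "localhost" per core.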

The above would allow us to make workers = availableNodes() the new default for cluster futures (currently workers = availableCores()).

Identify these settings for the following schedulers:

  • [x] PBS (Portable Batch System): Environment variable PBS_NODEFILE (the name of a file containing one node per line where each node is repeated "ppn" times).
  • [x] Oracle Grid Engine (aka Sun Grid Engine, CODINE, GRD). Environment variable PE_HOSTFILE (a file, format unclear), cf. https://www.ace-net.ca/wiki/Sun_Grid_Engine
  • [x] Slurm (Simple Linux Utility for Resource Management). Environment variable SLURM_JOB_NODELIST (list of nodes in a compressed format, e.g. instead of "tux1,tux3,tux4" it is stored as "tux[1,3-4]". Note that multiple "compressions" may exist, e.g. "compute-[0-6]-[0-15]". The number of nodes can be verified by SLURM_JOB_NUM_NODES. The "ppn" information is stored in SLURM_TASKS_PER_NODE).
  • [x] LSF/OpenLava (Platform Load Sharing Facility).
    • [x] LSB_HOSTS
  • [ ] Spark
  • [ ] OAR
  • [ ] HTCondor
  • [ ] Moab
  • [x] PJM (https://staff.cs.manchester.ac.uk/~fumie/internal/Job_Operation_Software_en.pdf)
    • [x] PJM_O_NODEINF - "Path of the allocated node list file. For a job to which virtual nodes are allocated, the IP addresses of the nodes where the virtual nodes are placed are written one per line."
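Of the formats listed above, the compressed Slurm node list is the one that needs real parsing. A minimal sketch of expanding it in plain R is given below. The function names expand_one() and expand_nodelist() are hypothetical, and only the bracket syntax shown in the examples above (comma-separated ids and "a-b" ranges, possibly in several groups) is assumed. On a real cluster, `scontrol show hostnames` is probably the more robust way to do this.

```r
# Hypothetical helpers; they cover only the bracket syntax from the examples.

# Expand the first "[...]" group of a single entry, then recurse on the rest,
# e.g. "tux[1,3-4]" -> "tux1" "tux3" "tux4"
expand_one <- function(x) {
  m <- regexpr("\\[[^\\]]*\\]", x, perl = TRUE)
  start <- as.integer(m)
  if (start == -1L) return(x)  # no groups left to expand
  len <- attr(m, "match.length")
  prefix <- substr(x, 1L, start - 1L)
  group <- substr(x, start + 1L, start + len - 2L)
  suffix <- substr(x, start + len, nchar(x))
  ids <- unlist(lapply(strsplit(group, ",", fixed = TRUE)[[1]], function(part) {
    range <- strsplit(part, "-", fixed = TRUE)[[1]]
    if (length(range) == 1L) return(range)
    # Preserve zero padding such as "01-02"
    formatC(seq(as.integer(range[1]), as.integer(range[2])),
            width = nchar(range[1]), flag = "0")
  }))
  unlist(lapply(paste0(prefix, ids, suffix), expand_one))
}

# Split on commas that are outside brackets, then expand each entry
expand_nodelist <- function(nodelist) {
  chars <- strsplit(nodelist, "", fixed = TRUE)[[1]]
  depth <- cumsum(chars == "[") - cumsum(chars == "]")
  chars[chars == "," & depth == 0L] <- "\n"
  parts <- strsplit(paste(chars, collapse = ""), "\n", fixed = TRUE)[[1]]
  unlist(lapply(parts, expand_one))
}
```

For instance, expand_nodelist("tux[1,3-4]") gives "tux1" "tux3" "tux4", and the length of the result can be compared against SLURM_JOB_NUM_NODES as a sanity check.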

HenrikBengtsson, Dec 29 '16 16:12

Add validation of the PBS_NODEFILE contents against the counts PBS_NP and / or PBS_NUM_NODES * PBS_NUM_PPN.
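Such a check might look roughly like the sketch below. The function name check_pbs_node_count() is hypothetical, and the choice to warn rather than stop on a mismatch is an assumption; counts that are unset or not integers are simply skipped.

```r
# Hypothetical sketch: compare the number of node entries with the PBS counts.
# Returns TRUE if every available count agrees, otherwise warns and returns FALSE.
check_pbs_node_count <- function(nodes) {
  get_int <- function(name) {
    # NA if the variable is unset or not an integer
    suppressWarnings(as.integer(Sys.getenv(name, unset = NA_character_)))
  }
  np <- get_int("PBS_NP")
  nodes_x_ppn <- get_int("PBS_NUM_NODES") * get_int("PBS_NUM_PPN")

  ok <- TRUE
  if (!is.na(np) && length(nodes) != np) {
    warning(sprintf("Node file lists %d entries but PBS_NP is %d",
                    length(nodes), np))
    ok <- FALSE
  }
  if (!is.na(nodes_x_ppn) && length(nodes) != nodes_x_ppn) {
    warning(sprintf("Node file lists %d entries but PBS_NUM_NODES * PBS_NUM_PPN is %d",
                    length(nodes), nodes_x_ppn))
    ok <- FALSE
  }
  ok
}
```

With the qsub example above (nodes=3:ppn=2), the six entries in the node file would be expected to match both PBS_NP = 6 and 3 * 2 = 6.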

HenrikBengtsson, Jan 07 '17 01:01