parallelly
HELP WANTED: availableWorkers()
Background
When submitting a job to TORQUE / PBS using something like:
qsub -l nodes=3:ppn=2 myjob.sh
the scheduler allocates 3 nodes with 2 cores each (= 6 cores total) to myjob.sh when it is launched. Exactly which 3 nodes are allocated is known to myjob.sh only at run time. This information is available in the file whose path is given by $PBS_NODEFILE, which is written by TORQUE / PBS, e.g.
$ cat $PBS_NODEFILE
n1
n1
n8
n8
n9
n9
Other HPC job schedulers use other files / environment variables for this.
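As a rough illustration of how the node file above could be consumed from R, here is a minimal sketch. The helper name read_pbs_nodefile() is hypothetical and not part of any package; the demo writes a temporary file with the same contents as the example above instead of relying on a real scheduler.

```r
# Hypothetical helper: read node names from the file that PBS_NODEFILE
# points to. Returns NULL when the variable is unset or the file is missing.
read_pbs_nodefile <- function(path = Sys.getenv("PBS_NODEFILE")) {
  if (!nzchar(path) || !file.exists(path)) return(NULL)
  nodes <- trimws(readLines(path, warn = FALSE))
  nodes[nzchar(nodes)]  # drop empty lines
}

# Demo: mimic the example node file with a temporary file
tmp <- tempfile()
writeLines(c("n1", "n1", "n8", "n8", "n9", "n9"), tmp)
nodes <- read_pbs_nodefile(tmp)

# Each node name is repeated once per allocated core,
# so tabulating the names gives the number of cores per node
cores_per_node <- table(nodes)
print(cores_per_node)
```

Because each node is listed once per core, the vector can be passed on as-is wherever one worker per core is wanted.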
Actions
Add an availableNodes() function that searches for common environment variables and returns a vector of node names, e.g.
> availableNodes()
[1] "n1" "n1" "n8" "n8" "n9" "n9"
If no known environment variables are found, the default fallback could be to return rep("localhost", times = availableCores()).
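A minimal sketch of the proposed lookup-with-fallback behavior might look as follows. It only handles PBS_NODEFILE, the function name available_nodes_sketch() is made up for illustration, and parallel::detectCores() is used as a stand-in for availableCores() so that the example runs without extra packages.

```r
# Hypothetical sketch of the proposal: look for a known scheduler variable
# and fall back to repeating "localhost" when none is found.
available_nodes_sketch <- function() {
  # PBS / TORQUE: the variable holds the path of a file with one node per line
  nodefile <- Sys.getenv("PBS_NODEFILE")
  if (nzchar(nodefile) && file.exists(nodefile)) {
    nodes <- readLines(nodefile, warn = FALSE)
    return(nodes[nzchar(nodes)])
  }

  # Fallback. Assumption: detectCores() stands in for availableCores() here.
  ncores <- parallel::detectCores()
  if (is.na(ncores)) ncores <- 1L
  rep("localhost", times = ncores)
}

# Without a scheduler, this returns "localhost" once per core
Sys.unsetenv("PBS_NODEFILE")
local_nodes <- available_nodes_sketch()

# With PBS_NODEFILE set, the file contents are returned instead
tmp <- tempfile()
writeLines(c("n1", "n1", "n8", "n8", "n9", "n9"), tmp)
Sys.setenv(PBS_NODEFILE = tmp)
pbs_nodes <- available_nodes_sketch()
Sys.unsetenv("PBS_NODEFILE")
```

Other schedulers would add further branches before the fallback, one per environment variable.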
The above would allow us to make workers = availableNodes() the new default for cluster futures (currently workers = availableCores()).
Identify these settings for the following schedulers:
- [x] PBS (Portable Batch System): Environment variable PBS_NODEFILE (the name of a file containing one node per line, where each node is repeated "ppn" times).
- [x] Oracle Grid Engine (aka Sun Grid Engine, CODINE, GRD): Environment variable PE_HOSTFILE (a file, format unclear), cf. https://www.ace-net.ca/wiki/Sun_Grid_Engine
- [x] Slurm (Simple Linux Utility for Resource Management): Environment variable SLURM_JOB_NODELIST (a list of nodes in a compressed format, e.g. "tux1,tux3,tux4" is stored as "tux[1,3-4]". Note that multiple "compressions" may exist, e.g. "compute-[0-6]-[0-15]". The number of nodes can be verified by SLURM_JOB_NUM_NODES. The "ppn" information is stored in SLURM_TASKS_PER_NODE).
- [x] LSF/OpenLava (Platform Load Sharing Facility).
  - [x] LSB_HOSTS
- [x]
- [ ] Spark
- [ ] OAR
- [ ] HTCondor
- [ ] Moab
- [x] PJM (https://staff.cs.manchester.ac.uk/~fumie/internal/Job_Operation_Software_en.pdf)
  - [x] PJM_O_NODEINF
    - "Path of the allocated node list file. For a job to which virtual nodes are allocated, the IP addresses of the nodes where the virtual nodes are placed are written one per line."
- [x] Add validation of the PBS_NODEFILE contents against the counts PBS_NP and / or PBS_NUM_NODES * PBS_NUM_PPN.
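Two items in the list above lend themselves to a short sketch: expanding the compressed Slurm node list, and checking a parsed node list against a count stored in an environment variable. All function names below are hypothetical. The expansion handles only the forms quoted above (comma-separated entries, bracketed lists and ranges, several bracket groups per entry) and is not a full reimplementation of Slurm's hostlist syntax.

```r
# Expand a bracket body such as "1,3-4" into c("1", "3", "4"),
# keeping zero padding (e.g. "01-03" gives "01", "02", "03").
expand_ranges <- function(spec) {
  parts <- strsplit(spec, ",", fixed = TRUE)[[1]]
  unlist(lapply(parts, function(part) {
    bounds <- strsplit(part, "-", fixed = TRUE)[[1]]
    if (length(bounds) == 1L) return(bounds)
    formatC(seq(as.integer(bounds[1]), as.integer(bounds[2])),
            width = nchar(bounds[1]), flag = "0")
  }))
}

# Expand one entry, e.g. "compute-[0-6]-[0-15]", one bracket group at a time.
expand_node_expr <- function(expr) {
  m <- regexpr("\\[[^]]*\\]", expr)
  start <- as.integer(m)
  if (start == -1L) return(expr)  # no bracket group left
  len <- attr(m, "match.length")
  prefix <- substr(expr, 1L, start - 1L)
  body <- substr(expr, start + 1L, start + len - 2L)
  tails <- expand_node_expr(substr(expr, start + len, nchar(expr)))
  ids <- expand_ranges(body)
  # All combinations, with the leftmost group varying slowest
  as.vector(t(outer(ids, tails, function(i, s) paste0(prefix, i, s))))
}

# Split on commas that are outside of brackets, then expand each entry.
expand_slurm_nodelist <- function(nodelist) {
  chars <- strsplit(nodelist, "", fixed = TRUE)[[1]]
  depth <- cumsum(chars == "[") - cumsum(chars == "]")
  cuts <- which(chars == "," & depth == 0L)
  entries <- substring(nodelist, c(1L, cuts + 1L), c(cuts - 1L, length(chars)))
  unlist(lapply(entries, expand_node_expr))
}

# Compare an observed count with an integer held in an environment variable.
# Returns NA when the variable is unset or does not hold an integer.
matches_env_count <- function(count, envvar) {
  expected <- suppressWarnings(as.integer(Sys.getenv(envvar, NA_character_)))
  if (is.na(expected)) return(NA)
  count == expected
}

nodes <- expand_slurm_nodelist("tux[1,3-4]")
print(nodes)
```

Under these assumptions, the Slurm check would be matches_env_count(length(nodes), "SLURM_JOB_NUM_NODES"), and the PBS check would compare the number of lines read from the PBS_NODEFILE file with "PBS_NP". Returning NA rather than FALSE when the variable is missing keeps "cannot verify" distinct from "mismatch".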