oci-hpc

Potential problem handling array jobs

Open cbutakoff opened this issue 1 year ago • 2 comments

I have an array job limited to 2 jobs at a time:

   2145_[4-190%2]   compute   EP_108      opc PD       0:00     10 (JobArrayTaskLimit)
   2145_3   compute   EP_108      opc  R      48:42     10 compute-hpc-node-[100,373,397,421,425,429,455,457,813,896]
   2145_2   compute   EP_108      opc  R    4:13:06     10 compute-hpc-node-[69,237,245,272,347,553,724,817,931,993]

But Slurm or OCI still tries to provision a third cluster and fails (because of a lack of available nodes), and it just keeps retrying. E.g.: (screenshot: Selection_022)
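For reference, a submission that produces a queue state like the one above might look roughly like the sketch below. The job name, partition, and node count are taken from the `squeue` output; the starting array index (2) and the echoed message are assumptions, since the original batch script is not shown. The `%2` suffix on `--array` is the standard Slurm way to cap the number of array tasks running at the same time.

```shell
#!/bin/bash
#SBATCH --job-name=EP_108
#SBATCH --partition=compute
#SBATCH --nodes=10
# Assumed index range; the %2 suffix means at most 2 array tasks run at once.
#SBATCH --array=2-190%2

# Slurm sets SLURM_ARRAY_TASK_ID for each array task.
# Fall back to 0 so the script can also be run outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
echo "Running array task ${task_id}"
```

With this limit, the remaining tasks stay pending with the reason `JobArrayTaskLimit`, so no additional cluster should be needed until one of the two running tasks finishes.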

cbutakoff, Apr 27 '23 14:04

Fixed, in principle, by modifying queues.conf and setting the maximum number of clusters to 2.

cbutakoff, Apr 28 '23 10:04

Autoscaling with job arrays should be fixed in 2.10.4.

arnaudfroidmont, Jan 08 '24 16:01