Limit number of nodes in flexible test to the WLM queue limits

Open victorusu opened this issue 4 years ago • 0 comments

Flexible tests fail if the number of idle nodes is greater than the limit that the workload manager imposes on the queue. As an example using SLURM, consider the following partition configuration:

$ scontrol show part
PartitionName=normal
   ...
   MaxNodes=4
   Nodes=nid0[0001-0010]
   ...

If the number of idle nodes is 6, ReFrame requests all 6 nodes and the submission fails with the following output:

...
==============================================================================
SUMMARY OF FAILURES
------------------------------------------------------------------------------
FAILURE INFO for ACheck
...
  * Stage directory: /tmp/stage/daint/gpu/builtin/ACheck
  * Node list: None
  * Job type: batch job (id=None)
...
  * Failing phase: run
...
  * Reason: spawned process error: command 'sbatch rfm_ACheck_job.sh' failed with exit code 1:
--- stdout ---
--- stdout ---
--- stderr ---
sbatch: error: Batch job submission failed: Requested node configuration is not available

--- stderr ---

IMHO, ReFrame should inspect the partition limit and apply it to the test whenever the number of idle nodes exceeds the maximum number of nodes the queue allows. In this example, it would set the number of nodes to 4 instead of 6.
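
The proposed behavior could be sketched roughly as follows. This is only an illustration, not ReFrame's actual code: the helper names `parse_max_nodes` and `clamp_flex_nodes` are hypothetical, and the sketch assumes the limit is read from the text printed by `scontrol show part`, where `MaxNodes` is either an integer or `UNLIMITED`.

```python
import re


def parse_max_nodes(scontrol_output, partition):
    """Return MaxNodes for `partition`, or None if unlimited or not found.

    Hypothetical helper: expects the text printed by `scontrol show part`.
    """
    current = None
    for line in scontrol_output.splitlines():
        name = re.search(r'PartitionName=(\S+)', line)
        if name:
            # Remember which partition the following lines belong to
            current = name.group(1)

        limit = re.search(r'\bMaxNodes=(\S+)', line)
        if limit and current == partition:
            value = limit.group(1)
            # Non-numeric values such as UNLIMITED mean "no cap"
            return int(value) if value.isdigit() else None

    return None


def clamp_flex_nodes(num_idle, max_nodes):
    """Cap the flexible node count at the partition limit, if there is one."""
    if max_nodes is None:
        return num_idle

    return min(num_idle, max_nodes)


# The partition from the example above, with 6 idle nodes
sample = """PartitionName=normal
   MaxNodes=4
   Nodes=nid0[0001-0010]
"""
limit = parse_max_nodes(sample, 'normal')
print(clamp_flex_nodes(6, limit))  # 4
```

With fewer idle nodes than the limit (say 3), the count is left unchanged, so the cap only takes effect in the failing case described here.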

victorusu, Mar 15 '21 08:03