Issue with determining number of valid nodes for num_tasks=0
With ReFrame 4.6.1, if you have a reservation and a test with num_tasks=0, the framework finds 0 available nodes. I'm invoking it with reframe -vvvvvvv -r -R -c checks/microbenchmarks/dgemm -J reservation=checkout -n dgemm_cpu -C nersc-config.py, and here's what I see with the logging turned up:
[F] Flexible node allocation requested
[CMD] 'scontrol -a show -o nodes'
[F] Total available nodes: 443
[CMD] 'scontrol -a show res checkout'
[CMD] 'scontrol -a show -o Nodes=login[01-07],nid[001000-001023,001033,001036-001037,001040-001041,001044-001045,001048-001049,001052-001053,001064-001065,001068-001069,001072-001073,001076-001077,001080-001081,001084-001085,001088-001089,001092-001093,200001-200257,200260-200261,200264-200265,200268-200269,200272-200273,200276-200277,200280-200281,200284-200285,200288-200289,200292-200293,200296-200297,200300-200301,200304-200305,200308-200309,200312-200313,200316-200317,200320-200321,200324-200325,200328-200329,200332-200333,200336-200337,200340-200341,200344-200345,200348-200349,200352-200353,200356-200357,200360-200361,200364-200365,200368-200369,200372-200373,200376-200377,200380-200381,200384-200385,200388-200389,200392-200393,200396-200397,200400-200401,200404-200405,200408-200409,200412-200413,200416-200417,200420-200421,200424-200425,200428-200429,200432-200433,200436-200437,200440-200441,200444-200445,200448-200449,200452-200453,200456-200457,200460-200461,200464-200465,200468-200469,200472-200473,200476-200477,200480-200481,200484-200485,200488-200489,200492-200493,200496-200497,200500-200501,200504-200505,200508-200509]'
[S] slurm: [F] Filtering nodes by reservation checkout: available nodes now: 0
The reservation does contain usable nodes, though not every node in it is available. Here's the list of states:
1 State=DOWN+DRAIN+MAINTENANCE+RESERVED+NOT_RESPONDING
2 State=DOWN+DRAIN+MAINTENANCE+RESERVED
5 State=DOWN+DRAIN+RESERVED+NOT_RESPONDING
1 State=DOWN+MAINTENANCE+RESERVED+NOT_RESPONDING
2 State=DOWN+MAINTENANCE+RESERVED+NOT_RESPONDING
1 State=DOWN+RESERVED+NOT_RESPONDING
1 State=DOWN+RESERVED
1 State=DOWN+RESERVED
7 State=IDLE+DRAIN+MAINTENANCE+RESERVED
2 State=IDLE+MAINTENANCE+RESERVED
45 State=IDLE+RESERVED
1 State=IDLE+RESERVED
370 State=IDLE+RESERVED
1 State=IDLE+RESERVED
1 State=IDLE
1 State=MIXED+RESERVED
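For anyone who wants to produce a similar per-state tally, here is a minimal sketch. It assumes one-line-per-node text in the style of scontrol -a show -o nodes; the sample lines, node names and the tally_states helper are made up for illustration and are not my actual output.

```python
import re
from collections import Counter

# Matches the State=... field of a one-line node record.
STATE_RE = re.compile(r"\bState=(\S+)")

def tally_states(lines):
    """Count how many node records report each exact State string."""
    counts = Counter()
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample records (not real output from the system above).
sample = [
    "NodeName=nid001000 State=IDLE+RESERVED Partitions=example",
    "NodeName=nid001001 State=IDLE+RESERVED Partitions=example",
    "NodeName=nid001002 State=DOWN+DRAIN+RESERVED+NOT_RESPONDING Partitions=example",
    "NodeName=nid001003 State=IDLE Partitions=example",
]

counts = tally_states(sample)
for state, count in sorted(counts.items()):
    print(count, f"State={state}")
```

Grouping on the State field alone gives one row per distinct state string.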
I only get a non-zero number if I add --flex-alloc-nodes=IDLE+RESERVED; with --flex-alloc-nodes=IDLE I still get zero. My understanding was that asking for IDLE should match any state that includes IDLE, but that doesn't seem to be the case. I suspect that the & (set intersection) between two node sets (at https://github.com/reframe-hpc/reframe/blob/392efbc8d9ae96754fb4af87027a77492c06f9f8/reframe/core/schedulers/slurm.py#L345) has something to do with it. From my logging, it looks like the node set is already empty before the nodes in the reservation are queried.
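To make the suspicion concrete, here is a small sketch of how this could happen. It is not ReFrame's code: the node names, the reservation membership and both matching helpers are hypothetical. It only contrasts an exact match on the state string with an "at least this flag" match, and then intersects each result with the reservation's nodes.

```python
def parse_state(state):
    """Split a Slurm-style state string such as 'IDLE+RESERVED' into flags."""
    return frozenset(state.upper().split("+"))

def matches_exactly(node_state, wanted):
    """True only if the node has exactly the requested flags, no more."""
    return parse_state(node_state) == parse_state(wanted)

def matches_at_least(node_state, wanted):
    """True if the node has all requested flags, possibly with extras."""
    return parse_state(wanted) <= parse_state(node_state)

# Hypothetical nodes and states, chosen to mirror the situation above.
node_states = {
    "nid001000": "IDLE+RESERVED",
    "nid001001": "IDLE",            # assumed to be outside the reservation
    "nid001002": "DOWN+RESERVED",
    "nid001003": "MIXED+RESERVED",
}
reservation_nodes = {"nid001000", "nid001002", "nid001003"}

exact_idle = {n for n, s in node_states.items() if matches_exactly(s, "IDLE")}
at_least_idle = {n for n, s in node_states.items() if matches_at_least(s, "IDLE")}

# Intersecting with the reservation's nodes, as an & between node sets would.
print(exact_idle & reservation_nodes)     # set()
print(at_least_idle & reservation_nodes)  # {'nid001000'}
```

Under the exact-match assumption, the only node in plain IDLE sits outside the reservation, so the intersection is empty even though the reservation holds idle nodes. Whether this is what the scheduler backend actually does is exactly what I'm unsure about.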