Issue with determining number of valid nodes for num_tasks=0

Open lagerhardt opened this issue 1 year ago • 14 comments

With 4.6.1, if you have a reservation and a test with num_tasks=0, the framework returns 0 nodes. I'm invoking the code with reframe -vvvvvvv -r -R -c checks/microbenchmarks/dgemm -J reservation=checkout -n dgemm_cpu -C nersc-config.py and here's what I see when I turn up the logging:

[F] Flexible node allocation requested
[CMD] 'scontrol -a show -o nodes'
[F] Total available nodes: 443
[CMD] 'scontrol -a show res checkout'
[CMD] 'scontrol -a show -o Nodes=login[01-07],nid[001000-001023,001033,001036-001037,001040-001041,001044-001045,001048-001049,001052-001053,001064-001065,001068-001069,001072-001073,001076-001077,001080-001081,001084-001085,001088-001089,001092-001093,200001-200257,200260-200261,200264-200265,200268-200269,200272-200273,200276-200277,200280-200281,200284-200285,200288-200289,200292-200293,200296-200297,200300-200301,200304-200305,200308-200309,200312-200313,200316-200317,200320-200321,200324-200325,200328-200329,200332-200333,200336-200337,200340-200341,200344-200345,200348-200349,200352-200353,200356-200357,200360-200361,200364-200365,200368-200369,200372-200373,200376-200377,200380-200381,200384-200385,200388-200389,200392-200393,200396-200397,200400-200401,200404-200405,200408-200409,200412-200413,200416-200417,200420-200421,200424-200425,200428-200429,200432-200433,200436-200437,200440-200441,200444-200445,200448-200449,200452-200453,200456-200457,200460-200461,200464-200465,200468-200469,200472-200473,200476-200477,200480-200481,200484-200485,200488-200489,200492-200493,200496-200497,200500-200501,200504-200505,200508-200509]'
[S] slurm: [F] Filtering nodes by reservation checkout: available nodes now: 0

There are available nodes in the reservation, though not all of them are available. Here's the list of states:

1 State=DOWN+DRAIN+MAINTENANCE+RESERVED+NOT_RESPONDING
2 State=DOWN+DRAIN+MAINTENANCE+RESERVED
5 State=DOWN+DRAIN+RESERVED+NOT_RESPONDING
1 State=DOWN+MAINTENANCE+RESERVED+NOT_RESPONDING
2 State=DOWN+MAINTENANCE+RESERVED+NOT_RESPONDING
1 State=DOWN+RESERVED+NOT_RESPONDING
1 State=DOWN+RESERVED
1 State=DOWN+RESERVED
7 State=IDLE+DRAIN+MAINTENANCE+RESERVED
2 State=IDLE+MAINTENANCE+RESERVED
45 State=IDLE+RESERVED
1 State=IDLE+RESERVED
370 State=IDLE+RESERVED
1 State=IDLE+RESERVED
1 State=IDLE
1 State=MIXED+RESERVED
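
For anyone wanting to reproduce a tally like this one, here is a minimal sketch that counts State= values in one-line-per-node output of the kind scontrol show -o nodes prints. The function name tally_states and the sample lines are made up for illustration; they are not real output from this system.

```python
import re
from collections import Counter


def tally_states(scontrol_output: str) -> Counter:
    """Count nodes per State= value, one node per input line."""
    counts = Counter()
    for line in scontrol_output.splitlines():
        # \b keeps this from matching fields such as NextState=
        match = re.search(r"\bState=(\S+)", line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines (assumption, for illustration only).
sample = "\n".join([
    "NodeName=nid000001 State=IDLE+RESERVED Partitions=regular",
    "NodeName=nid000002 State=IDLE+RESERVED Partitions=regular",
    "NodeName=nid000003 State=DOWN+DRAIN+RESERVED Partitions=regular",
    "NodeName=nid000004 State=IDLE Partitions=regular",
])

for state, count in tally_states(sample).most_common():
    print(count, f"State={state}")
```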

I can only get a non-zero number if I add --flex-alloc-nodes=IDLE+RESERVED; I still get zero with --flex-alloc-nodes=IDLE. My understanding was that asking for IDLE was supposed to match any of these states, but that doesn't seem to be the case. I suspect that the & between two node sets (at https://github.com/reframe-hpc/reframe/blob/392efbc8d9ae96754fb4af87027a77492c06f9f8/reframe/core/schedulers/slurm.py#L345) might have something to do with it. From my logging, it looks like the node set is already empty before the nodes in the reservation are queried.
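
To make the suspicion concrete, here is a small sketch of two possible matching policies followed by an intersection with the reservation's nodes. This is not ReFrame's actual code: the function names, the node names, and both policies are assumptions chosen only to illustrate the behaviour described above. Under an exact-match policy, IDLE selects only nodes whose state is exactly IDLE, so the intersection with a reservation whose usable nodes are all IDLE+RESERVED comes out empty.

```python
def match_exact(node_state: str, wanted: str) -> bool:
    """True only if the node has exactly the requested state flags."""
    return set(node_state.split("+")) == set(wanted.split("+"))


def match_contains(node_state: str, wanted: str) -> bool:
    """True if the node has at least the requested state flags."""
    return set(wanted.split("+")) <= set(node_state.split("+"))


def filter_nodes(nodes: dict, wanted: str, matcher) -> set:
    """Return the names of the nodes whose state satisfies the matcher."""
    return {name for name, state in nodes.items() if matcher(state, wanted)}


# Hypothetical node states and reservation membership (assumption).
all_nodes = {
    "nid000001": "IDLE+RESERVED",
    "nid000002": "IDLE+RESERVED",
    "nid000003": "DOWN+DRAIN+RESERVED",
    "nid000004": "IDLE",  # idle, but outside the reservation
}
reservation_nodes = {"nid000001", "nid000002", "nid000003"}

# State filter first, then the & with the reservation's node set.
exact_idle = filter_nodes(all_nodes, "IDLE", match_exact) & reservation_nodes
exact_idle_reserved = (
    filter_nodes(all_nodes, "IDLE+RESERVED", match_exact) & reservation_nodes
)
contains_idle = filter_nodes(all_nodes, "IDLE", match_contains) & reservation_nodes

print(sorted(exact_idle))           # []
print(sorted(exact_idle_reserved))  # ['nid000001', 'nid000002']
print(sorted(contains_idle))        # ['nid000001', 'nid000002']
```

Note that a "contains" policy would also accept states such as IDLE+DRAIN+MAINTENANCE+RESERVED, which may not be what is wanted either, so this only shows why the two requests can give different results.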

lagerhardt • Jun 18 '24 20:06