Monitor classes for SLURM, PBS, and LSF

Open wlandau opened this issue 1 year ago • 17 comments

Prework

  • [x] Read and agree to the Contributor Code of Conduct and contributing guidelines.
  • [x] If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • [x] New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a GitHub discussion.
  • [x] Format your code according to the tidyverse style guide.

Proposal

crew.cluster 0.2.0 supports a new "monitor" class to help list and terminate SGE jobs from R instead of the command line. https://wlandau.github.io/crew.cluster/index.html#monitoring shows an example using crew_monitor_sge():

monitor <- crew_monitor_sge() # https://wlandau.github.io/crew.cluster/reference/crew_monitor_sge.html
job_list <- monitor$jobs()
job_list
#> # A tibble: 2 × 9
#>   job_number prio    name    owner state start_time queue_name jclass_name slots
#>   <chr>      <chr>   <chr>   <chr> <chr> <chr>      <chr>      <lgl>       <chr>
#> 1 131853812  0.05000 crew-m… USER… r     2024-01-0… all.norma… NA          1    
#> 2 131853813  0.05000 crew-m… USER… r     2024-01-0… all.norma… NA          1
monitor$terminate(jobs = job_list$job_number)
#> USER has registered the job 131853812 for deletion
#> USER has registered the job 131853813 for deletion
monitor$jobs()
#> data frame with 0 columns and 0 rows

Currently only SGE is supported. I would like to add other monitor classes for other clusters, but I do not have access to SLURM, PBS, or LSF. cc'ing @nviets, @brendanf, and/or @mglev1n, in case you are interested.
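
For concreteness, a SLURM monitor could mirror the SGE interface above. The snippet below is a purely hypothetical usage sketch: crew_monitor_slurm() is not implemented yet, and the column name is an assumption.

# Hypothetical sketch only: crew_monitor_slurm() does not exist yet,
# and the job_id column name is assumed for illustration.
monitor <- crew_monitor_slurm()
job_list <- monitor$jobs()                 # e.g. built from squeue output
monitor$terminate(jobs = job_list$job_id)  # e.g. a thin wrapper around scancel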

wlandau avatar Jan 08 '24 19:01 wlandau

Hi @wlandau - just to confirm my understanding: you're proposing we add, for instance, crew_monitor_slurm() and all the related bits, following crew_monitor_sge.R?

nviets avatar Jan 09 '24 16:01 nviets

Yes, exactly! On SGE, the hardest part for me was parsing job status information. I had to dig into the XML because the non-XML output from qstat is not machine-readable. Other than that, we would just use SLURM's commands instead of qstat/qdel. The R6 boilerplate should be a simple copy/paste.
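
For reference, the SGE listing logic boils down to something like the sketch below (simplified relative to the real crew.cluster code, and the XML element names may differ slightly across SGE flavors):

# Simplified sketch of listing SGE jobs via `qstat -xml` with xml2.
# Element names like JB_job_number may vary by SGE flavor.
xml <- xml2::read_xml(
  paste(
    system2("qstat", args = c("-xml", "-u", shQuote(Sys.info()[["user"]])), stdout = TRUE),
    collapse = "\n"
  )
)
jobs <- xml2::xml_find_all(xml, "//job_list")
tibble::tibble(
  job_number = xml2::xml_text(xml2::xml_find_first(jobs, "JB_job_number")),
  name = xml2::xml_text(xml2::xml_find_first(jobs, "JB_name")),
  state = xml2::xml_text(xml2::xml_find_first(jobs, "state"))
)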

wlandau avatar Jan 09 '24 17:01 wlandau

The R6 boilerplate should be a simple copy/paste.

Actually, first I would like to simplify this part by creating a common abstract parent class for all the monitors to inherit from...

wlandau avatar Jan 09 '24 17:01 wlandau

I'll give some thought to SLURM. There are the usual SLURM commands (squeue, scancel, etc.) whose output we could parse, but there's also a database (optional, and typically used in larger installations) that could be queried. Maybe the former is better, at least in the short term, since not everyone will have the database.
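
The termination side at least is simple. A minimal sketch, assuming we only need to hand scancel a list of job IDs:

# Minimal sketch: terminate SLURM jobs by ID with scancel.
terminate_slurm <- function(jobs) {
  system2("scancel", args = shQuote(as.character(jobs)), stdout = FALSE, stderr = FALSE)
}
# e.g. terminate_slurm(c("20504876", "20504877"))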

nviets avatar Jan 09 '24 19:01 nviets

Thanks for looking into this! In the end I would prefer something that all/most SLURM users would be able to use.

By the way, as of 8cf036bf95be4dd0a99cab34eb43fda7fa6fda52 I created a parent monitor class that all cluster-specific monitors inherit from: https://github.com/wlandau/crew.cluster/blob/main/R/crew_monitor_cluster.R. This helps reduce duplicated code/docs. The SGE monitor is much shorter now and easy to copy: https://github.com/wlandau/crew.cluster/blob/main/R/crew_monitor_sge.R. Tests are at https://github.com/wlandau/crew.cluster/blob/main/tests/testthat/test-crew_monitor_sge.R and https://github.com/wlandau/crew.cluster/blob/main/tests/sge/monitor.R.
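
In broad strokes, the structure looks something like the sketch below (the class and field names here are simplified guesses for illustration; see the linked files for the real definitions):

# Rough sketch of the inheritance structure, not the actual definitions.
crew_class_monitor_cluster <- R6::R6Class(
  classname = "crew_class_monitor_cluster",
  private = list(.verbose = NULL),
  public = list(
    initialize = function(verbose = FALSE) {
      private$.verbose <- verbose
    }
  )
)

crew_class_monitor_sge <- R6::R6Class(
  classname = "crew_class_monitor_sge",
  inherit = crew_class_monitor_cluster,
  public = list(
    jobs = function() {
      # parse `qstat -xml` output into a tibble here
    },
    terminate = function(jobs) {
      # call `qdel` on the given job numbers here
    }
  )
)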

wlandau avatar Jan 09 '24 19:01 wlandau

To make sure I understand, the monitor is only for interactive use? So the data.frame which is output by jobs() does not need to have any particular column names?

brendanf avatar Feb 20 '24 06:02 brendanf

There are two options for squeue that I am aware of. The first is to parse the standard output, which is a fixed-width table (the columns and widths can optionally be specified with the -o or -O options if we don't trust that the defaults are the same for all users):

# this is the default format given in `man squeue`, but specify it
# in case some user's configuration is different
default_format <- "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
user <- ps::ps_username()
text <- system2(
  "squeue",
  args = shQuote(c("-u", user, "-o", default_format)),
  stdout = TRUE,
  stderr = if_any(private$.verbose, "", FALSE), # if_any() and private$.verbose come from the crew.cluster monitor class
  wait = TRUE
)
con <- textConnection(text)
out <- read.fwf(
  con,
  widths = c(18, -1, 9, -1, 8, -1, 8, -1,  2, -1, 10, -1, 6, -1, 100),
  skip = 1,
  col.names = c("JOBID", "PARTITION", "NAME", "USER", "ST", "TIME", "NODES", "NODELIST_REASON"),
  strip.white = TRUE
)
tibble::as_tibble(out)
# A tibble: 7 × 8
#     JOBID PARTITION NAME     USER     ST    TIME  NODES NODELIST_REASON
#     <int> <chr>     <chr>    <chr>    <chr> <chr> <int> <chr>          
#1 20504876 small     crew-Opt brfurnea R     52:46     1 r18c36         
#2 20504877 small     crew-Opt brfurnea R     52:46     1 r18c23         
#3 20504863 small     crew-Opt brfurnea R     52:50     1 r18c41         
#4 20504851 small     crew-Opt brfurnea R     53:06     1 r18c33         
#5 20504854 small     crew-Opt brfurnea R     53:06     1 r18c35         
#6 20504857 small     crew-Opt brfurnea R     53:06     1 r18c40         
#7 20504848 small     OptimOTU brfurnea R     53:35     1 r18c43

The second option is squeue --yaml, which gives a full dump of the entire queue. Arguments like -u do nothing to filter the output, so the monitor would have to do the filtering itself. Especially on a big cluster, this is a lot of data:

text <- system2("squeue", args = shQuote("--yaml"), stdout = TRUE, stderr = FALSE, wait = TRUE)
length(text)
# [1] 269314

This is partly because there are a lot of jobs, but also because the dump includes all possible fields, more than 100 per job.

My feeling is that option 1 is the way to go, even though the fixed-width output may truncate some values (for instance, NAME above).
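
If truncation matters, one mitigation (an untested sketch; the field names and width syntax belong to squeue's -O/--Format option) would be to request wider columns explicitly:

# Untested sketch: ask squeue for wider columns so long names are not cut.
# -O/--Format accepts a field:width syntax; read.fwf widths would then need to match.
wide_format <- "jobid:18,partition:12,name:64,username:12,statecompact:4,timeused:12,numnodes:7,reasonlist:40"
text <- system2(
  "squeue",
  args = shQuote(c("-u", ps::ps_username(), "-O", wide_format)),
  stdout = TRUE,
  wait = TRUE
)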

brendanf avatar Feb 20 '24 07:02 brendanf

That's a tough choice, and it's a shame that the more structured YAML-based output is so large. How large exactly, in terms of output size and execution time? I am concerned that subtle variations from cluster to cluster, and odd things like spaces in job names, could interfere with parsing the standard output.

wlandau avatar Feb 21 '24 13:02 wlandau

On my cluster, squeue --yaml returned 11 MB in 0.7 seconds. Parsing the result with yaml::read_yaml() took about 1.4 seconds. At the time of my test there were 2166 jobs in the queue. If it's only going to be used interactively, that's probably acceptable, but I certainly would not want to call it often in a script.

brendanf avatar Feb 21 '24 16:02 brendanf

Yeah, monitor objects are just for interactive use. I think those performance metrics are not terrible as long as the documentation gives the user a heads up.

wlandau avatar Feb 21 '24 16:02 wlandau

The yaml queue dump includes 111 fields for each job, some of which are themselves structured; e.g., one field is "job_resources", which looks like this:

job_resources
job_resources$nodes
[1] "r15c35"

job_resources$allocated_cores
[1] 6

job_resources$allocated_hosts
[1] 1

job_resources$allocated_nodes
job_resources$allocated_nodes[[1]]
job_resources$allocated_nodes[[1]]$sockets
job_resources$allocated_nodes[[1]]$sockets$`0`
job_resources$allocated_nodes[[1]]$sockets$`0`$cores
job_resources$allocated_nodes[[1]]$sockets$`0`$cores$`6`
[1] "allocated"

job_resources$allocated_nodes[[1]]$sockets$`0`$cores$`7`
[1] "allocated"

job_resources$allocated_nodes[[1]]$sockets$`0`$cores$`8`
[1] "allocated"

job_resources$allocated_nodes[[1]]$sockets$`0`$cores$`9`
[1] "allocated"

job_resources$allocated_nodes[[1]]$sockets$`0`$cores$`10`
[1] "allocated"

job_resources$allocated_nodes[[1]]$sockets$`0`$cores$`11`
[1] "allocated"

job_resources$allocated_nodes[[1]]$nodename
[1] "r15c35"

job_resources$allocated_nodes[[1]]$cpus_used
[1] 6

job_resources$allocated_nodes[[1]]$memory_used
[1] 12288

job_resources$allocated_nodes[[1]]$memory_allocated
[1] 12288

brendanf avatar Feb 21 '24 17:02 brendanf

This code approximately recreates the default squeue output. I substituted start time for elapsed time, because the yaml does not actually include elapsed time, and computing it from the start time would invite situations where, e.g., I am using UTC while SLURM is configured to use local time or vice versa.

library(purrr) # for map() and %||%

user <- ps::ps_username()
monitor_cols <- c(
  "job_id", "partition", "name", "user_name", "job_state",
  "start_time", "node_count", "state_reason"
)
text <- system2(
  "squeue",
  args = "--yaml",
  stdout = TRUE,
  # stderr = if_any(private$.verbose, "", FALSE), # inside the monitor class
  wait = TRUE
)
yaml <- yaml::read_yaml(text = text)
# Keep only the columns of interest, plus a comma-separated node list.
out <- map(
  yaml$jobs,
  ~ tibble::new_tibble(
    c(
      map(.x[monitor_cols], ~ unlist(.x) %||% NA),
      list(nodes = paste(unlist(.x$job_resources$nodes), collapse = ",") %||% NA)
    )
  )
)
out <- do.call(vctrs::vec_rbind, out)
out <- out[out$user_name == user, ]
out$start_time <- as.POSIXct(out$start_time, origin = "1970-01-01")
out

# A tibble: 14 × 9
     job_id partition name    user_name job_state start_time          node_count
      <int> <chr>     <chr>   <chr>     <chr>     <dttm>                   <int>
 1 20386512 longrun   R_Moth… guilbaul  RUNNING   2024-02-09 09:05:33          1
 2 20386513 longrun   R_Moth… guilbaul  RUNNING   2024-02-09 09:05:33          1
 3 20386514 longrun   R_Moth… guilbaul  RUNNING   2024-02-09 09:05:33          1
 4 20386515 longrun   R_Moth… guilbaul  RUNNING   2024-02-09 09:05:33          1
 5 20386516 longrun   R_Moth… guilbaul  RUNNING   2024-02-09 09:05:33          1
 6 20386517 longrun   R_Moth… guilbaul  RUNNING   2024-02-09 09:05:33          1
 7 20386509 longrun   R_Moth… guilbaul  RUNNING   2024-02-09 09:05:33          1
 8 20446032 longrun   R_Moth… guilbaul  RUNNING   2024-02-14 09:27:25          1
 9 20446033 longrun   R_Moth… guilbaul  RUNNING   2024-02-14 09:27:25          1
10 20446034 longrun   R_Moth… guilbaul  RUNNING   2024-02-14 09:27:25          1
11 20446035 longrun   R_Moth… guilbaul  RUNNING   2024-02-14 09:27:25          1
12 20446036 longrun   R_Moth… guilbaul  RUNNING   2024-02-14 09:27:25          1
13 20446037 longrun   R_Moth… guilbaul  RUNNING   2024-02-14 09:27:25          1
14 20446004 longrun   R_Moth… guilbaul  RUNNING   2024-02-14 09:27:25          1
# ℹ 2 more variables: state_reason <chr>, nodes <chr>

brendanf avatar Feb 22 '24 11:02 brendanf

Nice! Got time for a PR?

wlandau avatar Feb 22 '24 19:02 wlandau

Sorry, I was pulled away from this thread by work. The yaml option looks like a much better approach than parsing the default squeue output, but I think it requires an extra plugin and a minimum SLURM version. It would be worth adding a warning or something. See: "Why am I getting the following error: 'Unable to find plugin: serializer/json'?".
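
A simple way to detect support (a sketch; it still makes squeue generate the full dump once before we discard it) would be to check the exit status and fall back to fixed-width parsing:

# Sketch: detect whether this squeue build supports --yaml (needs the
# JSON/YAML serializer plugin and a recent enough SLURM version).
status <- suppressWarnings(
  system2("squeue", args = "--yaml", stdout = FALSE, stderr = FALSE)
)
if (!identical(status, 0L)) {
  warning("squeue --yaml unavailable; falling back to fixed-width parsing")
}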

nviets avatar Feb 23 '24 20:02 nviets

It looks like the LSF job output can similarly be parsed using either the fixed-width table or JSON (see the example below); the JSON route would add a jsonlite dependency:

user <- ps::ps_username()
text <- system2(
    "bjobs",
    args = c("-o 'user jobid job_name stat queue slots mem start_time run_time'", "-json"),
    stdout = TRUE,
    # stderr = if_any(private$.verbose, "", FALSE), # inside the monitor class
    wait = TRUE
)
json <- jsonlite::fromJSON(text)
out <- json$RECORDS
out

     USER    JOBID JOB_NAME STAT               QUEUE SLOTS         MEM   START_TIME         RUN_TIME
1 mglevin 25900189     bash  RUN voltron_interactive     1    8 Mbytes Feb 29 09:12    313 second(s)
2 mglevin 25900201     bash  RUN voltron_interactive     1    2 Mbytes Feb 29 09:17     22 second(s)
3 mglevin 25665912  rstudio  RUN     voltron_rstudio     2 87.9 Gbytes Feb 26 15:36 236482 second(s)
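
Termination would presumably be a thin wrapper around bkill; a minimal sketch:

# Sketch: terminate LSF jobs by ID with bkill.
terminate_lsf <- function(jobs) {
  system2("bkill", args = shQuote(as.character(jobs)), stdout = FALSE, stderr = FALSE)
}
# e.g. terminate_lsf(c("25900189", "25900201"))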

mglev1n avatar Feb 29 '24 14:02 mglev1n

Awesome! jsonlite is super lightweight and reliable, so I don't mind it as a dependency.

Would you be willing to open a PR?

wlandau avatar Feb 29 '24 18:02 wlandau

Just here to say hi. It's still early days for me with {crew}, but I'm excited to learn. I have access to SLURM and PBS, and I'm reading along.

mdsumner avatar Aug 01 '24 07:08 mdsumner