
Any interest in integrating PSI/J HPC cluster job management into redun?

Open · tesujimath opened this issue 7 months ago · 2 comments

We built a fairly sophisticated pipeline using Redun, making use of a Slurm compute cluster via PSI/J.

The PSI/J stuff was fairly well encapsulated in a single Python module called cluster_executor.

Are you interested in getting this extracted from our codebase and integrated into Redun?

Overview of cluster executor

It uses PSI/J rather than, say, Dask. I initially tried Dask but rejected it because of its job submission model. Briefly, Dask runs a generic worker process on the compute cluster and hands jobs off to it. This is somewhat convenient, except that it obscures which programs are actually running. The point is that tuning compute cluster resource requests (CPUs and memory) requires distinguishing which program each job runs. For example, the resources required to run, say, bcl_convert are very different from those required for, say, fastqc, so we need these to appear as entirely separate jobs in the accounting records.
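To make the accounting point concrete, here is a rough sketch in plain Python of a per-tool resource table. The tool names are from the thread, but the figures and the `job_request` helper are invented for illustration, not anything from our actual codebase:

```python
# Hypothetical per-tool resource requests; the numbers are illustrative only.
# Submitting each tool as its own named job means scheduler accounting
# (e.g. Slurm's sacct) can report usage per program, which a shared
# generic worker process would obscure.
RESOURCES = {
    "bcl_convert": {"cpus": 16, "memory_gb": 64},
    "fastqc": {"cpus": 2, "memory_gb": 4},
}

def job_request(tool: str) -> dict:
    """Build a scheduler-agnostic job request for one tool."""
    res = RESOURCES[tool]
    return {
        "name": tool,  # appears as the job name in accounting records
        "cpus": res["cpus"],
        "memory_gb": res["memory_gb"],
    }
```

With a per-job name like this, `sacct` can attribute CPU and memory usage to `bcl_convert` and `fastqc` separately, which is exactly what a generic worker model loses.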

It uses Jsonnet to define the job resources required. Jsonnet provides nice abstractions for not repeating common configuration between jobs, which you can see in our use of it here. There are no Slurm specifics in the code; they are all encapsulated in the Jsonnet configuration, including the choice of Slurm as the executor. So far, however, this has only been used with Slurm.
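To illustrate the kind of abstraction Jsonnet gives you here (the field names and values below are made up for this sketch, not our actual schema), shared settings can be defined once and overridden per job using object composition:

```jsonnet
// Hypothetical schema: shared defaults with per-job overrides.
local defaults = {
  executor: 'slurm',   // the only place the scheduler is named
  cpus: 2,
  memory: '4G',
  time: '1:00:00',
};

{
  bcl_convert: defaults { cpus: 16, memory: '64G' },
  fastqc: defaults { time: '30:00' },
}
```

`defaults { ... }` is Jsonnet's inheritance syntax: each job gets all of `defaults`, with only the listed fields overridden.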

There are two main submission functions: run_job_1, which has a simplified interface and expects a single output file, and run_job_n, which is more general and caters for multiple output files, given as expected paths and globs.
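To sketch the multi-output idea, here is what collecting both expected paths and glob patterns after a job finishes could look like. This is not the package's actual interface; the helper and its signature are invented for illustration:

```python
from glob import glob
from pathlib import Path

def collect_outputs(expected: list[str], patterns: list[str]) -> list[Path]:
    """Gather a job's outputs: fail fast on missing expected paths, then
    expand glob patterns for outputs whose names aren't known up front.
    (Illustrative helper, not the redun PSI/J executor's real API.)"""
    paths = [Path(p) for p in expected]
    missing = [p for p in paths if not p.exists()]
    if missing:
        raise FileNotFoundError(f"expected outputs missing: {missing}")
    for pat in patterns:
        paths.extend(Path(p) for p in sorted(glob(pat)))
    return paths
```

Failing fast on the fixed paths while tolerating empty globs matches the split between outputs a job must produce and outputs it may produce in variable numbers.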

(There is also an interface for submitting jobs and catching failure immediately, rather than leaning on Redun's exception handling and recovery.)

I understand that some adaptation would be required to incorporate this into Redun, but it seems worth raising awareness of what we did and asking the question here.
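For anyone unfamiliar with PSI/J, a minimal submission looks roughly like this. This is a sketch based on the psij-python documentation, not code from our executor; the executable, arguments, resource figures, and scheduler name are placeholders:

```python
# Sketch of direct PSI/J usage (requires psij-python and a scheduler).
from psij import Job, JobExecutor, JobSpec, ResourceSpecV1

# "slurm" could equally be "local", "pbs", ... — PSI/J abstracts the scheduler.
executor = JobExecutor.get_instance("slurm")

job = Job(
    JobSpec(
        executable="fastqc",              # placeholder program
        arguments=["sample.fastq.gz"],    # placeholder arguments
        resources=ResourceSpecV1(cpu_cores_per_process=2),
    )
)
executor.submit(job)
job.wait()  # blocks until the scheduler reports completion
```

The appeal for a Redun executor is that only the string passed to `get_instance` changes between schedulers.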

tesujimath avatar May 21 '25 03:05 tesujimath

Thanks for sharing this! I didn't know about PSI/J Python. That is very handy. I've been interested in creating an HPC Executor for some time, but interacting directly with such systems seemed a bit daunting. If that Python library already abstracts over several HPC systems and gives a higher-level API, that could be just the trick.

I will take a look at your approach and get back to you. Some of our other Executors were implemented as collaborations (GoogleBatchExecutor, K8SExecutor).

mattrasmus avatar May 23 '25 00:05 mattrasmus

I extracted the PSI/J executor as a separate Python package, available on PyPI or as a Nix flake. Perhaps that's enough at least until others start using it?

tesujimath avatar Nov 10 '25 21:11 tesujimath