Add SLURM script for launching multi-node Ray clusters with Singularity
Description
This PR adds a sample SLURM script and accompanying documentation for running NeMo Curator pipelines on a multi-node Ray cluster using Singularity / Apptainer.
Specifically, it:
- Adds `ray-singularity-sbatch.sh`, a generic SLURM batch script that:
  - Starts a Ray head on the first SLURM node and Ray workers on the remaining nodes.
  - Runs a user-provided Python command inside a NeMo Curator container on the head node.
  - Supports both Singularity and Apptainer via a `CONTAINER_CMD` knob.
  - Is safe for air-gapped clusters by default via `HF_HUB_OFFLINE=1`.
- Adds a README documenting:
  - Prerequisites (NeMo Curator container, SLURM, Singularity/Apptainer).
  - How the script works and how to customize SBATCH directives.
  - All relevant environment knobs (ports, HF cache, scratch paths, mounts, etc.).
  - Example usage patterns for NeMo Curator pipelines.
No existing code paths are modified; this is an example script + documentation intended to make it easier for users to run NeMo Curator on SLURM-based HPC systems.
Similar to #1168, but for SLURM clusters that use Singularity and have no internet access on the compute nodes.
Usage
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna.executor import XennaExecutor

# Define your pipeline
pipeline = Pipeline(...)
pipeline.add_stage(...)

# Use the XennaExecutor to run on the Ray cluster started by the sbatch script
executor = XennaExecutor()
results = pipeline.run(executor=executor)
```
On the SLURM side, the corresponding submission looks like:
```bash
export IMAGE=/path/to/nemo-curator_25.09.sif

RUN_COMMAND="python curator_pipeline.py" \
  sbatch --nodes=2 --gres=gpu:4 ray-singularity-sbatch.sh
```
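The same submission can also override the script's environment knobs inline. A hedged example follows; the knob names (`CONTAINER_CMD`, `GCS_PORT`, `DASH_PORT`, `CLIENT_PORT`) are taken from the README and the sequence diagram below, while the values shown and the assumption that `CONTAINER_CMD` takes the container binary name are illustrative — check the script itself for the authoritative list and defaults.

```bash
export IMAGE=/path/to/nemo-curator_25.09.sif

# Use Apptainer instead of Singularity and move the Ray ports
# (example values only; defaults live in ray-singularity-sbatch.sh).
CONTAINER_CMD=apptainer \
GCS_PORT=6380 \
DASH_PORT=8266 \
CLIENT_PORT=10002 \
RUN_COMMAND="python curator_pipeline.py" \
  sbatch --nodes=2 --gres=gpu:4 ray-singularity-sbatch.sh
```

Because `sbatch` exports the submitting environment to the job by default, variables set on the command line this way are visible inside the batch script.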
Checklist
- [X] I am familiar with the Contributing Guide.
- [X] New or Existing tests cover these changes.
- [X] The documentation is up to date with these changes.
Greptile Overview
Greptile Summary
Added deployment script and documentation for launching multi-node Ray clusters on SLURM systems using Singularity/Apptainer containers.
- Introduced `ray-singularity-sbatch.sh`: a configurable SLURM batch script that starts Ray head and worker nodes in containers
- Added a comprehensive README documenting setup, configuration options, and usage patterns
- Script supports air-gapped clusters by default (`HF_HUB_OFFLINE=1`)
- Provides resource auto-detection and flexible environment variable configuration
- Includes proper cleanup handling via trap for temporary directories
Minor syntax improvements suggested: quote path variables to handle edge cases such as paths containing spaces.
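For reference, the kind of quoting being suggested looks like the snippet below. This is an illustrative sketch, not the actual diff; the variable names are hypothetical stand-ins for the script's temporary-directory paths.

```bash
# Fragile if the path contains spaces or glob characters:
rm -rf $RAY_TMP_DIR

# Robust: quote expansions and keep the cleanup in a trap
cleanup() {
  rm -rf -- "${RAY_TMP_DIR}" "${RAY_SPILL_DIR}"
}
trap cleanup EXIT INT TERM
```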
Confidence Score: 4/5
- This PR is safe to merge with minimal risk - adds isolated deployment tooling without modifying existing code
- Score reflects that this is a self-contained addition (new files only) with comprehensive documentation and follows shell scripting best practices. Only minor quoting issues found that should be addressed for robustness with paths containing spaces. No functional logic errors or security issues detected.
- No files require special attention - only minor syntax improvements suggested for edge case handling
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| tutorials/deployment/slurm/ray-singularity-sbatch.sh | 4/5 | Well-structured SLURM script that launches Ray clusters with Singularity/Apptainer, includes proper resource detection and cleanup handling |
| tutorials/deployment/slurm/README.md | 5/5 | Comprehensive documentation covering prerequisites, configuration, usage patterns, and troubleshooting for the SLURM Ray cluster script |
Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant SLURM
    participant Script as ray-singularity-sbatch.sh
    participant HeadNode as Head Node (Singularity)
    participant WorkerNodes as Worker Nodes (Singularity)
    participant Ray as Ray Cluster
    User->>SLURM: sbatch with RUN_COMMAND
    SLURM->>Script: Allocate nodes & launch script
    Script->>Script: Parse SLURM_JOB_NODELIST
    Script->>Script: Create temp directories (ray_tmp, ray_workers_tmp, ray_spill)
    Script->>Script: Detect resources (CPUs, GPUs)
    Script->>Script: Configure environment & mounts
    Script->>HeadNode: srun - Start Ray head on node[0]
    HeadNode->>Ray: ray start --head (GCS_PORT, DASH_PORT, CLIENT_PORT)
    Note over HeadNode,Ray: Wait HEAD_STARTUP_WAIT seconds
    loop For each worker node[1..N]
        Script->>WorkerNodes: srun - Start Ray worker
        WorkerNodes->>Ray: ray start --address HEAD_IP:GCS_PORT
    end
    Note over WorkerNodes,Ray: Wait WORKER_STARTUP_WAIT seconds
    Script->>HeadNode: srun - Execute RUN_COMMAND
    HeadNode->>Ray: Python script connects to Ray cluster
    Ray-->>HeadNode: Execute pipeline tasks
    HeadNode-->>Script: Command completes
    Script->>Script: Cleanup trap fires (EXIT/INT/TERM)
    Script->>Script: Remove temp directories
    Script-->>SLURM: Job complete
    SLURM-->>User: Results in output files
```
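In shell terms, the flow above boils down to roughly the following sketch. It is a simplified approximation written against the variable names in the diagram (`CONTAINER_CMD`, `IMAGE`, `GCS_PORT`, `DASH_PORT`, `CLIENT_PORT`, `HEAD_STARTUP_WAIT`, `WORKER_STARTUP_WAIT`, `RUN_COMMAND`); the actual script in this PR additionally handles resource detection, mounts, offline HF caching, and the cleanup trap.

```bash
#!/bin/bash
# Simplified sketch of the launch flow shown in the diagram above.
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# 1. Start the Ray head inside the container on the first node
srun --nodes=1 --ntasks=1 -w "$head_node" \
  "$CONTAINER_CMD" exec --nv "$IMAGE" \
  ray start --head --port="$GCS_PORT" --dashboard-port="$DASH_PORT" \
            --ray-client-server-port="$CLIENT_PORT" --block &
sleep "$HEAD_STARTUP_WAIT"

# 2. Start a Ray worker on every remaining node
for node in "${nodes[@]:1}"; do
  srun --nodes=1 --ntasks=1 -w "$node" \
    "$CONTAINER_CMD" exec --nv "$IMAGE" \
    ray start --address="$head_ip:$GCS_PORT" --block &
done
sleep "$WORKER_STARTUP_WAIT"

# 3. Run the user command on the head node; the cleanup trap fires on exit
srun --nodes=1 --ntasks=1 -w "$head_node" \
  "$CONTAINER_CMD" exec --nv "$IMAGE" bash -c "$RUN_COMMAND"
```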