
Add SLURM script for launching multi-node Ray clusters with Singularity

Open federico-dambrosio opened this pull request 1 month ago • 2 comments

Description

This PR adds a sample SLURM script and accompanying documentation for running NeMo Curator pipelines on a multi-node Ray cluster using Singularity / Apptainer.

Specifically, it:

  • Adds ray-singularity-sbatch.sh, a generic SLURM batch script that:
    • Starts a Ray head on the first SLURM node and Ray workers on the remaining nodes.
    • Runs a user-provided Python command inside a NeMo Curator container on the head node.
    • Supports both Singularity and Apptainer via a CONTAINER_CMD knob.
    • Is safe for air-gapped clusters by default via HF_HUB_OFFLINE=1.
  • Adds a README documenting:
    • Prerequisites (NeMo Curator container, SLURM, Singularity/Apptainer).
    • How the script works and how to customize SBATCH directives.
    • All relevant environment knobs (ports, HF cache, scratch paths, mounts, etc.).
    • Example usage patterns for NeMo Curator pipelines.

No existing code paths are modified; this is an example script + documentation intended to make it easier for users to run NeMo Curator on SLURM-based HPC systems.
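
For reference, the core launch pattern described above looks roughly like the following sketch. It is illustrative only: it assumes the standard scontrol, srun, singularity/apptainer, and ray CLIs, and the variable names (IMAGE, RUN_COMMAND, CONTAINER_CMD, GCS_PORT, HEAD_STARTUP_WAIT, WORKER_STARTUP_WAIT) follow the knobs mentioned in this PR rather than the script's exact contents.

# Illustrative sketch of the head/worker launch pattern; not the actual script.
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

CONTAINER_CMD=${CONTAINER_CMD:-singularity}   # or "apptainer"

# Start the Ray head on the first allocated node.
srun --nodes=1 --ntasks=1 -w "$head_node" \
  "$CONTAINER_CMD" exec --nv "$IMAGE" \
  ray start --head --port="$GCS_PORT" --block &
sleep "$HEAD_STARTUP_WAIT"

# Start a Ray worker on each remaining node.
for node in "${nodes[@]:1}"; do
  srun --nodes=1 --ntasks=1 -w "$node" \
    "$CONTAINER_CMD" exec --nv "$IMAGE" \
    ray start --address="$head_ip:$GCS_PORT" --block &
done
sleep "$WORKER_STARTUP_WAIT"

# Finally, run the user-provided command on the head node.
srun --nodes=1 --ntasks=1 -w "$head_node" \
  "$CONTAINER_CMD" exec --nv "$IMAGE" bash -c "$RUN_COMMAND"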

Similar to #1168, but for SLURM clusters that use Singularity and have no internet access on the compute nodes.

Usage

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna.executor import XennaExecutor

# Define your pipeline
pipeline = Pipeline(...)
pipeline.add_stage(...)

# Use the XennaExecutor to run on the Ray cluster started by the sbatch script
executor = XennaExecutor()
results = pipeline.run(executor=executor)

On the SLURM side, the corresponding submission looks like:

export IMAGE=/path/to/nemo-curator_25.09.sif

RUN_COMMAND="python curator_pipeline.py" \
sbatch --nodes=2 --gres=gpu:4 ray-singularity-sbatch.sh
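
Other knobs can be overridden at submission time in the same way. A fuller, hypothetical submission might look like the following (GCS_PORT, CONTAINER_CMD, and HF_HUB_OFFLINE come from this PR's description; HF_HOME is one plausible way to point at a pre-populated Hugging Face cache and may not be the exact knob the script uses):

# Hypothetical submission with extra knobs; check the README for the actual names.
export IMAGE=/path/to/nemo-curator_25.09.sif
export CONTAINER_CMD=apptainer             # use Apptainer instead of Singularity
export GCS_PORT=6379                       # Ray GCS port on the head node
export HF_HOME=/scratch/$USER/hf_cache     # pre-populated Hugging Face cache
export HF_HUB_OFFLINE=1                    # script default; keeps compute nodes offline

RUN_COMMAND="python curator_pipeline.py" \
sbatch --nodes=4 --gres=gpu:4 --time=04:00:00 ray-singularity-sbatch.sh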

Checklist

  • [X] I am familiar with the Contributing Guide.
  • [X] New or existing tests cover these changes.
  • [X] The documentation is up to date with these changes.

federico-dambrosio · Nov 24 '25 14:11

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] · Nov 24 '25 14:11

Greptile Overview

Greptile Summary

Added deployment script and documentation for launching multi-node Ray clusters on SLURM systems using Singularity/Apptainer containers.

  • Introduced ray-singularity-sbatch.sh: configurable SLURM batch script that starts Ray head and worker nodes in containers
  • Added comprehensive README documenting setup, configuration options, and usage patterns
  • Script supports air-gapped clusters by default (HF_HUB_OFFLINE=1)
  • Provides resource auto-detection and flexible environment variable configuration
  • Includes proper cleanup handling via trap for temporary directories
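
A rough picture of the last two bullets above, assuming the script relies on standard SLURM environment variables and a shell trap (the actual variable and directory names may differ):

# Resource auto-detection from the SLURM allocation (illustrative).
NUM_CPUS=${SLURM_CPUS_PER_TASK:-$(nproc)}
NUM_GPUS=${SLURM_GPUS_ON_NODE:-0}

# Temporary Ray directories cleaned up on exit, interrupt, or termination.
RAY_TMP=$(mktemp -d "${TMPDIR:-/tmp}/ray_tmp.XXXXXX")
cleanup() {
  rm -rf "$RAY_TMP"
}
trap cleanup EXIT INT TERM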

Minor syntax improvements were suggested: quote path variables so that edge cases such as paths containing spaces are handled correctly (see the example below).
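
A hypothetical illustration of the quoting concern (the path below is made up):

RAY_TMP="$SCRATCH/ray tmp"   # a path containing a space
mkdir -p $RAY_TMP            # fragile: word-splits into two arguments
mkdir -p "$RAY_TMP"          # robust: treated as a single path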

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk: it adds isolated deployment tooling without modifying existing code.
  • The score reflects that this is a self-contained addition (new files only) with comprehensive documentation that follows shell scripting best practices. Only minor quoting issues were found, and they should be addressed for robustness with paths containing spaces. No functional logic errors or security issues were detected.
  • No files require special attention; only minor syntax improvements are suggested for edge-case handling.

Important Files Changed

File Analysis

  • tutorials/deployment/slurm/ray-singularity-sbatch.sh (4/5): Well-structured SLURM script that launches Ray clusters with Singularity/Apptainer, with proper resource detection and cleanup handling.
  • tutorials/deployment/slurm/README.md (5/5): Comprehensive documentation covering prerequisites, configuration, usage patterns, and troubleshooting for the SLURM Ray cluster script.

Sequence Diagram

sequenceDiagram
    participant User
    participant SLURM
    participant Script as ray-singularity-sbatch.sh
    participant HeadNode as Head Node (Singularity)
    participant WorkerNodes as Worker Nodes (Singularity)
    participant Ray as Ray Cluster
    
    User->>SLURM: sbatch with RUN_COMMAND
    SLURM->>Script: Allocate nodes & launch script
    
    Script->>Script: Parse SLURM_JOB_NODELIST
    Script->>Script: Create temp directories (ray_tmp, ray_workers_tmp, ray_spill)
    Script->>Script: Detect resources (CPUs, GPUs)
    Script->>Script: Configure environment & mounts
    
    Script->>HeadNode: srun - Start Ray head on node[0]
    HeadNode->>Ray: ray start --head (GCS_PORT, DASH_PORT, CLIENT_PORT)
    Note over HeadNode,Ray: Wait HEAD_STARTUP_WAIT seconds
    
    loop For each worker node[1..N]
        Script->>WorkerNodes: srun - Start Ray worker
        WorkerNodes->>Ray: ray start --address HEAD_IP:GCS_PORT
    end
    Note over WorkerNodes,Ray: Wait WORKER_STARTUP_WAIT seconds
    
    Script->>HeadNode: srun - Execute RUN_COMMAND
    HeadNode->>Ray: Python script connects to Ray cluster
    Ray-->>HeadNode: Execute pipeline tasks
    HeadNode-->>Script: Command completes
    
    Script->>Script: Cleanup trap fires (EXIT/INT/TERM)
    Script->>Script: Remove temp directories
    Script-->>SLURM: Job complete
    SLURM-->>User: Results in output files

greptile-apps[bot] · Nov 24 '25 14:11