
Support for FLASH

samiamjidkhan opened this issue 2 months ago

Motivation

Enables Exo to orchestrate FLASH MPI simulations across distributed Mac clusters. FLASH is a multiphysics astrophysics simulation code used for hydrodynamics, MHD, and radiation transport research.

Previously, running distributed FLASH simulations required manual hostfile creation and coordination. This integration allows Exo to:

  • Automatically discover nodes via existing mDNS/gossipsub
  • Generate MPI hostfiles from topology
  • Launch and monitor simulations via REST API
  • Report status through the existing dashboard infrastructure

Changes

New files:

  • src/exo/worker/runner/flash_runner.py - MPI runner that generates hostfiles and spawns mpirun

Modified files:

  • src/exo/shared/types/worker/instances.py - Added FLASHInstance type
  • src/exo/shared/types/commands.py - Added LaunchFLASH and StopFLASH commands
  • src/exo/master/placement.py - Added place_flash_instance() for node selection
  • src/exo/master/main.py - Added command handlers for FLASH launch/stop
  • src/exo/master/api.py - Added REST endpoints: POST /flash/launch, GET /flash/instances, DELETE /flash/{instance_id}
  • src/exo/worker/plan.py - Added _plan_flash() for FLASH instance lifecycle
  • src/exo/worker/runner/bootstrap.py - Added routing to flash_runner for FLASHInstance
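For illustration, the REST endpoints listed above could be exercised with a stdlib-only client along these lines. The payload fields here are assumptions for the sketch; the real request schema is defined in api.py, and the port assumes exo's main API default of 52415.

```python
import json
import urllib.request

# Hypothetical payload fields; the actual /flash/launch schema lives in api.py.
payload = {"binary": "/path/to/flash4", "num_nodes": 2}

req = urllib.request.Request(
    "http://localhost:52415/flash/launch",  # exo's main API port (assumed default)
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib infers POST from the presence of a request body;
# urllib.request.urlopen(req) would submit the launch request.
```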

Why It Works

EXO already handles node discovery and topology management. This PR bridges EXO's topology to MPI by:

  • Placement: Selects nodes from EXO topology, creates shard assignments (one runner per node)
  • Hostfile generation: Converts node IPs to hostnames (for SSH config matching) and writes MPI hostfile
  • Coordinator pattern: Rank 0 node runs mpirun --hostfile, which SSHs to other nodes to spawn processes
  • Status reporting: Runners emit standard EXO events (RunnerRunning, RunnerFailed, etc.)

MPI handles the actual inter-process communication during the simulation; Exo only orchestrates the spawn.
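The hostfile-generation and spawn steps above can be sketched as follows. The helper names are hypothetical; the real implementation in flash_runner.py differs, but the shape of the output matches the hostfile format shown in the test plan below.

```python
import socket

def resolve_short_hostname(ip: str) -> str:
    """Reverse-resolve a node IP to its short hostname (for SSH config
    matching), falling back to the raw IP if reverse DNS fails."""
    try:
        return socket.gethostbyaddr(ip)[0].split(".")[0]
    except OSError:
        return ip

def format_hostfile(hostnames, slots_per_node=1):
    """Render MPI hostfile lines, one per node, e.g. 's14 slots=1'."""
    return "".join(f"{h} slots={slots_per_node}\n" for h in hostnames)

def mpirun_argv(hostfile_path, binary, nprocs):
    """Build the command line the rank-0 coordinator spawns
    (e.g. via subprocess.Popen)."""
    return ["mpirun", "--hostfile", hostfile_path, "-np", str(nprocs), binary]
```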

Test Plan

Manual Testing

Hardware: 2x Mac Studio M2 Ultra 512GB

What we tested:

  • Started Exo on both nodes, verified discovery via /state endpoint (2 nodes visible)
  • Launched FLASH Sod shock-tube simulation via POST /flash/launch
  • Verified the hostfile was generated correctly: s14 slots=1 and james21-1 slots=1
  • Simulation completed 151 timesteps across both nodes
  • Output files created: sod_hdf5_chk_0000, sod_hdf5_chk_0001, sod_hdf5_plt_cnt_0000

Automated Testing

N/A

samiamjidkhan commented Jan 12 '26 05:01

Updated to remove the need for SSH entirely. It now uses a built-in remote execution system called exo-rsh: each exo node runs a small HTTP server for remote command execution, and mpirun spawns processes without SSH, keys, or passwords. To use it, simply run exo (uv run exo) and launch simulations via the API! @AlexCheema

samiamjidkhan commented Jan 13 '26 06:01

Updated to simplify exo-rsh by consolidating its HTTP server into EXO's existing API. Instead of running a separate server on port 52416, remote command execution now uses the /execute endpoint on EXO's main API (port 52415). Same SSH-free functionality, one less server. @AlexCheema
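To picture how such an SSH replacement plugs in: Open MPI allows swapping its rsh launcher for a custom agent (e.g. via the plm_rsh_agent MCA parameter), which is invoked like ssh as `agent <host> <command...>`. A minimal sketch of an agent in that style, forwarding the command to the node's /execute endpoint, might look like this. Everything beyond the /execute path and port 52415 (both described above) is an assumption about exo-rsh's internals.

```python
#!/usr/bin/env python3
"""Hypothetical exo-rsh-style agent: 'agent <host> <cmd...>' -> POST /execute."""
import json
import sys
import urllib.request

def build_request(host: str, argv: list) -> urllib.request.Request:
    # POST the remote command to the target node's main API (port 52415).
    return urllib.request.Request(
        f"http://{host}:52415/execute",
        data=json.dumps({"cmd": argv}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def main() -> None:
    host, argv = sys.argv[1], sys.argv[2:]
    with urllib.request.urlopen(build_request(host, argv)) as resp:
        sys.stdout.write(resp.read().decode("utf-8", "replace"))

if __name__ == "__main__" and len(sys.argv) >= 3:
    main()
```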

samiamjidkhan commented Jan 15 '26 03:01

Update: Plugin Architecture & SLURM CLI

Summary

Added a modular plugin system for exo. New capabilities can be added by dropping code into src/exo/plugins/implementations/ without modifying core source code. FLASH is the first plugin.

Plugin Interface

class ExoPlugin(ABC):
    name: str                                    # Plugin identifier
    version: str                                 # Semantic version
    get_command_types() -> Sequence[type]        # Commands to handle
    get_instance_type() -> type                  # Instance type to manage
    get_api_routes() -> Sequence[tuple]          # HTTP endpoints
    process_command(...) -> Sequence[Event]      # Command processing
    plan_task(...) -> Task | None                # Worker planning
    create_runner(...)                           # Runner bootstrap
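The registry side is not shown in the summary above. One simple way it could work, sketched here under the assumption that plugins are subclasses of ExoPlugin (the real registry in src/exo/plugins/ may instead scan implementations/ with importlib):

```python
from abc import ABC

class ExoPlugin(ABC):
    """Trimmed-down stand-in for the plugin base class shown above."""
    name: str
    version: str

def registered_plugins() -> dict:
    """Collect every imported ExoPlugin subclass, keyed by its plugin name."""
    return {
        cls.name: cls
        for cls in ExoPlugin.__subclasses__()
        if getattr(cls, "name", None)
    }

# Dropping a module into implementations/ that defines a subclass is enough
# for it to be picked up once imported:
class FlashPlugin(ExoPlugin):   # hypothetical FLASH plugin stub
    name = "flash"
    version = "0.1.0"
```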

SLURM-Compatible CLI

# Submit job
exo sbatch -J my_sim -N 2 --hosts s14,james21 -D /tmp /path/to/flash4

# View queue
exo squeue
JOBID     NAME    NODES  STATE
bc0ba007  my_sim  2      RUNNING

# Cancel job
exo scancel bc0ba007

How It Works

  1. sbatch calls /flash/launch API endpoint
  2. Plugin places job across nodes
  3. Coordinator runs mpirun with exo-rsh (no SSH needed)
  4. exo-rsh spawns processes via each node's /execute endpoint
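Step 4 can be illustrated with a minimal node-side handler. This is a sketch only, assuming a JSON body like {"cmd": [...]}; exo's real /execute handler is part of its main API (port 52415), not a standalone server.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class ExecuteHandler(BaseHTTPRequestHandler):
    """Hypothetical /execute handler: runs a requested command locally."""

    def do_POST(self):
        if self.path != "/execute":
            self.send_error(404)
            return
        # Read the JSON body, e.g. {"cmd": ["orted", "..."]} from exo-rsh.
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        # Run the command and return its combined output to the caller.
        proc = subprocess.run(body["cmd"], capture_output=True)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(proc.stdout + proc.stderr)

    def log_message(self, *args):  # keep the sketch quiet
        pass

# HTTPServer(("0.0.0.0", 52415), ExecuteHandler).serve_forever() would serve it.
```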

Files Changed

Added: src/exo/plugins/ (base, registry, context) + implementations/flash/

Modified: api.py, main.py, plan.py, bootstrap.py - added plugin hooks

Removed: FLASH-specific code from core files

Tested

  • 1-node and 2-node distributed FLASH simulations
  • sbatch, squeue, scancel commands

samiamjidkhan commented Jan 20 '26 03:01