Support for FLASH
Motivation
Enables Exo to orchestrate FLASH MPI simulations across distributed Mac clusters. FLASH is a multiphysics astrophysics simulation code used for hydrodynamics, MHD, and radiation transport research.
Previously, running distributed FLASH simulations required manual hostfile creation and coordination. This integration allows Exo to:
- Automatically discover nodes via existing mDNS/gossipsub
- Generate MPI hostfiles from topology
- Launch and monitor simulations via REST API
- Report status through the existing dashboard infrastructure
Changes
New files:
- src/exo/worker/runner/flash_runner.py - MPI runner that generates hostfiles and spawns mpirun
Modified files:
- src/exo/shared/types/worker/instances.py - Added FLASHInstance type
- src/exo/shared/types/commands.py - Added LaunchFLASH and StopFLASH commands
- src/exo/master/placement.py - Added place_flash_instance() for node selection
- src/exo/master/main.py - Added command handlers for FLASH launch/stop
- src/exo/master/api.py - Added REST endpoints: POST /flash/launch, GET /flash/instances, DELETE /flash/{instance_id}
- src/exo/worker/plan.py - Added _plan_flash() for FLASH instance lifecycle
- src/exo/worker/runner/bootstrap.py - Added routing to flash_runner for FLASHInstance
Why It Works
EXO already handles node discovery and topology management. This PR bridges EXO's topology to MPI by:
- Placement: Selects nodes from EXO topology, creates shard assignments (one runner per node)
- Hostfile generation: Converts node IPs to hostnames (for SSH config matching) and writes MPI hostfile
- Coordinator pattern: Rank 0 node runs mpirun --hostfile, which SSHs to other nodes to spawn processes
- Status reporting: Runners emit standard EXO events (RunnerRunning, RunnerFailed, etc.)
MPI handles the actual inter-process communication during the simulation; Exo only orchestrates the spawn.
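The hostfile step comes down to one `<host> slots=<n>` line per node. Below is a minimal sketch, assuming Open MPI-style hostfile syntax (consistent with the `s14 slots=1` lines reported in the test plan). `write_hostfile` and its parameters are hypothetical names for illustration, not the function in flash_runner.py.

```python
import tempfile
from pathlib import Path


def write_hostfile(hostnames: list[str], path: Path, slots: int = 1) -> str:
    """Write one `<host> slots=<n>` line per node and return the file contents.

    Hypothetical helper: the real runner may order or name things differently.
    """
    lines = [f"{host} slots={slots}" for host in hostnames]
    content = "\n".join(lines) + "\n"
    path.write_text(content)
    return content


def demo() -> str:
    # Hostnames taken from the manual test in this PR.
    with tempfile.TemporaryDirectory() as tmp:
        return write_hostfile(["s14", "james21-1"], Path(tmp) / "hostfile")
```

Calling `demo()` returns the two-line hostfile for the test cluster; the coordinator would then pass the file's path to `mpirun --hostfile`.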
Test Plan
Manual Testing
Hardware: 2x Mac Studio M2 Ultra 512GB
What we tested:
- Started Exo on both nodes, verified discovery via /state endpoint (2 nodes visible)
- Launched FLASH Sod shock-tube simulation via POST /flash/launch
- Verified hostfile generated correctly: s14 slots=1 + james21-1 slots=1
- Simulation completed 151 timesteps across both nodes
- Output files created: sod_hdf5_chk_0000, sod_hdf5_chk_0001, sod_hdf5_plt_cnt_0000
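The three REST endpoints used in this test can also be driven from a short script. The sketch below only builds the requests, so it runs without a live cluster. The endpoint paths and port 52415 come from this PR; the `localhost` host and the JSON fields in the usage note are assumptions, since the request schema is not shown here.

```python
import json
from urllib.request import Request

# Port 52415 is EXO's main API port per this PR; the host is an assumption.
BASE_URL = "http://localhost:52415"


def launch_request(payload: dict) -> Request:
    """Build POST /flash/launch. The payload schema is hypothetical."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        f"{BASE_URL}/flash/launch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def list_request() -> Request:
    """Build GET /flash/instances."""
    return Request(f"{BASE_URL}/flash/instances", method="GET")


def stop_request(instance_id: str) -> Request:
    """Build DELETE /flash/{instance_id}."""
    return Request(f"{BASE_URL}/flash/{instance_id}", method="DELETE")
```

To send one of these, pass it to `urllib.request.urlopen`, for example `urlopen(launch_request({"simulation": "sod"}))`, where the `simulation` field is a placeholder rather than a documented parameter.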
Automated Testing
N/A
Updated to remove the need for SSH entirely. It now uses a built-in remote execution system called exo-rsh: each EXO node runs a small HTTP server for remote command execution, and mpirun spawns processes through it without SSH, keys, or passwords. To use it, run EXO (uv run exo) and launch simulations via the API. @AlexCheema
Updated to simplify exo-rsh by consolidating its HTTP server into EXO's existing API. Instead of running a separate server on port 52416, remote command execution now uses the /execute endpoint on EXO's main API (port 52415). Same SSH-free functionality, one less server. @AlexCheema
Update: Plugin Architecture & SLURM CLI
Summary
Added a modular plugin system for exo. New capabilities can be added by dropping code into src/exo/plugins/implementations/ without modifying core source code. FLASH is the first plugin.
Plugin Interface
class ExoPlugin(ABC):
    name: str                                   # Plugin identifier
    version: str                                # Semantic version
    get_command_types() -> Sequence[type]       # Commands to handle
    get_instance_type() -> type                 # Instance type to manage
    get_api_routes() -> Sequence[tuple]         # HTTP endpoints
    process_command(...) -> Sequence[Event]     # Command processing
    plan_task(...) -> Task | None               # Worker planning
    create_runner(...)                          # Runner bootstrap
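To make the interface more concrete, here is a reduced sketch of what a plugin and a command-dispatch table could look like. It covers only the members whose signatures are fully shown above; `process_command`, `plan_task`, and `create_runner` are left out because their arguments are elided. `HelloPlugin`, `LaunchHello`, `HelloInstance`, `build_command_table`, and the `(method, path, handler)` route layout are all hypothetical, and the `ExoPlugin` class here is a stand-in, not the class in src/exo/plugins/.

```python
from abc import ABC, abstractmethod
from collections.abc import Sequence


class ExoPlugin(ABC):
    """Reduced stand-in for the plugin interface (assumed shape)."""

    name: str
    version: str

    @abstractmethod
    def get_command_types(self) -> Sequence[type]: ...

    @abstractmethod
    def get_instance_type(self) -> type: ...

    @abstractmethod
    def get_api_routes(self) -> Sequence[tuple]: ...


class LaunchHello:
    """Hypothetical command type."""


class HelloInstance:
    """Hypothetical instance type."""


class HelloPlugin(ExoPlugin):
    name = "hello"
    version = "0.1.0"

    def get_command_types(self) -> Sequence[type]:
        return [LaunchHello]

    def get_instance_type(self) -> type:
        return HelloInstance

    def get_api_routes(self) -> Sequence[tuple]:
        # Assumed tuple layout: (HTTP method, path, handler).
        return [("POST", "/hello/launch", self.launch)]

    def launch(self) -> dict:
        return {"status": "launched"}


def build_command_table(plugins: Sequence[ExoPlugin]) -> dict[type, ExoPlugin]:
    """Map each command type to the plugin that declared it."""
    table: dict[type, ExoPlugin] = {}
    for plugin in plugins:
        for command_type in plugin.get_command_types():
            table[command_type] = plugin
    return table
```

With a table like this, core code could look up `table[type(command)]` and hand the command to the owning plugin without knowing about any specific plugin, which is the decoupling the plugin system is aiming for.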
SLURM-Compatible CLI
# Submit job
exo sbatch -J my_sim -N 2 --hosts s14,james21 -D /tmp /path/to/flash4
# View queue
exo squeue
JOBID     NAME    NODES  STATE
bc0ba007  my_sim  2      RUNNING
# Cancel job
exo scancel bc0ba007
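As an illustration of how the `sbatch` line above decomposes, the following sketch parses the same flags with `argparse`. It assumes `-J`, `-N`, and `-D` carry their usual SLURM meanings (job name, node count, working directory) and that `--hosts` takes a comma-separated list. The parser and its destination names are hypothetical, not the CLI code added in this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags in the `exo sbatch` example."""
    parser = argparse.ArgumentParser(prog="exo sbatch")
    parser.add_argument("-J", dest="job_name", default=None, help="job name")
    parser.add_argument("-N", dest="nodes", type=int, default=1, help="node count")
    parser.add_argument(
        "--hosts",
        type=lambda value: value.split(","),  # "s14,james21" -> ["s14", "james21"]
        default=[],
        help="comma-separated host list",
    )
    parser.add_argument("-D", dest="workdir", default=".", help="working directory")
    parser.add_argument("executable", help="path to the binary to run")
    return parser


def parse_sbatch(argv: list[str]) -> argparse.Namespace:
    return build_parser().parse_args(argv)
```

Parsing the example arguments yields `job_name="my_sim"`, `nodes=2`, `hosts=["s14", "james21"]`, `workdir="/tmp"`, and `executable="/path/to/flash4"`, which is the information a launch request needs to carry.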
How It Works
- sbatch calls the /flash/launch API endpoint
- Plugin places job across nodes
- Coordinator runs mpirun with exo-rsh (no SSH needed)
- exo-rsh spawns processes via each node's /execute endpoint
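The coordinator step can be pictured as assembling an `mpirun` command that swaps the default SSH launcher for `exo-rsh`. This sketch assumes Open MPI, where the remote launch agent is selected with the `plm_rsh_agent` MCA parameter. The PR does not state which MPI implementation or flags are actually used, so treat the exact command line as an assumption; `build_mpirun_command` is a hypothetical helper.

```python
import shlex


def build_mpirun_command(
    hostfile: str,
    executable: str,
    num_ranks: int,
    rsh_agent: str = "exo-rsh",
) -> list[str]:
    """Assemble an Open MPI-style mpirun invocation (assumed flags)."""
    return [
        "mpirun",
        "--hostfile", hostfile,
        "-np", str(num_ranks),
        # Assumption: Open MPI's rsh launcher, pointed at exo-rsh instead of ssh.
        "--mca", "plm_rsh_agent", rsh_agent,
        executable,
    ]


def as_shell_string(command: list[str]) -> str:
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(command)
```

The argument list can be passed directly to `subprocess.run(command, cwd=workdir)` on the rank 0 node; keeping it as a list avoids shell quoting issues in paths.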
Files Changed
Added: src/exo/plugins/ (base, registry, context) + implementations/flash/
Modified: api.py, main.py, plan.py, bootstrap.py - added plugin hooks
Removed: FLASH-specific code from core files
Tested
- 1-node and 2-node distributed FLASH simulations
- sbatch, squeue, scancel commands