Cannot `using Trixi` on a cluster (RAMSES)

Open efaulhaber opened this issue 7 months ago • 5 comments

I'm on a node of the UoC's cluster RAMSES, and I get this:

```julia
julia> using Trixi
slurmstepd: error: mpi/pmi2: invalid pmi1 command received: 'init'
```

It just freezes there. The hang comes from `init_mpi()` in `__init__()`.

Trixi.jl is not usable on this system, as it never gets past initialization.

efaulhaber May 14 '25 17:05

For some reason, it now works on the same machine. Closing for now.

efaulhaber May 15 '25 14:05

Update: It's back.

efaulhaber May 16 '25 08:05

This sounds like something you need to discuss with your cluster admin?

Does this reproduce with `MPI.Init()` alone? Please post the output of `MPI.versioninfo()`.

vchuravy May 16 '25 08:05

You're right:

```julia
julia> using MPI

julia> MPI.versioninfo()
MPIPreferences:
  binary:  MPICH_jll
  abi:     MPICH

Package versions
  MPI.jl:             0.20.22
  MPIPreferences.jl:  0.1.11
  MPICH_jll:          4.3.0+1

Library information:
  libmpi:  /scratch/efaulha2/.julia/artifacts/05d8c79b270470018e9de8dd24ddb6d7954aff9d/lib/libmpi.so
  libmpi dlpath:  /scratch/efaulha2/.julia/artifacts/05d8c79b270470018e9de8dd24ddb6d7954aff9d/lib/libmpi.so
  MPI version:  4.1.0
  Library version:
    MPICH Version:      4.3.0
    MPICH Release date: Mon Feb  3 09:09:47 AM CST 2025
    MPICH ABI:          17:0:5
    MPICH Device:       ch3:nemesis
    MPICH configure:    --build=x86_64-linux-musl --disable-dependency-tracking --disable-doc --enable-fast=ndebug,O3 --enable-static=no --host=x86_64-linux-gnu --prefix=/workspace/destdir --with-device=ch3 --with-hwloc=/workspace/destdir
    MPICH CC:           cc     -DNDEBUG -DNVALGRIND -O3
    MPICH CXX:          c++   -DNDEBUG -DNVALGRIND -O3
    MPICH F77:          gfortran   -O3
    MPICH FC:           gfortran   -O3
    MPICH features:

julia> MPI.Init()
slurmstepd: error: mpi/pmi2: invalid pmi1 command received: 'init'
```

efaulhaber May 16 '25 09:05

It is likely that Slurm is doing some shenanigans and notices that you are trying to use a non-Slurm MPI, e.g. MPICH_jll initializing over the PMI1 protocol while Slurm's `mpi/pmi2` plugin expects PMI2.

I would recommend using the system MPI directly. https://juliaparallel.org/MPI.jl/latest/configuration/#using_system_mpi
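
A minimal sketch of that switch, assuming the cluster's MPI installation has already been made visible in the shell (for example via an environment module; the module name is site-specific and not given in this thread). `MPIPreferences.use_system_binary()` is the entry point described on the linked page; whether it finds the right library on RAMSES is untested here.

```julia
# Run once in the project environment that loads Trixi.jl / MPI.jl.
# Assumption: the system MPI library is on the default library search path,
# e.g. because the cluster's MPI module was loaded before starting Julia.
using Pkg
Pkg.add("MPIPreferences")

using MPIPreferences

# Looks for a system libmpi and records the choice in LocalPreferences.toml,
# replacing the default MPICH_jll artifact shown in the output above.
MPIPreferences.use_system_binary()

# Restart Julia afterwards, then check which library is picked up:
#   using MPI
#   MPI.versioninfo()   # `binary:` should no longer read `MPICH_jll`
#   MPI.Init()
```

The preference is stored per project, so it has to be set in the same environment from which `using Trixi` is run.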

vchuravy May 16 '25 10:05