Running Flux under Spindle may need a re-investigation

Open SteVwonder opened this issue 4 years ago • 5 comments

@vchuravy attempted to run Flux under Spindle, but ran into some errors:

bash-4.2$ srun -N ${SLURM_NNODES} -n ${SLURM_NNODES} --pty --mpi=none --mpibind=off flux start -- bash            
bash-4.2$ exit
bash-4.2$ spindle srun -N ${SLURM_NNODES} -n ${SLURM_NNODES} --pty --mpi=none --mpibind=off flux start -- bash
... hangs
srun: error: quartz16: task 0: Exited with exit code 255
srun: error: quartz19: task 3: Exited with exit code 255
srun: error: quartz18: task 2: Exited with exit code 255
srun: error: quartz17: task 1: Exited with exit code 255

And

bash-4.2$ spindle --slurm --no-mpi srun -N ${SLURM_NNODES} -n ${SLURM_NNODES} --pty --mpi=none --mpibind=off flux start -- bash
2021-05-01T19:11:27.405720Z broker.err[1]: rc1.0: ERROR: ld.so: object '/var/tmp/churavy1/spindle.72914/0-_usr_tce_packages_spindle_spindle_lib_spindle_libspindle_audit_pipe.so' cannot be loaded as audit interface: cannot open shared object file; ignored. 

This is a lower priority issue for Valentin, but I'll add it to my todo list and give it a run myself, since we'll need this support eventually when Flux is the RM on LC systems. We might need to add certain flags like -a no and/or --slurm based on #1514. Once we have the right incantation, we can add it to our docs.

SteVwonder May 03 '21 03:05

Does /var/tmp/churavy1/spindle.72914/0-_usr_tce_packages_spindle_spindle_lib_spindle_libspindle_audit_pipe.so exist?

From the error message, I'm not sure whether this is a Flux issue or a Spindle issue. It looks as if 0-_usr_tce_packages_spindle_spindle_lib_spindle_libspindle_audit_pipe.so couldn't be loaded through the LD_AUDIT interface that Spindle is using... What OS is this? Could there be an interface change in LD_AUDIT?
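
One way to answer the "does it exist?" question is to check every entry of LD_AUDIT from inside the environment where the error shows up. The function below is only a sketch: check_audit_paths is a hypothetical helper, not part of Flux or Spindle, and it assumes Spindle passes its audit library through the colon-separated LD_AUDIT environment variable, as the ld.so error suggests.

```sh
# Hypothetical helper (not part of Flux or Spindle): report whether each
# entry of a colon-separated LD_AUDIT-style list exists and is readable.
# With no argument it inspects the current $LD_AUDIT.
check_audit_paths() (
    # The function body is a subshell, so the IFS and globbing changes
    # below do not leak into the calling shell.
    list="${1-${LD_AUDIT-}}"
    status=0
    IFS=':'
    set -f
    for entry in $list; do
        if [ -z "$entry" ]; then
            continue
        fi
        if [ -r "$entry" ]; then
            printf 'OK %s\n' "$entry"
        else
            printf 'MISSING %s\n' "$entry"
            status=1
        fi
    done
    # Exit status of the subshell becomes the function's return value:
    # 0 if every entry was readable, 1 otherwise.
    exit "$status"
)
```

Calling check_audit_paths with no arguments on each node would show whether the relocated library under /var/tmp is actually present there. A MISSING line would point at the file not being staged on that node; an OK line would suggest the load failure has some other cause.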

dongahn May 03 '21 16:05

What OS is this? Could there be an interface change in LD_AUDIT?

This was on quartz with the spindle module in lmod.

What I wanted in particular was to run Jobspec submitted to the broker with spindle. I am not that worried about starting flux with spindle, but I suspect I need to have that so that spindle can catch the exec of the jobspec.

vchuravy May 03 '21 17:05

What I wanted in particular was to run Jobspec submitted to the broker with spindle. I am not that worried about starting flux with spindle, but I suspect I need to have that so that spindle can catch the exec of the jobspec.

Yeah, since Spindle works at the Slurm level, this is necessary. Ideally, Spindle would be directly integrated with Flux so that it only distributes the shared objects for the exec of the jobspec, but I don't think that work is there yet. Even then, when Flux nests, Spindle will have to deal with the same situation of needing to relocate Flux shared objects, so it would be good to fix the problem with the current mode.

dongahn May 03 '21 18:05

@dongahn as we discussed at SC, it would be great to have a solid Spindle/Flux integration

vchuravy Nov 17 '21 21:11

Tagging @jameshcorbett for now and I will discuss this with @mplegendre and @jameshcorbett when I get back in town.

dongahn Nov 18 '21 03:11

Spindle integration with Flux is a pending PR here: https://github.com/hpc/Spindle/pull/50. Closing this issue.

grondo Apr 11 '23 13:04