SlurmClusterManager.jl

Slurm: broken (hangs on `connecting to worker 1 out of <N>`)

Open MilesCranmer opened this issue 1 year ago • 12 comments

Is anybody maintaining this package? I haven't been able to get Slurm working for the past month or so... It just ends up stalling on `connecting to worker 1 out of <N>`:

julia> p = addprocs_slurm(2)
connecting to worker 1 out of <N>

The exact same code seemed to work a month ago. This is slurm 22.05.8. Not sure if this new version is breaking things or not.

MilesCranmer avatar Feb 10 '24 23:02 MilesCranmer

I don't have access to a Slurm cluster right now, but it would be useful to know whether a previous version was okay

Moelf avatar Feb 11 '24 01:02 Moelf

Not sure I know how to test other versions of slurm... I am stuck with whatever my institute cluster has installed

MilesCranmer avatar Feb 11 '24 02:02 MilesCranmer

I guess in this case, ask the HPC admin to see if they know of anything that might be causing the problem

Moelf avatar Feb 11 '24 04:02 Moelf

Ugh, @MilesCranmer that's annoying. I also don't currently have access to a SLURM cluster... this is the kind of thing where it would be nice to have JuliaParallel/ClusterManagers.jl#105, so that we could test on different schedulers :facepalm:

Would definitely be worth checking in with the cluster admin to see if SLURM was recently updated so we can at least know that that's the culprit.

kescobo avatar Feb 11 '24 18:02 kescobo

We could if this PR gets finished: https://github.com/JuliaParallel/ClusterManagers.jl/pull/193

MilesCranmer avatar Feb 11 '24 18:02 MilesCranmer

@kescobo I can confirm that this issue started for me after an upgrade to Slurm on my institution's cluster. Unfortunately, I don't know what the previous version was, but currently the version is 23.11.1

cnrrobertson avatar Mar 05 '24 00:03 cnrrobertson

From the initial post, it looks like it goes back to v22:

> This is slurm 22.05.8

Does anyone know if SLURM follows SemVer?

kescobo avatar Mar 05 '24 01:03 kescobo

I'm getting the same error with the same behaviour: code that used to work fine is now broken. The Slurm version is 20.11.7, and the cluster admin confirms there's been no upgrade over the past year.

jewh avatar Nov 28 '24 17:11 jewh

@MilesCranmer Does the SlurmClusterManager.jl package work for you?

DilumAluthge avatar Jan 02 '25 04:01 DilumAluthge

Bump @MilesCranmer - I just wanted to check if the SlurmClusterManager.jl package works for you?

If so, I think we can add a note to the README recommending that users use SlurmClusterManager.jl, and then we can close this issue.

DilumAluthge avatar Jan 16 '25 23:01 DilumAluthge

Yes, thanks (I think once the project activation stuff merges, it should be good to go)

MilesCranmer avatar Jan 17 '25 12:01 MilesCranmer

@MilesCranmer Just closing the loop here - can you confirm that your issue has been resolved by using the latest release (v1.0.0) of the SlurmClusterManager.jl package?

DilumAluthge avatar Feb 15 '25 17:02 DilumAluthge