Slurm: broken (hangs on `connecting to worker 1 out of <N>`)
Is anybody maintaining this package? I haven't been able to get Slurm working for the past month or so... It just ends up stalling on connecting to worker 1 out of <N>:
julia> p = addprocs_slurm(2)
connecting to worker 1 out of <N>
The exact same code seemed to work a month ago. This is slurm 22.05.8. Not sure if this new version is breaking things or not.
I don't have access to a Slurm cluster right now, but it would be useful to know whether a previous version worked.
Not sure I know how to test other versions of slurm... I am stuck with whatever my institute cluster has installed
I guess in this case, ask the HPC admin to see if they know of anything that might be causing the problem.
Ugh, @MilesCranmer that's annoying. I also don't currently have access to a SLURM cluster... this is the kind of thing where it would be nice to have JuliaParallel/ClusterManagers.jl#105, so that we could test on different schedulers :facepalm:
Would definitely be worth checking in with the cluster admin to see if SLURM was recently updated so we can at least know that that's the culprit.
We could, if this PR gets finished: https://github.com/JuliaParallel/ClusterManagers.jl/pull/193
@kescobo I can confirm that this issue started for me after an upgrade to Slurm on my institution's cluster. Unfortunately, I don't know what the previous version was, but currently the version is 23.11.1
From the initial post, it looks like it goes back to v22
> This is slurm 22.05.8
Does anyone know if SLURM follows SemVer?
I'm getting the same error with the same behaviour - code that previously worked fine is now broken. The Slurm version is 20.11.7, and the cluster admin confirms there has been no upgrade over the past year.
@MilesCranmer Does the SlurmClusterManager.jl package work for you?
Bump @MilesCranmer - I just wanted to check if the SlurmClusterManager.jl package works for you?
If so, I think we can add a note to the README recommending that users use SlurmClusterManager.jl, and then we can close this issue.
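For anyone landing here, a minimal sketch of what trying SlurmClusterManager.jl might look like. It assumes the package is installed in the active project and that the script runs inside an existing Slurm allocation (for example one created by `sbatch` or `salloc`). The sanity-check loop is only illustrative, and nothing here has been verified against the clusters discussed in this thread.

```julia
# Sketch only: run this from inside a Slurm allocation (sbatch or salloc).
using Distributed
using SlurmClusterManager

# SlurmManager takes the worker layout from the surrounding allocation,
# so no worker count is passed here (unlike addprocs_slurm(2) above).
addprocs(SlurmManager())

# Illustrative sanity check: ask each worker which host it is running on.
for w in workers()
    host = remotecall_fetch(gethostname, w)
    println("worker $w is running on $host")
end
```

If `addprocs` returns and the loop prints one line per worker, the workers connected rather than stalling at the connection step.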
Yes, thanks (I think once the project activation stuff merges, it should be good to go)
@MilesCranmer Just closing the loop here - can you confirm that your issue has been resolved by using the latest release (v1.0.0) of the SlurmClusterManager.jl package?