SGE: can't even add processors
julia> VERSION
v"1.1.0"
julia> using ClusterManagers#master
julia> ClusterManagers.addprocs_qrsh(5, queue="hs22")
Error launching workers
MethodError(iterate, (Process(`qrsh -q hs22 -V -N julia-26131 -now n cd /home/dorban '&&' /apps/local-fci/tools/julia-1.1.0/bin/julia --worker=zwJ987ih2wft8egg`, ProcessRunning),), 0x00000000000063e1)
0-element Array{Int64,1}
My cluster doesn't support qrsh. Submitting jobs from the command line works fine. Any ideas on how to get this package to work?
For some reason, ClusterManagers.addprocs_sge sometimes succeeds. I tried to add 28 workers, and it's been printing dots for over an hour now. Is this expected?!
I am not sure I understand this issue. You are running on an SGE cluster and trying to add processes with addprocs_qrsh? That won't work.
> For some reason, ClusterManagers.addprocs_sge sometimes succeeds. I tried to add 28 workers, and it's been printing dots for over an hour now. Is this expected?!
That sounds like a bug. Can you post your entire script? I haven't run on SGE in a while, but on SLURM there is salloc which you can use to allocate resources before you use srun. Maybe there is something similar for SGE?
Apologies, that was a cut and paste of the wrong piece of code (I tried qrsh following some comments found on this issue tracker). I use addprocs_sge, but it doesn't succeed consistently. My script is easy enough:
using ClusterManagers
ClusterManagers.addprocs_sge(28, queue="hs22")
There are indeed 28 compute nodes available, but addprocs_sge never returns. I'll look for something similar to salloc, thanks.
Yeah, so either you get stuck allocating forever or ~the time-out doesn't trigger.~ Looks like qsub doesn't have a time-out. You may want to instrument this code:
https://github.com/JuliaParallel/ClusterManagers.jl/blob/e375f50f2c4eab3d3f4cefcea3465c82734cfb71/src/qsub.jl#L83
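One way to act on the instrumentation suggestion is to wrap the blocking call in a task and stop waiting after a deadline. The sketch below is only an illustration: `with_timeout` is a hypothetical helper, not part of ClusterManagers.jl, and it does not cancel anything on the scheduler side, so jobs that were already submitted would still need to be removed with `qdel`.

```julia
# Hypothetical helper (not part of ClusterManagers.jl): run `f` in a task and
# stop waiting after `timeout` seconds. Returns `nothing` on time-out,
# otherwise the value returned by `f`.
function with_timeout(f, timeout::Real; poll::Real=0.1)
    t = @async f()
    # timedwait polls the predicate until it is true or the deadline passes.
    # Arguments are converted to Float64 because older 1.x releases require it.
    status = timedwait(() -> istaskdone(t), Float64(timeout); pollint=Float64(poll))
    status === :ok || return nothing   # timed out; the task itself keeps running
    return fetch(t)                    # rethrows if `f` failed
end

# Stand-in workloads so the sketch runs without a cluster.
quick = with_timeout(() -> 42, 2.0)
slow  = with_timeout(() -> (sleep(5); 1), 0.3)
println("quick call returned: ", quick)
println("slow call timed out: ", slow === nothing)

# On a cluster the call might look like this (assumption, untested here):
# pids = with_timeout(600) do
#     ClusterManagers.addprocs_sge(28, queue="hs22")
# end
```

If the wrapped call times out, that at least distinguishes "the scheduler never started the jobs" from "the jobs started but the workers never connected back", which can then be checked with qstat.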
Hi, I am seeing the same issue with Julia 1.3.1 and SGE 8.1.8.
Running:
addprocs_sge(4, queue=$ProvideByAdmin)
I get:
Error launching workers
MethodError(iterate, (Base.ProcessChain(Base.Process[Process(echo 'cd /home/alequa/Documents/Research/phd_project/simulations/tripod && /home/alequa/Documents/Research/julia-1.3.1/bin/julia --worker=sFj9Hl4l94yYKPoz', ProcessExited(0)), Process(qsub -N julia-27738 -terse -j y -R y -t 1-1 -V -q single.q, ProcessRunning)], Base.DevNull(), Base.PipeEndpoint(RawFD(0x00000019) open, 0 bytes waiting), Base.DevNull()),), 0x0000000000006890)
0-element Array{Int64,1}
But the jobs do start, and I can see them in qstat:
1552178 0.55500 julia-2773 alequa r 03/19/2020 18:56:59 [email protected] 1 1
1552178 0.55500 julia-2773 alequa r 03/19/2020 18:56:59 [email protected] 1 2
Can you help? Thanks, Alessio
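The `MethodError(iterate, (Process(...),))` in both traces is the error Julia raises when code tries to destructure a single process object as if it were a tuple. One plausible cause, which is an assumption and not confirmed anywhere in this thread, is launch code written in the pre-0.7 style, where `open(cmd)` returned a `(stream, process)` pair; on Julia 1.x it returns only the process object. That would also fit the jobs showing up in qstat, since the command is spawned before the destructuring fails. A minimal reproduction on a Unix-like system, using `echo` as a stand-in for `qsub`:

```julia
# Stand-in command; the real launch code runs qsub or qrsh instead.
cmd = `echo hello`

# Pre-0.7 style: destructure the result of open(cmd) into two values.
# On Julia 1.x open(cmd) returns a single Base.Process, which is not
# iterable, so the destructuring throws MethodError(iterate, ...).
err = try
    out, proc = open(cmd)
    nothing
catch e
    e
end
println("destructuring failed with a MethodError: ", err isa MethodError)

# Julia 1.x style: keep the process object and read from it directly.
proc = open(cmd)
line = readline(proc)   # the process object can be read like a stream
wait(proc)              # wait for the command to exit
println("output read from the process: ", line)
```

If this is what is happening, the place to check would be wherever the manager calls `open` on the qsub or qrsh command and unpacks the result.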
These sporadic issues, where it sometimes works and sometimes doesn't, are quite annoying. I am experiencing something similar with the LSF manager. I will try to debug it in the next few days and will share here anything I learn that could be ported to the SGE manager.