ClusterManagers.jl

SGE: can't even add processors

Open dpo opened this issue 6 years ago • 7 comments

julia> VERSION
v"1.1.0"

julia> using ClusterManagers#master

julia> ClusterManagers.addprocs_qrsh(5, queue="hs22")
Error launching workers
MethodError(iterate, (Process(`qrsh -q hs22 -V -N julia-26131 -now n cd /home/dorban '&&' /apps/local-fci/tools/julia-1.1.0/bin/julia --worker=zwJ987ih2wft8egg`, ProcessRunning),), 0x00000000000063e1)
0-element Array{Int64,1}

My cluster doesn't support qrsh. Submitting jobs from the command line works fine. Any ideas on how to get this package to work?
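A quick way to confirm which SGE command-line tools the Julia session can actually see is sketched below. `available_tools` is a hypothetical helper written for this thread, not part of ClusterManagers.jl, and it assumes that a launcher can only work if its executable is on the `PATH` of the process that calls `addprocs_*`.

```julia
# Hypothetical helper (not part of ClusterManagers.jl): report which
# scheduler commands are visible on the PATH of the current Julia session.
function available_tools(names = ["qsub", "qrsh", "qstat"])
    # Sys.which returns the full path of the executable, or `nothing`.
    return Dict(name => (Sys.which(name) !== nothing) for name in names)
end

for (name, found) in available_tools()
    println(rpad(name, 6), found ? "found" : "not on PATH")
end
```

If `qrsh` is reported as missing while `qsub` is found, that would be consistent with command-line submission working while `addprocs_qrsh` fails.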

dpo • Sep 01 '19 05:09

For some reason, ClusterManagers.addprocs_sge sometimes succeeds. I tried to add 28 workers, and it's been printing dots for over an hour now. Is this expected?!

dpo • Sep 02 '19 00:09

I am not sure I understand this issue. You are running on an SGE cluster and you are trying to add processes with addprocs_qrsh? That won't work.

> For some reason, ClusterManagers.addprocs_sge sometimes succeeds. I tried to add 28 workers, and it's been printing dots for over an hour now. Is this expected?!

That sounds like a bug. Can you post your entire script? I haven't run on SGE in a while, but on SLURM there is salloc which you can use to allocate resources before you use srun. Maybe there is something similar for SGE?

vchuravy • Sep 02 '19 20:09

Apologies, that was a cut and paste of the wrong piece of code (I tried qrsh following some comments found on this issue tracker). I use addprocs_sge, but it doesn't succeed consistently. My script is easy enough:

using ClusterManagers
ClusterManagers.addprocs_sge(28, queue="hs22")

There are indeed 28 compute nodes available, but addprocs_sge never returns. I'll look for something similar to salloc, thanks.

dpo • Sep 02 '19 20:09

Yeah, so either you get stuck allocating forever or ~the time-out doesn't trigger~. Looks like qsub doesn't have a time-out. You may want to instrument this code:

https://github.com/JuliaParallel/ClusterManagers.jl/blob/e375f50f2c4eab3d3f4cefcea3465c82734cfb71/src/qsub.jl#L83
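Until the launch code has its own time-out, one option is to bound the wait from the calling side. The sketch below is an assumption about how one might do that, not something ClusterManagers.jl provides: `with_timeout` is a hypothetical helper that runs a function in a task and stops waiting after a given number of seconds.

```julia
# Hypothetical helper (not part of ClusterManagers.jl): run `f` in a
# background task and stop waiting after `seconds`.
function with_timeout(f, seconds::Real; pollint::Real = 0.1)
    t = @async f()
    # timedwait polls the predicate and returns :ok once it is true,
    # or :timed_out when the deadline passes first.
    status = timedwait(() -> istaskdone(t), Float64(seconds);
                       pollint = Float64(pollint))
    status === :ok || return nothing   # the task is NOT cancelled, only abandoned
    return fetch(t)                    # rethrows if `f` failed
end

# Quick local demonstration that needs no cluster:
println(with_timeout(() -> (sleep(0.1); :done), 2.0))

# Intended use (assumes ClusterManagers is loaded and "hs22" is a valid queue):
# pids = with_timeout(() -> ClusterManagers.addprocs_sge(28, queue="hs22"), 600)
# pids === nothing && @warn "addprocs_sge did not return within 10 minutes"
```

Note that this only stops the caller from blocking. Any jobs already submitted through qsub keep running and would still need to be removed by hand, for example with qdel.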

vchuravy • Sep 02 '19 20:09

Hi, I am encountering the same issue with Julia 1.3.1 and SGE 8.1.8.

Running: addprocs_sge(4, queue=$ProvideByAdmin)

I get:

Error launching workers
MethodError(iterate, (Base.ProcessChain(Base.Process[Process(`echo 'cd /home/alequa/Documents/Research/phd_project/simulations/tripod && /home/alequa/Documents/Research/julia-1.3.1/bin/julia --worker=sFj9Hl4l94yYKPoz'`, ProcessExited(0)), Process(`qsub -N julia-27738 -terse -j y -R y -t 1-1 -V -q single.q`, ProcessRunning)], Base.DevNull(), Base.PipeEndpoint(RawFD(0x00000019) open, 0 bytes waiting), Base.DevNull()),), 0x0000000000006890)
0-element Array{Int64,1}

But the jobs do start, and I can see them in qstat:

1552178 0.55500 julia-2773 alequa r 03/19/2020 18:56:59 [email protected] 1 1
1552178 0.55500 julia-2773 alequa r 03/19/2020 18:56:59 [email protected] 1 2
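The shape of the error, `MethodError(iterate, (<process object>,), ...)`, is what Julia raises when code tries to unpack a single process object into several variables. That would be consistent with the jobs being submitted successfully and the failure happening afterwards on the Julia side, although whether this is the actual cause inside ClusterManagers is an assumption here. A minimal reproduction that needs no cluster (it assumes a Unix-like system with an `echo` executable; `try_destructure` is a hypothetical name):

```julia
# `open` on a command returns one Process object. Unpacking it into two
# variables calls `iterate` on that object, which has no such method.
function try_destructure(cmd::Cmd)
    try
        a, b = open(cmd)   # destructuring a single Process
        return :ok
    catch err
        return err
    end
end

err = try_destructure(`echo hello`)
println(err isa MethodError && err.f === iterate)
```

If this is what happens in the launch code, the fix would be to keep the returned process object as a single value and read from it directly.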

Can you help? Thanks, Alessio

aquaresima • Mar 19 '20 17:03

These sporadic issues, where it sometimes works and sometimes doesn't, are quite annoying. I am experiencing something similar with the LSF manager. I will try to debug it in the next few days and will share here anything I learn that could be ported to the SGE manager.

juliohm • Oct 06 '20 20:10