
How to go from files to a distributed matrix

dh-ilight opened this issue 4 years ago · 3 comments

I have files, each holding one column of an array, and I would like to create an Elemental.DistMatrix from these files, loading the DistMatrix in parallel. An earlier question was answered by pointing to Elemental/test/lav.jl, and I made the following program by extracting from lav.jl. It works on a single node but hangs with 2 nodes when run via mpiexecjl. I am using Julia 1.5 on a 4-core machine running CentOS 7.5. Please let me know what is wrong with the program and how to load my column files in parallel. I intend to eventually run a program using DistMatrix on a computer with hundreds of cores.

```julia
# to import MPIManager
using MPIClusterManagers, Distributed

# Manage MPIManager manually -- all MPI ranks do the same work
# Start MPIManager
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

# Init an Elemental.DistMatrix
@everywhere function spread(n0, n1)
    println("start spread")
    height = n0*n1
    width = n0*n1
    h = El.Dist(n0)
    w = El.Dist(n1)
    A = El.DistMatrix(Float64)
    El.gaussian!(A, n0, n1) # how to init size ?
    localHeight = El.localHeight(A)
    println("localHeight ", localHeight)
    El.reserve(A, 6*localHeight) # number of queue entries
    println("after reserve")
    for sLoc in 1:localHeight
        s = El.globalRow(A, sLoc)
        x0 = ((s-1) % n0) + 1
        x1 = div((s-1), n0) + 1
        El.queueUpdate(A, s, s, 11.0)
        println("sLoc $sLoc, x0 $x0")
        if x0 > 1
            El.queueUpdate(A, s, s - 1, -10.0)
            println("after q")
        end
        if x0 < n0
            El.queueUpdate(A, s, s + 1, 20.0)
        end
        if x1 > 1
            El.queueUpdate(A, s, s - n0, -30.0)
        end
        if x1 < n1
            El.queueUpdate(A, s, s + n0, 40.0)
        end
        # The dense last column
        # El.queueUpdate(A, s, width, floor(-10/height))
    end # for
    println("before processQueues")
    El.processQueues(A)
    println("after processQueues") # with 2 nodes never gets here
    return A
end

@mpi_do manager begin
    using MPI, LinearAlgebra, Elemental
    const El = Elemental
    res = spread(4, 4)
    println("res=", res)

    # Manage MPIManager manually:
    # Elemental needs to be finalized before shutting down MPIManager
    # println("[rank $(MPI.Comm_rank(comm))]: Finalizing Elemental")
    Elemental.Finalize()
    # println("[rank $(MPI.Comm_rank(comm))]: Done finalizing Elemental")
end # mpi_do

# Shut down MPIManager
MPIClusterManagers.stop_main_loop(manager)
```
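
For concreteness, the kind of parallel loading I am eventually aiming for looks roughly like this, reusing only the calls from the program above. This is an untested sketch: the `col_<j>.bin` file names, the raw `Float64` layout read with `read!`, and the round-robin assignment of files to ranks are placeholders for my real data.

```julia
# Illustrative sketch: each MPI rank reads its share of the column files and
# queues the full columns; processQueues then routes every entry to its owner.
# Meant to be called from the same @mpi_do block as spread above.
@everywhere function load_columns(height, width)
    rank   = MPI.Comm_rank(MPI.COMM_WORLD)
    nranks = MPI.Comm_size(MPI.COMM_WORLD)
    A = El.DistMatrix(Float64)
    El.gaussian!(A, height, width)          # sizing trick from above; every entry is overwritten below
    mycols = [j for j in 1:width if (j - 1) % nranks == rank]  # round-robin the files over the ranks
    El.reserve(A, height * length(mycols))  # number of queue entries this rank will submit
    for j in mycols
        col = Vector{Float64}(undef, height)
        read!("col_$j.bin", col)            # placeholder: one raw Float64 column per file
        for i in 1:height
            El.queueUpdate(A, i, j, col[i])
        end
    end
    El.processQueues(A)
    return A
end
```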

Thank you

dh-ilight · Oct 11 '21 15:10

Based on a NERSC user ticket, which inspired #73. cc @andreasnoack

~~@dhiepler can you put the code snippet in a code block (put ```julia at the beginning and ``` at the end)~~

JBlaschke · Oct 11 '21 16:10

The program looks right to me. To debug this, I'd try removing the `MPIClusterManagers`/`Distributed` parts and running the script directly with `mpiexec`, like we do in https://github.com/JuliaParallel/Elemental.jl/blob/83089155659739fea1aae476c6fd492b1ee20850/test/runtests.jl#L19
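
Something along these lines could serve as a starting point (an untested sketch, saved as e.g. `test.jl`; it keeps only the diagonal updates from the loop above, which is enough to exercise `processQueues`):

```julia
# test.jl -- run directly under MPI, without MPIClusterManagers/Distributed
using MPI, LinearAlgebra, Elemental
const El = Elemental

function spread(n0, n1)
    A = El.DistMatrix(Float64)
    El.gaussian!(A, n0, n1)
    localHeight = El.localHeight(A)
    El.reserve(A, 6*localHeight)
    for sLoc in 1:localHeight
        s = El.globalRow(A, sLoc)
        El.queueUpdate(A, s, s, 11.0)   # diagonal entry only
    end
    El.processQueues(A)
    return A
end

spread(4, 4)
println("processQueues finished on rank ", MPI.Comm_rank(MPI.COMM_WORLD))
Elemental.Finalize()
```

If this also hangs, the problem is in Elemental (or the MPI setup) itself; if it doesn't, the interaction with MPIClusterManagers is the place to look.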

andreasnoack · Oct 12 '21 07:10

FTR @dhiepler, on Cori that would be

```
srun -n $NUM_RANKS julia path/to/test.jl
```
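
On a workstation, the analogous launch with MPI.jl's `mpiexecjl` wrapper (which the original script was started with) would presumably be:

```
mpiexecjl -n $NUM_RANKS julia path/to/test.jl
```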

JBlaschke · Oct 12 '21 21:10