
Understanding Elemental's Performance

Open • JBlaschke opened this issue 4 years ago • 7 comments

Hi,

I am trying to understand the performance of the program below at NERSC. It is basically the same as the example in the README.md, except that addprocs currently doesn't work for me, so I am running the MPIClusterManager manually via start_main_loop and stop_main_loop.
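
For reference, the README-style addprocs approach that I could not get to work would look roughly like this (a sketch of the same computation driven through MPIManager/addprocs, not the exact README code):

using MPIClusterManagers, Distributed

# Spawn 4 MPI worker ranks and attach them as Distributed workers
man = MPIManager(np = 4)
addprocs(man)

# Load the packages on every worker
@everywhere using LinearAlgebra, Elemental

@mpi_do man A = Elemental.DistMatrix(Float64);
@mpi_do man Elemental.gaussian!(A, 4000, 4000);   # fixed size here just for illustration
@mpi_do man @time U, s, V = svd(A);

Here is the (manual) version I am actually running: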

N = parse(Int64, ARGS[1])

# MPIClusterManagers provides start_main_loop / stop_main_loop and @mpi_do
using MPIClusterManagers

# Distributed is used by the cluster-manager machinery (addprocs itself isn't used here)
using Distributed

# Manage MPIManager manually -- all MPI ranks do the same work
# Start MPIManager
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

@mpi_do manager begin
    using MPI
    comm = MPI.COMM_WORLD
    println(
            "Hello world,"
            * " I am $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))"
            * " on node $(gethostname())"
           )

    println("[rank $(MPI.Comm_rank(comm))]: Importing Elemental")
    using LinearAlgebra, Elemental
    println("[rank $(MPI.Comm_rank(comm))]: Done importing Elemental")

    println("[rank $(MPI.Comm_rank(comm))]: Solving SVD for $(N)x$(N)")
end

# Build an N x N Gaussian DistMatrix, time the SVD, and print the first singular value
@mpi_do manager A = Elemental.DistMatrix(Float64);
@mpi_do manager Elemental.gaussian!(A, N, N);
@mpi_do manager @time U, s, V = svd(A);
@mpi_do manager println(s[1])

# Manage MPIManager manually:
# Elemental needs to be finalized before shutting down MPIManager
@mpi_do manager begin
    println("[rank $(MPI.Comm_rank(comm))]: Finalizing Elemental")
    Elemental.Finalize()
    println("[rank $(MPI.Comm_rank(comm))]: Done finalizing Elemental")
end
# Shut down MPIManager
MPIClusterManagers.stop_main_loop(manager)
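
The timings below come from @time around svd(A), as in the script above. A tighter variant I may switch to (a sketch, assuming the same manager, comm, A, and N as above, and placed before the Elemental.Finalize block) is to barrier first, time only the call, and take the maximum across ranks, since the slowest rank sets the wall time:

@mpi_do manager begin
    MPI.Barrier(comm)                       # line all ranks up before timing
    t = @elapsed svd(A)                     # time only the SVD call itself
    tmax = MPI.Allreduce(t, MPI.MAX, comm)  # slowest rank determines the wall time
    if MPI.Comm_rank(comm) == 0
        println("svd of $(N)x$(N): $(tmax) s (max over ranks)")
    end
end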

I ran some strong scaling tests on 4 Intel Haswell nodes (https://docs.nersc.gov/systems/cori/#haswell-compute-nodes) using 4000x4000, 8000x8000, and 16000x16000 random matrices.

[chart: measured strong-scaling svd(A) times for the three matrix sizes]

I am measuring only the svd(A) time. I am attaching my measured times and wanted to check whether this is what you would expect. I am not an expert in how Elemental computes SVDs in a distributed fashion, so I would be grateful for any advice you have on optimizing this benchmark's performance. In particular, I am interested in understanding what the optimal number of ranks is as a function of problem size (I am hoping this is such an obvious question that you can point me to some existing documentation).

Cheers!

JBlaschke • Oct 03 '21

First, it might be useful to confirm that the same pattern shows up when you try a C++ version of this problem.

andreasnoack • Oct 05 '21

That was what I was thinking. Unfortunately, I am not familiar with how to use Elemental, and the docs hosting seems to be broken (and I can't find the docs sources either). Do you know where I can find a copy of the full docs? I am looking for the C++ equivalents of Elemental.DistMatrix, Elemental.gaussian!, and svd, so that I can replicate the example above in C++.

I am able to build libEl, though.

Cheers, Johannes

JBlaschke • Oct 06 '21

It looks like you can still browse the HTML version of the documentation, although it doesn't render correctly. I think the best place for you to look is https://github.com/LLNL/Elemental/blob/hydrogen/tests/lapack_like/SVD.cpp#L157. It should be possible to adapt that test into something similar to the example above.

andreasnoack • Oct 06 '21

Thanks for the blob link -- I'll try to understand it given the docs that I can find. At this point I only understand about 10% of it. By the way, not all of the docs can be browsed: https://elemental.github.io/documentation/0.85/core/dist_matrix.html

JBlaschke • Oct 06 '21

The source for the documentation is at https://github.com/elemental/elemental-web. I've asked your colleague at LLNL if they could start hosting the docs since they are already maintaining the fork of Elemental, https://github.com/LLNL/Elemental/issues/80#issuecomment-937447310.

andreasnoack • Oct 07 '21

Thanks! I'll also look into hosting that locally.

FTR: NERSC is at LBNL, and LBNL != LLNL. It's a common misunderstanding, and we are all friends.

JBlaschke • Oct 07 '21

I had the pleasure of spending some days at NERSC a couple of years ago, while working on a project where we ran Julia code on Cori, so I'm well aware that they are two different labs. The "colleagues" was meant in the sense that you are both under DOE. The folks at Livermore forked Elemental a couple of years ago, so it would make sense for them to host the documentation, but if you don't mind doing it, that would also be great.

andreasnoack • Oct 07 '21