Distributed without MPI and Threading
Hi all,
A few more questions about Pigeons. I am trying to use Pigeons on my university's cluster, which uses Slurm.
- I see no performance difference between using threads and not using threads. I have started Julia with 4 threads and am using 4 chains. I didn't expect a four-fold speedup, but I expected at least some difference. Perhaps importantly, my problem involves a closure that captures a large object and performs a memory-intensive operation, the inverse of a matrix:
log_likelihood = let large_obj = large_obj
    input_data -> memory_intensive_function(input_data, large_obj)
end
When should I expect performance improvements from threading? Usually I think of threading as useful for things like speeding up loops and matrix operations, not for operations as complicated as MCMC. What is Pigeons doing when multithreaded is set to true?
- I'm trying to do distributed computing. I am trying, and failing, to run
setup_mpi, getting the error
julia> Pigeons.setup_mpi(; submission_system = :slurm)
MPI library could not be found with the following name(s):
["libmpi", "libmpi_ibm", "msmpi", "libmpich", "libmpi_cray", "libmpitrampoline"]
(I have no idea what the MPI library is called on my cluster, but am working on it).
But more generally, why is MPI the only option for distributed computing? Why does Pigeons.jl not support distributed programming using the Distributed library? I don't think I can expect users of my package to configure MPI in order to take advantage of parallelization.
Thank you,
Peter
For the first point...
I see no performance difference between using threads and not using threads.
To help troubleshoot, try the following which should work out of the box:
pigeons(target = toy_mvn_target(1000), explorer = AutoMALA())
pigeons(target = toy_mvn_target(1000), explorer = AutoMALA(), multithreaded = true)
On my Mac laptop and launching Julia with 8 threads, I get 6.82s for the last round single threaded and 1.74s multithreaded. Do you see similar improvement? If not, maybe make sure the Julia process is indeed launched with several threads (via julia --threads 8).
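A quick way to confirm the running session actually has the threads you asked for:

```julia
# Prints the number of threads the current Julia session was started with;
# it should match the `--threads` flag (e.g. `julia --threads 8` prints 8).
using Base.Threads
println(Threads.nthreads())
```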
What is Pigeons doing when multithreaded is set to true?
The explorer for each chain will do the exploration in parallel.
When should I expect performance improvements from threading?
Since exploration is often the bottleneck and no communication between processes is needed in that phase, it is often helpful in my experience. But as always, there is an overhead to multithreading, so it is good to double-check that it indeed helps.
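Conceptually (a simplified sketch, not Pigeons' actual internals), the multithreaded exploration phase amounts to something like:

```julia
using Base.Threads

# Hypothetical stand-in for one chain's exploration step: takes the chain's
# state and returns an updated state (with a small artificial delay).
explore!(state) = (sleep(0.01); state .+ randn(length(state)))

states = [randn(3) for _ in 1:4]  # one state vector per chain

# Single-threaded: chains are explored one after another.
for i in eachindex(states)
    states[i] = explore!(states[i])
end

# Multithreaded: each chain's exploration runs on its own thread, which is
# safe here because each iteration touches a distinct index.
@threads for i in eachindex(states)
    states[i] = explore!(states[i])
end
```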
As for why it is not helping in your specific case, I wonder if the memory intensive aspect is maybe the cause.
Also just to make sure, in your example, large_obj is read only, correct?
Another suggestion for you: in a special case such as yours, where the number of chains is smaller than or equal to the number of cores, the following might work well (it uses a local MPI but does not need setup_mpi, i.e., it can run even on a laptop).
pigeons(target = toy_mvn_target(1000), explorer = AutoMALA(), on = ChildProcess(n_local_mpi_processes=10))
E.g., in my case I have 10 cores on my laptop so that actually ran slightly faster, 1.25s (by default, pigeons uses 10 chains).
For the second point...
(I have no idea what the MPI library is called on my cluster, but am working on it).
It might be necessary to load a module or something of that sort.
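For instance, something along these lines might work (the module name is an assumption; `module avail` lists what your cluster actually provides):

```shell
# On many Slurm clusters the MPI library only becomes visible to the linker
# after loading a module; then setup_mpi can find it.
module load openmpi
julia -e 'using Pigeons; Pigeons.setup_mpi(; submission_system = :slurm)'
```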
But more generally, why is MPI the only option for distributed computing?
We started with MPI because it can be more efficient than TCP/IP, which is what the Distributed library primarily uses. The MPI primitives are also a more natural fit for the specific task of writing distributed PT.
Why does Pigeons.jl not support distributed programming using the Distributed library? I don't think I can expect users of my package to configure MPI in order to take advantage of parallelization.
That's a very good suggestion, and I would be interested in adding that functionality. Could we stay in touch about this? Since I have not used Distributed personally, it would be helpful to know how a user would expect Pigeons to work in the context of that package. I have seen somewhere a library built on top of Distributed.jl that provides a mapping to MPI-style APIs, but I can't recall what it is called. Does anybody know that library by any chance?
Ah! found it, it is part of DistributedArrays.jl
Using that, it might not be too hard to generalize our codebase to use either MPI or Distributed.jl
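As a rough illustration of what the Distributed.jl side could look like (hypothetical, not an API Pigeons currently exposes), the chain-level work would be farmed out over worker processes instead of MPI ranks:

```julia
using Distributed
addprocs(4)  # or start Julia with `julia -p 4`

# Hypothetical stand-in for one chain's exploration round, defined on all
# workers so pmap can call it remotely.
@everywhere explore_round(chain) = sum(abs2, randn(1000))

# One entry per chain, computed in parallel across the worker processes.
results = pmap(explore_round, 1:8)
```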
Thoughts?
Sorry for the late reply! It took me a few days to get a grip on things.
Threading: I can confirm that the toy_mvn_target gets a speed-up from multi-threading. However, my actual use-case still shows no improvement when I run with multi-threading, on both my personal laptop and the cluster. I guess it's not that big a deal that threading doesn't improve performance, and it probably has to do with the fact that we take the inv of a decent-sized matrix inside each MCMC iteration.
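One possible factor here (an assumption on my part, not verified): inv calls into BLAS, which is itself multithreaded by default, so the cores may already be saturated before the chain-level threads get a chance to help. It is easy to check and limit the BLAS thread count:

```julia
using LinearAlgebra

# How many threads BLAS is currently using for operations like `inv`.
println(BLAS.get_num_threads())

# Limit BLAS to one thread so Julia-level threads (one per chain) get the cores.
BLAS.set_num_threads(1)
```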
Distributed computing: On my laptop I installed openmpi and could get parallel computing working with ChildProcess. On the cluster, I was able to get ChildProcess parallelization to work after module load openmpi.
Additionally, on my cluster (at Boston University), I have to start a "special" interactive session in order to run ChildProcess. See here. (This actually might not be true, and I might be getting confused with MPIProcesses.)
However I cannot get MPIProcesses to work. It looks like Pigeons is not creating a submission_script.sh file that it uses to spawn MPI sessions? The error is as follows:
ERROR: IOError: could not spawn `sbatch /usr3/graduate/peterwd/Documents/Projects/HighDimensionalOptimalPolicies.jl/examples/OptimalTransport/results/all/2025-04-02-14-53-47-6ap9bBnD/.submission_script.sh`: no such file or directory (ENOENT)
Might this be a bug in Pigeons? The codebase of Pigeons.jl doesn't show any reference to creating a .submission_script.sh file.
However, I do see a reasonable speed-up with ChildProcess parallelization. Running on 8 child processes, I get a speedup of a factor of 4 relative to running on a single process. I'm not sure why the speedup isn't higher, but I guess this is okay.
Revise.jl: In working on this distributed computing, I learned that Revise.jl doesn't seem to work when I'm also using ChildProcess. This might be due to my workflow: I'm trying to do all my analysis in a separate module (i.e., created with generate), called OptimalTransport. I need to add OptimalTransport as a dependency of the child processes, which I do like so.
Solver function
function run_solver(net, β)
    initfun, objfun, nextfun = let net = net
        initfun = rng -> get_initial_upgrade(rng, net)
        objfun = edges_to_upgrade -> begin
            new_net = get_upgraded_network(edges_to_upgrade, net)
            welfare(new_net)
        end
        nextfun = (rng, edges_to_upgrade) -> swap_edges_to_upgrade(rng, edges_to_upgrade)
        initfun, objfun, nextfun
    end
    out = HDOP.get_best_policy(
        HDOP.PigeonsSolver();
        initfun,
        objfun,
        nextfun,
        β,
        # Sent to pigeons
        checkpoint = true,
        multithreaded = true,
        n_chains = 8,
        on = MPIProcesses(
            n_mpi_processes = 8,
            n_threads = 8,
            dependencies = [HighDimensionalOptimalPolicies, OptimalTransport, Plots]
        ),
    )
    out = @set out.pt = Pigeons.load(out.pt)
    last_policy = HDOP.get_last_policy(out)
    average_policy = HDOP.get_average_policy(out)
    p = plot_network(net; average_edges_to_upgrade = average_policy)
    # display(p)
    (; net, out)
end
I think, importantly, initfun, objfun, and nextfun are anonymous functions defined locally inside a function. I think there is some world-age issue, such that the OptimalTransport module that gets passed as a dependency is not the same one currently being edited (due to something about Revise modifying the module). So when I use Revise, the anonymous function initfun will be different, and I get an UndefVarError about #104 not defined or similar.
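One workaround I might try (a sketch, assuming the problem really is that anonymous functions are identified by gensym'd types): replace the closures with named top-level callables, which Revise can track and which are identified by name on the child processes. The helper functions below are stand-ins for my package's functions, so the sketch is self-contained:

```julia
# Stand-ins (assumptions) for the package's real functions:
get_upgraded_network(edges, net) = net .+ edges
welfare(net) = sum(net)

# A callable struct defined at module top level: it is identified by its
# name rather than by an anonymous-function type like `#104`.
struct ObjFun{N}
    net::N
end
(f::ObjFun)(edges_to_upgrade) = welfare(get_upgraded_network(edges_to_upgrade, f.net))

objfun = ObjFun([1.0, 2.0, 3.0])
objfun([0.0, 0.0, 0.0])  # behaves like the original closure over `net`
```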
Distributed.jl My goal is as follows: An economist without a super advanced knowledge of cluster computing (i.e. does not know what MPI is) has a high-dimensional optimization problem (probably with discrete inputs) and they want to use our method (which uses Parallel Tempering) to find a set of optimal policies.
I am trying to build on top of Pigeons.jl to create a simple API for them to do that. This is perhaps overly optimistic, but I was really hoping I would be able to do something as simple as the following:
julia -p 8
using HighDimensionalOptimalPolicies # loads Distributed.jl
get_best_policy(num_cores = 8)
and then the algorithm would be split across 8 cores. The key value of Distributed.jl was that the user wouldn't have to worry about module load openmpi or setup_mpi(), the latter of which I still haven't gotten to work.
Of course, I didn't anticipate that multi-threading would be useful in some instances, or that MPI would be a better protocol for Parallel Tempering than the TCP/IP that Distributed.jl uses. So I think I was a bit naive going into this project.
So some concrete questions for you:
- How can I get MPIProcesses working? Am I correct that this might be a Pigeons.jl bug?
- How can I get Revise.jl working?
- Do you have any tips for how to make a package wrapping Pigeons.jl more user-friendly?
Sorry for such a long post. I'm going to tag @nikola-sur, so as not to burden you with too much of this.
Best,
Peter
Hi Peter,
Sorry for the delay! End of term has been crazy busy...
Distributed computing: On my laptop I installed openmpi and could get parallel computing working with ChildProcess
For ChildProcess-style MPI (i.e., all on the same physical machine, and using the argument n_local_mpi_processes = ...), it should not be necessary to install openmpi, as MPI.jl ships with its own binaries inside.
Additionally, with my cluster (at Boston University), I also have to start a "special" interactive session in order to run ChildProcess
If running via ChildProcess-style MPI on the cluster (again, meaning all processes on the same physical machine, and using the argument n_local_mpi_processes = ...), all you really need is to request enough processes. One way to achieve this is indeed an interactive session with enough processes, but there are other ways to do so.
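For example, something like the following might do it (the flags are an assumption; check your cluster's Slurm documentation):

```shell
# Request an interactive session with enough cores on a single node for a
# ChildProcess-style run with n_local_mpi_processes = 10.
srun --pty --nodes=1 --ntasks=1 --cpus-per-task=10 bash
```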
However I cannot get MPIProcesses to work.
In response to this thread and another one, I have pushed a bunch of updates on MPIProcesses. Release 0.4.9 from 4 days ago fixes some issues.
Revise.jl: In working on this distributed computing, I learned that Revise.jl doesn't seem to work
That's a good point, and I don't think Revise has this kind of feature. But ChildProcess will by design have to reload the packages, so I don't think Revise would be needed on those. (I.e., make sure Revise is not in your project's Project.toml, but rather in the global environment, and loaded just before you activate your current project/package; I think that's the recommended way of using Revise.) Personally, I have a special alias, "j", which differentiates the way I call Julia interactively from how it gets called programmatically (e.g. when Pigeons starts a child process, or via Distributed.jl or some other means).
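A hypothetical version of such an alias (the exact flags are my own choice, not a Pigeons recommendation), e.g. in ~/.bashrc:

```shell
# Interactive sessions get Revise (from the global environment) and threads;
# programmatic invocations of plain `julia` are unaffected.
alias j='julia --threads auto -i -e "using Revise"'
```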
How can I get MPIProcesses working?
First of all, note that this is only needed if the number of processes you need exceeds the number of cores in one machine. Second, if you do need that, as mentioned above my next recommendation would be to try 0.4.9. If that does not work, maybe we can do a Zoom call?
How can I get Revise.jl working?
See above.
Do you have any tips for how to make a package wrapping Pigeons.jl more user-friendly?
I think what you are describing above should be completely achievable. I can have a more detailed look if you want. Is there a repo available, with instructions to run the main function that needs more parallelism? Also let me know if you are looking for more cores than what fits in one machine on your cluster.
Best, Alex