BlackBoxOptim.jl icon indicating copy to clipboard operation
BlackBoxOptim.jl copied to clipboard

WIP/RFC: parallel population optimizer

Open alyst opened this issue 8 years ago • 15 comments

This is an attempt to implement some parallelization. Of course, the most straightforward way would be to parallelize fitness calculation in rank_by_fitness!(), but since Julia doesn't have threads [yet], it would mean lots of data sending/synchronization overhead. So the goal was to have something that require less messaging and synchronization points.

ParallelPopulationOptimizer basically starts N independent population optimizers on worker processes. After migrationPeriod steps each worker randomly selects migrationSize individuals and sends them to the master process. Master process just constantly listens to the incoming immigrants and distributes them among the other (N-1) workers (+ collects the best ones). The acceptance of the immigrants by the workers happens through the standard tell!() interface: an immigrant replaces some random "aborigine" if it is more fit.

Parallelization is implemented using native Julian RemoteRef, Channel and Task mechanisms. Channels were added only in Julia v0.4, so it would not work on v0.3.

It's definitely not for production use yet, because there's no handling of exceptions in the worker processes. Also, I don't know how efficient is this scheme in terms of convergence, and whether there's a way to improve it. At least in my experiments I can run 8 workers in parallel, and immigrants acceptance rate is around 20%.

alyst avatar Jul 20 '15 16:07 alyst

Interesting. This is a classic Master/Slave optimization scheme then and might be useful and should be able to speed up convergence (vs time). However, presently it fails the travis CI build (0.3 issue?) so not sure how we handle this right now.

robertfeldt avatar Jul 21 '15 05:07 robertfeldt

Indeed it does speed up the convergence (interestingly, contrary to what R's DEoptim recommends (population size = 10 x number of parameters), for my problem it works best with small (50) population on each worker).

I guess it will never run on 0.3, because it misses the key functionality. Since the PR uses MessageUtils master, which has no tagged version, it is not straightforward to run on 0.4 either. When/if the new parallel building blocks would land to Julia master, I will update the PR. It could be disabled on 0.3 to make the tests pass.

So it's not ready to merge, but I though it's worth publishing as PR to provide some parallelization starting point for those that might be interested.

alyst avatar Jul 21 '15 09:07 alyst

Parallelization is great.

The functions I optimize generally use a lot of temporary memory, so I generally add a set preallocated arrays as an argument. When optimizing the function, I preallocate arrays, and then feed bboptimize with a closure

pa = PreallocateArray()
bboptimize(x -> f(x, pa))

I'm not sure how closures work with parallel workers. One solution would be to recreate arrays at every iterations

bboptimize(x -> f(x, PreallocateArray()))

but maybe there's a better way. I'd be great if this parallel solution handled these cases

matthieugomez avatar Jul 21 '15 13:07 matthieugomez

@matthieugomez I think we should be able to safely combine preallocation and parallelization -- we just need to be sure that each worker receives its own copy of "functional object" f. Actually, I'm already using similar approach that combines well with parallelization. For my problem I define a BBF <: OptimizationProblem type. It requires a few trivial methods: fitness_scheme(::BBF), name(::BBF), numobjectives(::BBF), numdims(::BBF), search_space(::BBF), and, most importantly, fitness(x::Vector{Float64}, f::BBF) function, i.e. my f(x). In fitness(f, x) I can take advantage of whatever data or temporary arrays that are stored in BBF. The important point is that temporary array allocation happens on-demand, i.e. freshly created f::BBF object does not have any temporary arrays, so its copies could be safely distributed among the workers, and then each worker would have its own temporary memory automatically allocated when f(x) is first called.

In v0.4 it could be even simpler. You can define Base.call(f::F, x) method and then you can do f(x) without any need to create closures, which currently have poor performance in Julia. The only problem is that f would not be of type Function and e.g. Optim methods do not accept such arguments, but that should be easy to fix.

alyst avatar Jul 21 '15 13:07 alyst

Great. Thanks for the explanation.

matthieugomez avatar Jul 21 '15 14:07 matthieugomez

Suggestion: Allow multiple different WorkerMethods. That might increase robustness on some problems (since a single optimizer is never best for all problems) and might also help convergence (worse optimizers might get out of local optima by "inspiration" from a better optimizer etc). Would be very nice if one can have also non-population optimizers in there since some of them are very much quicker on "simple" problems and it is well known that DE benefits from early "help" with a few good solutions in the population.

robertfeldt avatar Jul 21 '15 14:07 robertfeldt

Definitely that would be very nice to have. But that requires more advanced communication to solve the load balancing problems, because the current scheme assumes migration rates are the same for all workers. Ultimately, it would be nice to have something like a parallelized version of Amalgam.

alyst avatar Jul 21 '15 16:07 alyst

Ok, yes.

Amalgam is very nice but parallelism on a lower level then. We should experiment with a eval parallel "block" size that we do pmap on at some point. If we are only sending numdims floats over in a vector I guess it should be ok for most but the very simple fitness functions.

robertfeldt avatar Jul 21 '15 19:07 robertfeldt

Hi

I am wondering if an MPI implementation of the above idea is welcome? if so I tried to find 'ParallelPopulationOptimizer' code but somehow I failed when forked Alexey Stukalov rep;

how can I get a copy/clone of the relevant files? best, steve

steven-varga avatar Jul 24 '15 11:07 steven-varga

@steven-varga Hi, if you want to distribute the calculations over several machines, MPI could be a way to go (you can also consider ZeroMQ or use non-local cluster managers -- the latter should require minimal modifications to the current code). For one machine I think the current approach is better, because it uses native Julia task scheduling and data messaging.

If you want to test this pull request in Julia, you can try

Pkg.add("MessageUtils")
Pkg.checkout("MessageUtils") # parallel optimizer requires the master of MessageUtils
Pkg.clone("https://github.com/alyst/BlackBoxOptim.jl.git")
Pkg.checkout("BlackBoxOptim", branch="parallel_pop_opt")

But be warned that the code is still experimental. Also I've rebased it on top of new BlackBoxOptim API, but haven't tested yet. Pls let me know if you have further questions.

alyst avatar Jul 24 '15 12:07 alyst

I've updated the code to use Channels very recently introduced into Julia 0.4-dev, so MessageUtils are no longer required. However, the "smoketest" using parallel optimizer stalls for some yet-to-be-investigated reason.

alyst avatar Aug 06 '15 20:08 alyst

Updated to use recently introduced type-parameterized RemoteRefs and parallel exception handling. The tests now pass on 0.4-nightly (the current CI failure is due to XNES smoketest randomly failed). It starts to be more usable as now there should be no stalls due to worker threads silently crashing.

alyst avatar Aug 10 '15 23:08 alyst

@alyst This can now be closed, right, since it was superseeded by the ParallelEvaluator code?

robertfeldt avatar Nov 28 '15 10:11 robertfeldt

@robertfeldt ParallelEvaluator covers methods that generate multiple candidates at once, e.g. NES. This master/slave optimizer [better] fits DE-like algorithms. So both can co-exist. I had not rebased the PR recently (now that ParallelEvaluator is in, I would do it shortly), but it's in a working state, I'm using it in combination with DE for my problems.

I would leave it as PR. At least it gives a starting point for parallelization of multiple different optimizers.

alyst avatar Nov 28 '15 10:11 alyst

Ok, yes, I remember now. Ok lets leave this as PR for now. Great if you can take a look at the example of parallel eval so we can help guide people on its use. I guess we should use a NES alg in it, then.

robertfeldt avatar Nov 28 '15 10:11 robertfeldt