Game of Life example is slower with more processes than just one
Hi, I am looking into Dagger.jl to see if I can use it for speeding up cellular automata models. So, I was very happy to see an example of the stencil macro which implemented the game of life. However, when I test it myself with some small adjustments, it is much slower to use multiple processes than it is to just use one. Below is my code, I'm working in a jupyter notebook.
# Add workers before 'using Dagger'
using Distributed
if nprocs() == 1
addprocs(5)
end
# Before we load the packages, we need make sure the environment is loaded correctly
@everywhere begin
using Pkg; Pkg.activate("juliaenv")
end
using Plots
using Dagger
import Dagger: @stencil, Wrap, Pad
N = 27 # Size of one dimension of a tile
nt = 3 # Number of tiles in each dimension (results in nt x nt grid of tiles)
niters = 100 # Number of iterations for the animation
tiles = zeros(AutoBlocks(), Bool, 512, 512)
outputs = zeros(AutoBlocks(), Bool, 512, 512)
# Create a fun initial state (e.g., a glider and some random noise)
tiles[13, 14] = true
tiles[14, 14] = true
tiles[15, 14] = true
tiles[15, 15] = true
tiles[14, 16] = true
# Add some random noise in one of the tiles
@view(tiles[(2N+1):3N, (2N+1):3N]) .= rand(Bool, N, N)
@time anim = @animate for _ in 1:niters
Dagger.spawn_datadeps() do
@stencil begin
outputs[idx] = begin
nhood = @neighbors(tiles[idx], 1, Wrap())
neighs = sum(nhood) - tiles[idx] # Sum neighborhood, but subtract own value
if tiles[idx] && neighs < 2
0 # Dies of underpopulation
elseif tiles[idx] && neighs > 3
0 # Dies of overpopulation
elseif !tiles[idx] && neighs == 3
1 # Becomes alive by reproduction
else
tiles[idx] # Keeps its prior value
end
end
tiles[idx] = outputs[idx] # Update tiles for the next iteration
end
end
heatmap(Int.(collect(outputs))) # Generate a heatmap visualization
end
path = mp4(anim; fps=5, show_msg=true).filename # Create an animation of the heatmaps over time
To run it with only one process, I comment out
if nprocs() == 1
addprocs(5)
end
On a Macbook M3 Pro, it takes about 6 seconds to run the second cell again, while with more workers, it takes around 23. I'm using Julia 1.11.6.
Now for my question(s): Is the overhead of Dagger.jl causing the slowdown here? If yes, at what point would this Game of Life simulation benefit from Dagger.jl? When employing an even bigger grid? Or am I doing something wrong that causes the slowdown?
The issue is primarily the cost of using Distributed workers, which don't use shared memory and thus incur a large overhead from data serialization (and is heavily reliant on the scheduler doing a good job, which it sometimes does not do). For the case of running with multithreading instead (which doesn't have those problems), @pszufe and I found that performance was still not great, but that's due to bounds checking in sum and in tiles[idx] causing a ton of overhead, due to how the stencils access memory. @pszufe is planning to put together a PR so that we can properly disable boundschecking for stencils and hopefully achieve very good performance.
Thanks for your explanation! With multiple threads it is indeed faster than just one as you would expect. So if I understand correctly, when using Distributed workers the DArray is distributed between workers? But, once you need part of the array from another worker, when you are at the boundary of the block for example, that part is serialized and transferred?
I'm curious to see how much the performance will increase once the bounds checking is properly disabled.
FYI. For perfomance testing I use the code below (note you need to call @time twice as the first call is precompilation. The precompilation can be pretty long for distributed code.
I run this as
$ julia --check-bounds=no -t 10 --project=. GoL.jl
$ cat GoL.jl
using Distributed
println("Running $(nprocs()) procs")
using Dagger
import Dagger: @stencil, Wrap, Pad
N = 27 # Size of one dimension of a tile
niters = 100 # Number of iterations for the animation
function init(N)
tiles = zeros(AutoBlocks(), Bool, 2048, 2048)
outputs = zeros(AutoBlocks(), Bool, 2048, 2048)
tiles[13, 14] = true
tiles[14, 14] = true
tiles[15, 14] = true
tiles[15, 15] = true
@view(tiles[(2N+1):3N, (2N+1):3N]) .= rand(Bool, N, N)
(;tiles,outputs)
end
function run(tiles, outputs, niters)
anim_data = Int[] # just calculate live counts
for _ in 1:niters
Dagger.spawn_datadeps() do
@stencil begin
outputs[idx] = begin
nhood = @neighbors(tiles[idx], 1, Wrap())
accum = zero(Int) #zero(eltype(nhood)) # should be Int instead?
for ii in eachindex(nhood)
accum += @inbounds nhood[ii]
end
neighs = accum - @inbounds tiles[idx]
if @inbounds tiles[idx] && neighs < 2
0 # Dies of underpopulation
elseif @inbounds tiles[idx] && neighs > 3
0 # Dies of overpopulation
elseif !(@inbounds tiles[idx]) && neighs == 3
1 # Becomes alive by reproduction
else
@inbounds tiles[idx] # Keeps its prior value
end
end
tiles[idx] = @inbounds outputs[idx] # Update tiles for the next iteration
end
end
#push!(anim_data, sum(collect(outputs)))
end
anim_data
end
@time run(init(N)...,1)
@time run(init(N)...,niters)