Dagger.jl Game of Life example is slower with more processes than just one

Hi, I am looking into Dagger.jl to see if I can use it for speeding up cellular automata models. So, I was very happy to see an example of the stencil macro which implemented the game of life. However, when I test it myself with some small adjustments, it is much slower to use multiple processes than it is to just use one. Below is my code, I'm working in a jupyter notebook.

# Add workers before 'using Dagger'
using Distributed

if nprocs() == 1
    addprocs(5)
end

# Before we load the packages, we need make sure the environment is loaded correctly
@everywhere begin 
    using Pkg; Pkg.activate("juliaenv")
end

using Plots
using Dagger
import Dagger: @stencil, Wrap, Pad

N = 27 # Size of one dimension of a tile
nt = 3 # Number of tiles in each dimension (results in nt x nt grid of tiles)
niters = 100 # Number of iterations for the animation

tiles = zeros(AutoBlocks(), Bool, 512, 512)
outputs = zeros(AutoBlocks(), Bool, 512, 512)

# Create a fun initial state (e.g., a glider and some random noise)
tiles[13, 14] = true
tiles[14, 14] = true
tiles[15, 14] = true
tiles[15, 15] = true
tiles[14, 16] = true
# Add some random noise in one of the tiles
@view(tiles[(2N+1):3N, (2N+1):3N]) .= rand(Bool, N, N)

@time anim = @animate for _ in 1:niters
    Dagger.spawn_datadeps() do
        @stencil begin
            outputs[idx] = begin
                nhood = @neighbors(tiles[idx], 1, Wrap())
                neighs = sum(nhood) - tiles[idx] # Sum neighborhood, but subtract own value
                if tiles[idx] && neighs < 2
                    0 # Dies of underpopulation
                elseif tiles[idx] && neighs > 3
                    0 # Dies of overpopulation
                elseif !tiles[idx] && neighs == 3
                    1 # Becomes alive by reproduction
                else
                    tiles[idx] # Keeps its prior value
                end
            end
            tiles[idx] = outputs[idx] # Update tiles for the next iteration
        end
    end
    heatmap(Int.(collect(outputs))) # Generate a heatmap visualization
end
path = mp4(anim; fps=5, show_msg=true).filename # Create an animation of the heatmaps over time

To run it with only one process, I comment out

if nprocs() == 1
    addprocs(5)
end

On a Macbook M3 Pro, it takes about 6 seconds to run the second cell again, while with more workers, it takes around 23. I'm using Julia 1.11.6.

Now for my question(s): Is the overhead of Dagger.jl causing the slowdown here? If yes, at what point would this Game of Life simulation benefit from Dagger.jl? When employing an even bigger grid? Or am I doing something wrong that causes the slowdown?

Aug 05 '25 13:08 josia-pool

The issue is primarily the cost of using Distributed workers, which don't use shared memory and thus incur a large overhead from data serialization (and is heavily reliant on the scheduler doing a good job, which it sometimes does not do). For the case of running with multithreading instead (which doesn't have those problems), @pszufe and I found that performance was still not great, but that's due to bounds checking in sum and in tiles[idx] causing a ton of overhead, due to how the stencils access memory. @pszufe is planning to put together a PR so that we can properly disable boundschecking for stencils and hopefully achieve very good performance.

Aug 05 '25 23:08 jpsamaroo

Thanks for your explanation! With multiple threads it is indeed faster than just one as you would expect. So if I understand correctly, when using Distributed workers the DArray is distributed between workers? But, once you need part of the array from another worker, when you are at the boundary of the block for example, that part is serialized and transferred?

I'm curious to see how much the performance will increase once the bounds checking is properly disabled.

Aug 06 '25 09:08 josia-pool

FYI. For perfomance testing I use the code below (note you need to call @time twice as the first call is precompilation. The precompilation can be pretty long for distributed code.

I run this as

$ julia --check-bounds=no -t 10 --project=. GoL.jl

$ cat GoL.jl

using Distributed

println("Running $(nprocs()) procs")

using Dagger
import Dagger: @stencil, Wrap, Pad

N = 27 # Size of one dimension of a tile
niters = 100 # Number of iterations for the animation

function init(N)
    tiles = zeros(AutoBlocks(), Bool, 2048, 2048)
    outputs = zeros(AutoBlocks(), Bool, 2048, 2048)
    tiles[13, 14] = true
    tiles[14, 14] = true
    tiles[15, 14] = true
    tiles[15, 15] = true
    @view(tiles[(2N+1):3N, (2N+1):3N]) .= rand(Bool, N, N)
    (;tiles,outputs)
end
function run(tiles, outputs, niters)
    anim_data = Int[] # just calculate live counts
    for _ in 1:niters
        Dagger.spawn_datadeps() do
            @stencil begin
                outputs[idx] = begin
                    nhood = @neighbors(tiles[idx], 1, Wrap())
                    accum = zero(Int) #zero(eltype(nhood)) # should be Int instead?
                    for ii in eachindex(nhood)
                        accum += @inbounds nhood[ii]
                    end
                    neighs = accum - @inbounds tiles[idx]
                    if @inbounds tiles[idx] && neighs < 2
                        0 # Dies of underpopulation
                    elseif @inbounds tiles[idx] && neighs > 3
                        0 # Dies of overpopulation
                    elseif !(@inbounds tiles[idx]) && neighs == 3
                        1 # Becomes alive by reproduction
                    else
                        @inbounds tiles[idx] # Keeps its prior value
                    end
                end
                tiles[idx] = @inbounds outputs[idx] # Update tiles for the next iteration
            end
        end
        #push!(anim_data, sum(collect(outputs)))
    end
    anim_data
end
@time run(init(N)...,1)
@time run(init(N)...,niters)

Aug 06 '25 15:08 pszufe