Nested loop example revisited - plain Threads look as fast as Dagger
It seems that the nested loop example from the docs can be given the following implementation with Threads:
```julia
using Dagger, Random, Distributions, StatsBase, DataFrames

function f(dist, len, reps, σ)
    v = Vector{Float64}(undef, len) # preallocate once and reuse across reps to avoid allocations
    maximum(mean(rand!(dist, v)) for _ in 1:reps) / σ
end

function experiments_threads(dists, lens, K=1000)
    res = DataFrame()
    @sync for T in dists
        dist = T()
        σ = Threads.@spawn std(dist)
        for L in lens
            # fetch(σ) runs inside the spawned task, not in the loop body
            z = Threads.@spawn f(dist, L, K, fetch(σ))
            push!(res, (; T, σ, L, z))
        end
    end
    res.z = fetch.(res.z)
    res.σ = fetch.(res.σ)
    res
end

function experiments_dagger(dists, lens, K=1000)
    res = DataFrame()
    @sync for T in dists
        dist = T()
        σ = Dagger.@spawn std(dist)
        for L in lens
            # σ is passed as an unevaluated thunk; Dagger resolves the dependency itself
            z = Dagger.@spawn f(dist, L, K, σ)
            push!(res, (; T, σ, L, z))
        end
    end
    res.z = fetch.(res.z)
    res.σ = fetch.(res.σ)
    res
end

dists = [Cosine, Epanechnikov, Laplace, Logistic, Normal, NormalCanon,
         PGeneralizedGaussian, SkewNormal, SkewedExponentialPower, SymTriangularDist]
lens = [10, 20, 50, 100, 200, 500]

using BenchmarkTools
@btime experiments_dagger(dists, lens)  # slightly slower, with 6 threads: 574.444 ms (9740771 allocations: 271.22 MiB)
@btime experiments_threads(dists, lens) # slightly faster, with 6 threads: 543.696 ms (9681150 allocations: 268.68 MiB)
```
The difference in timings might be pure randomness in this case.
However, what is even more confusing: if I add additional processes up front (after a clean restart of Julia):
```julia
using Distributed
Distributed.addprocs(2, exeflags=`--threads=3`)
```
and then run the previous code, @btime experiments_dagger(dists, lens) is not twice as fast (we added another 6 threads in total), but stays at about the same speed.
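As a sanity check (a minimal sketch using only standard Distributed calls, not part of the timings above), one can confirm that the extra workers and their threads actually exist before benchmarking:

```julia
using Distributed

nworkers() # should report 2 worker processes

# each worker was started with --threads=3, so this should return [3, 3]
fetch.([@spawnat w Threads.nthreads() for w in workers()])
```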
Regarding Threads:
I would never expect using Threads.@spawn to be slower than Dagger - we use Threads.@spawn for our own Dagger tasks. The example in the docs was about Threads.@threads, which is the more convenient method that most people use, but which in fact performs very poorly in comparison. It's a slightly contrived example, sure, but it does illustrate that parallelism isn't always composable if you aren't careful. If we had our own equivalent of Threads.@threads, we'd perform better than it while offering the same gain in usability (see the sketch below).
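For illustration, here is a hypothetical Threads.@threads variant of the same experiment (a sketch, not code from the report above; experiments_threads_static and the flattening are illustrative naming and structure). Because @threads parallelizes only a single loop and chunks its iterations across threads, the nested loops have to be flattened first, and uneven task durations can leave threads idle:

```julia
# Hypothetical @threads-based variant (sketch). Reuses f, DataFrames, and
# Distributions from the code above. @threads parallelizes a single loop,
# so the nested (distribution, length) loops are flattened into pairs.
function experiments_threads_static(dists, lens, K=1000)
    pairs = [(T, L) for T in dists for L in lens]
    z = Vector{Float64}(undef, length(pairs))
    Threads.@threads for i in eachindex(pairs)
        T, L = pairs[i]
        dist = T()
        # std(dist) is recomputed for every pair here; the @spawn version
        # computed it once per distribution
        z[i] = f(dist, L, K, std(dist))
    end
    DataFrame(T=first.(pairs), L=last.(pairs), z=z)
end
```

A @threads-style API on top of Dagger could keep this convenience while letting the scheduler balance uneven tasks dynamically instead of committing to a fixed chunking.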
Regarding Distributed:
Probably the overhead of moving (and serializing/deserializing) data between workers is overwhelming any gain you'd get from using them. Alternatively, the sheer number of allocations could be dominating any gains here. You could use Dagger's visualization capabilities to figure out what's going on: https://juliaparallel.org/Dagger.jl/dev/logging/
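For example, a minimal sketch of collecting logs, assuming the enable_logging!/fetch_logs! API described on that docs page (the exact calls may differ between Dagger versions):

```julia
using Dagger

Dagger.enable_logging!()        # start recording scheduler/task events
experiments_dagger(dists, lens) # run the workload under logging
logs = Dagger.fetch_logs!()     # collect the recorded logs from all workers
Dagger.disable_logging!()

# `logs` can then be rendered (see the docs page above) to visualize where
# tasks ran and how much time was spent moving data between workers.
```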