Nested loop example revisited - plain Threads look as fast as Dagger
It seems that the nested loop example from the docs can be given the following implementation with Threads:
```julia
using Dagger, Random, Distributions, StatsBase, DataFrames

function f(dist, len, reps, σ)
    v = Vector{Float64}(undef, len) # preallocate once and reuse across reps to avoid allocations
    maximum(mean(rand!(dist, v)) for _ in 1:reps) / σ
end

function experiments_threads(dists, lens, K=1000)
    res = DataFrame()
    @sync for T in dists
        dist = T()
        σ = Threads.@spawn std(dist)
        for L in lens
            # fetch(σ) runs inside the spawned task, not in the loop body
            z = Threads.@spawn f(dist, L, K, fetch(σ))
            push!(res, (; T, σ, L, z))
        end
    end
    res.z = fetch.(res.z)
    res.σ = fetch.(res.σ)
    res
end

function experiments_dagger(dists, lens, K=1000)
    res = DataFrame()
    @sync for T in dists
        dist = T()
        σ = Dagger.@spawn std(dist)
        for L in lens
            # σ is passed as an unevaluated thunk; Dagger resolves the dependency itself
            z = Dagger.@spawn f(dist, L, K, σ)
            push!(res, (; T, σ, L, z))
        end
    end
    res.z = fetch.(res.z)
    res.σ = fetch.(res.σ)
    res
end

dists = [Cosine, Epanechnikov, Laplace, Logistic, Normal, NormalCanon,
         PGeneralizedGaussian, SkewNormal, SkewedExponentialPower, SymTriangularDist]
lens = [10, 20, 50, 100, 200, 500]

using BenchmarkTools
@btime experiments_dagger(dists, lens)  # slightly slower, with 6 threads: 574.444 ms (9740771 allocations: 271.22 MiB)
@btime experiments_threads(dists, lens) # slightly faster, with 6 threads: 543.696 ms (9681150 allocations: 268.68 MiB)
```
The difference in timings might be pure randomness in this case.
However, what is even more confusing: if I add additional processes up front (after a clean restart of Julia):
```julia
using Distributed
Distributed.addprocs(2, exeflags=`--threads=3`)
```
and then run the previous code, @btime experiments_dagger(dists, lens) is not twice as fast (we added another 6 threads in total), but stays at about the same speed.
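As a sanity check (a minimal sketch using only standard Distributed calls, not part of the timings above), one can confirm that the extra workers and their threads actually exist before benchmarking:

```julia
using Distributed

nworkers() # should report 2 worker processes

# each worker was started with --threads=3, so this should return [3, 3]
fetch.([@spawnat w Threads.nthreads() for w in workers()])
```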
Regarding Threads:
I would never expect using Threads.@spawn to be slower than Dagger - we use Threads.@spawn for our own Dagger tasks. The example in the docs was about Threads.@threads, which is the more convenient method that most people use, but which in fact performs very poorly in comparison. It's a slightly contrived example, sure, but it does illustrate that parallelism isn't always composable if you aren't careful. If we had our own equivalent of Threads.@threads, we'd perform better than it while offering the same gain in usability (see the sketch below).
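For illustration, here is a hypothetical Threads.@threads variant of the same experiment (a sketch, not code from the report above; experiments_threads_static and the flattening are illustrative naming and structure). Because @threads parallelizes only a single loop and chunks its iterations across threads, the nested loops have to be flattened first, and uneven task durations can leave threads idle:

```julia
# Hypothetical @threads-based variant (sketch). Reuses f, DataFrames, and
# Distributions from the code above. @threads parallelizes a single loop,
# so the nested (distribution, length) loops are flattened into pairs.
function experiments_threads_static(dists, lens, K=1000)
    pairs = [(T, L) for T in dists for L in lens]
    z = Vector{Float64}(undef, length(pairs))
    Threads.@threads for i in eachindex(pairs)
        T, L = pairs[i]
        dist = T()
        # std(dist) is recomputed for every pair here; the @spawn version
        # computed it once per distribution
        z[i] = f(dist, L, K, std(dist))
    end
    DataFrame(T=first.(pairs), L=last.(pairs), z=z)
end
```

A @threads-style API on top of Dagger could keep this convenience while letting the scheduler balance uneven tasks dynamically instead of committing to a fixed chunking.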
Regarding Distributed:
Probably the overhead of moving (and serializing/deserializing) data between workers is overwhelming any gain you'd get from using them. Alternatively, the sheer number of allocations could be dominating any gains here. You could use Dagger's visualization capabilities to figure out what's going on: https://juliaparallel.org/Dagger.jl/dev/logging/
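For example, a minimal sketch of collecting logs, assuming the enable_logging!/fetch_logs! API described on that docs page (the exact calls may differ between Dagger versions):

```julia
using Dagger

Dagger.enable_logging!()        # start recording scheduler/task events
experiments_dagger(dists, lens) # run the workload under logging
logs = Dagger.fetch_logs!()     # collect the recorded logs from all workers
Dagger.disable_logging!()

# `logs` can then be rendered (see the docs page above) to visualize where
# tasks ran and how much time was spent moving data between workers.
```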