Polyester.jl
`@batch` slower than `Threads.@threads`
I tried to see if replacing `Threads.@threads` with `@batch` would improve performance on smaller workloads for a function. This seems to introduce some kind of data race (not present with `Threads.@threads`), resulting in many `NaN` values and incorrect results, which were caught by the tests. The relevant loop is here. I'm not sure what the cause is; the loop iterates over the calculations for each data point, and each of these should be completely independent of the calculations for every other data point.
Can you provide a minimal example I can run?
I'll try, but I expect it to be quite difficult since the tests are based on comparing the results of a fairly complex calculation to those produced by another program in R; I'd have to build tests for much smaller subsets of the calculations. Replacing `Threads.@threads` with `@batch per=thread` in the loop I linked generates the problem, but I understand that's not really "Minimal."
While not ideal, having code to call `psis` would help. What I'd do from there is store `post_sample_size`, `r_eff`, and `weights` and hopefully produce a minimal example from that.
```julia
@inbounds Threads.@threads for i in eachindex(tail_length)
    tail_length[i] = def_tail_length(post_sample_size, r_eff[i])
    ξ[i] = @views ParetoSmooth.do_psis_i!(weights[i,:], tail_length[i])
end
```
A trick for storing things is defining this in your REPL:
```julia
julia> const ARGS = Ref{Any}()
```
and then (using Revise.jl), edit your package to say
```julia
Main.ARGS[] = copy.((post_sample_size, r_eff, weights))
@inbounds Threads.@threads for i in eachindex(tail_length)
    tail_length[i] = def_tail_length(post_sample_size, r_eff[i])
    ξ[i] = @views ParetoSmooth.do_psis_i!(weights[i,:], tail_length[i])
end
```
Now, calling the function will store those into the `ARGS` defined in your REPL.
This is a neat trick for getting at internal data structures that may be hard to extract otherwise from deeply nested functions.
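(To make the round trip concrete, here is a sketch of what the REPL side might look like once the package is instrumented; the `psis(log_likelihood_array)` call is only a placeholder for however the function is normally invoked, and `Serialization` is just one convenient way to persist the captured values:)

```julia
julia> const ARGS = Ref{Any}()          # as above

julia> psis(log_likelihood_array)       # placeholder call; runs the instrumented loop once

julia> post_sample_size, r_eff, weights = Main.ARGS[];   # pull the captured values back out

julia> using Serialization

julia> serialize("psis_args.jls", Main.ARGS[])   # save for later inspection or sharing
```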
> While not ideal, having code to call `psis` would help.
The tests in the test file should do what you need, in that case; the test folder contains an example log-likelihood array, as well as the saved results of the same calculations in R.
> What I'd do from there is store `post_sample_size`, `r_eff`, and `weights` and hopefully produce a minimal example from that.
I see; how do I get those to you after that?
If they're big, then there isn't really a convenient way. I'll look at the test suite.
Yeah, `weights` is pretty big (big enough to be inconvenient, at least; 32 × 1000).
Something that might be worth noting: I tried to parallelize this with FLoops.jl as well, and it complained about `weights` being a boxed variable, which might create data races. I don't know what that means, but it might be useful to you. I figured this was just a problem for FLoops rather than an inherent feature of the code that would produce data races, since the tests all passed when using `Threads.@threads`, indicating near-equality between the R and Julia outputs.
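(For what it's worth, the "boxed variable" warning refers to how Julia lowers closures: if a variable captured by the loop body is also reassigned somewhere in the enclosing function, it is stored in a shared `Core.Box`, which a data-race checker has to treat conservatively as potentially unsafe. A contrived sketch of the pattern, not taken from ParetoSmooth:)

```julia
# Contrived example of a "boxed" capture.
function boxed_sketch(n)
    weights = zeros(n)
    Threads.@threads for i in 1:n
        weights[i] = i                  # the closure built for the loop body captures `weights`
    end
    weights = weights ./ sum(weights)   # reassigning `weights` forces that capture into a Core.Box
    return weights
end
```

In a pattern like this the in-place writes inside the loop are still fine, which would be consistent with the tests passing under `Threads.@threads`; the warning is about the box itself being shared mutable state.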
I don't think that'd be a problem.
The first thing I'll look at is whether Polyester/StrideArraysCore has a problem with views, either with the macro or with StrideArraysCore's custom view handling.
It'll be at least a couple hours before I can look at this.
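(While waiting on that, a minimal reproducer for the views question might look something like the following; it's a sketch built around the 32 × 1000 shape mentioned above, not code from either package:)

```julia
using Polyester

# Normalize each row through a view, mimicking how do_psis_i! mutates weights[i, :].
function normalize_rows_batch!(weights)
    @batch per=thread for i in axes(weights, 1)
        w = @view weights[i, :]
        w ./= sum(w)
    end
    return weights
end

function normalize_rows_threads!(weights)
    Threads.@threads for i in axes(weights, 1)
        w = @view weights[i, :]
        w ./= sum(w)
    end
    return weights
end

w = rand(32, 1000)
a = normalize_rows_batch!(copy(w))
b = normalize_rows_threads!(copy(w))
a ≈ b   # should hold if @batch handles writes through views correctly
```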
Just as an update, I fixed a bug in StrideArraysCore so that it now gets the correct answer, but `@threads` is several times faster than `@batch` for me. I still need to look into why.
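(To put numbers on that gap, one way to compare the two is something along these lines, reusing the row-normalization stand-ins from the sketch above; `BenchmarkTools` is an assumption on my part, and `setup`/`evals=1` keep each trial working on a fresh copy:)

```julia
using BenchmarkTools

w = rand(32, 1000)

@btime normalize_rows_batch!(x)   setup=(x = copy($w)) evals=1
@btime normalize_rows_threads!(x) setup=(x = copy($w)) evals=1
```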
Cool, thanks! That's surprising -- I would have expected it to be, at worst, the same speed.
I changed the name of the issue to reflect the updated status.