Polyester.jl

`@batch` slower than `Threads.@threads`

Open ParadaCarleton opened this issue 4 years ago • 13 comments

Tried to see if replacing `Threads.@threads` with `@batch` would improve performance on smaller workloads for a function. This seems to introduce some kind of data race (which wasn't present with `Threads.@threads`), resulting in many NaN values and incorrect answers that were caught by the tests. The relevant loop is here. I'm not sure what the cause is; the for loop iterates over calculations for each data point, and each of these should be completely independent of the calculations for every other data point.
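For context, the kind of swap being described looks like this. This is a minimal sketch with a toy loop, not the actual ParetoSmooth code, and it assumes Polyester.jl is installed:

```julia
using Polyester  # provides @batch; assumed to be installed

# Toy stand-in for the real loop: each iteration is fully independent.
function fill_threads!(out, x)
    Threads.@threads for i in eachindex(out, x)
        out[i] = sin(x[i])
    end
    return out
end

function fill_batch!(out, x)
    # Drop-in replacement; per=thread runs one batch per Julia thread.
    @batch per=thread for i in eachindex(out, x)
        out[i] = sin(x[i])
    end
    return out
end
```

With independent iterations like these, both versions should produce identical results; the report here is that the real loop did not.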

ParadaCarleton avatar Jul 15 '21 15:07 ParadaCarleton

Can you provide a minimal example I can run?

chriselrod avatar Jul 15 '21 15:07 chriselrod

> Can you provide a minimal example I can run?

I'll try, but I expect it to be quite difficult since the tests are based on comparing the results of a fairly complex calculation to those produced by another program in R; I'd have to build tests for much smaller subsets of the calculations. Replacing Threads.@threads with @batch per=thread in the loop I linked generates the problem, but I understand that's not really "Minimal."

ParadaCarleton avatar Jul 15 '21 16:07 ParadaCarleton

> Replacing Threads.@threads with @batch per=thread in the loop I linked generates the problem, but I understand that's not really "Minimal."

While not ideal, having code to call psis would help.

chriselrod avatar Jul 15 '21 16:07 chriselrod

What I'd do from there is store post_sample_size, r_eff, and weights and hopefully produce a minimal example from that.

    @inbounds Threads.@threads for i in eachindex(tail_length)
        tail_length[i] = def_tail_length(post_sample_size, r_eff[i])
        ξ[i] = @views ParetoSmooth.do_psis_i!(weights[i,:], tail_length[i])
    end

A trick for storing things is defining this in your REPL:

julia> const ARGS = Ref{Any}()

and then (using Revise.jl), edit your package to say

    Main.ARGS[] = copy.((post_sample_size, r_eff, weights))
    @inbounds Threads.@threads for i in eachindex(tail_length)
        tail_length[i] = def_tail_length(post_sample_size, r_eff[i])
        ξ[i] = @views ParetoSmooth.do_psis_i!(weights[i,:], tail_length[i])
    end

Now, calling the function will store those arguments into the `ARGS` Ref defined in your REPL. This is a neat trick for getting at internal data structures that would otherwise be hard to extract from deeply nested functions.
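As a self-contained illustration of this Ref trick (toy names, not the ParetoSmooth internals; a different constant name is used here to avoid shadowing `Base.ARGS` in a standalone script):

```julia
# REPL-side storage slot for whatever the nested function sees.
const STASH = Ref{Any}()

# Stand-in for a deeply nested internal function you temporarily edit.
function nested_internal(post_sample_size, r_eff)
    STASH[] = copy.((post_sample_size, r_eff))  # capture the live arguments
    return post_sample_size .* r_eff
end

nested_internal(5, [0.5, 1.0])
stored_n, stored_r = STASH[]  # inspect the captured internals afterwards
```

Once captured, the stored values can be passed directly to the inner function in isolation while building a minimal reproducer.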

chriselrod avatar Jul 15 '21 16:07 chriselrod

> Replacing Threads.@threads with @batch per=thread in the loop I linked generates the problem, but I understand that's not really "Minimal."
>
> While not ideal, having code to call psis would help.

The tests in the test file should do what you need, in that case; the test folder contains an example log-likelihood array used for testing, as well as the saved results of the same calculations in R.

ParadaCarleton avatar Jul 15 '21 16:07 ParadaCarleton

> What I'd do from there is store post_sample_size, r_eff, and weights and hopefully produce a minimal example from that.
>
>     @inbounds Threads.@threads for i in eachindex(tail_length)
>         tail_length[i] = def_tail_length(post_sample_size, r_eff[i])
>         ξ[i] = @views ParetoSmooth.do_psis_i!(weights[i,:], tail_length[i])
>     end
>
> A trick for storing things is defining this in your REPL:
>
>     julia> const ARGS = Ref{Any}()
>
> and then (using Revise.jl), edit your package to say
>
>     Main.ARGS[] = copy.((post_sample_size, r_eff, weights))
>     @inbounds Threads.@threads for i in eachindex(tail_length)
>         tail_length[i] = def_tail_length(post_sample_size, r_eff[i])
>         ξ[i] = @views ParetoSmooth.do_psis_i!(weights[i,:], tail_length[i])
>     end
>
> now, calling the function will store those into the ARGS defined in your REPL. This is a neat trick for getting at internal data structures that may be hard to extract otherwise from deeply nested functions.

I see; how do I get those to you after that?

ParadaCarleton avatar Jul 15 '21 16:07 ParadaCarleton

> I see; how do I get those to you after that?

If they're big, then there isn't really a convenient way. I'll look at the test suite.

chriselrod avatar Jul 15 '21 16:07 chriselrod

> I see; how do I get those to you after that?
>
> If they're big, then there isn't really a convenient way.

Yeah, `weights` is pretty big (big enough to be inconvenient, at least; 32 × 1000).

ParadaCarleton avatar Jul 15 '21 16:07 ParadaCarleton

Something that might be worth noting: I tried to parallelize this with FLoops.jl as well, and it complained about `weights` being a boxed variable, which might create data races. I don't know what that means, but it might be useful to you. I figured this was just a problem for FLoops, rather than an inherent feature of the code that would produce data races, since the tests all passed when using `Threads.@threads`, indicating near-equality between the R and Julia outputs.
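For reference, "boxed" here refers to Julia wrapping a captured variable in a mutable `Core.Box` when a closure both reads and reassigns it; the box is shared mutable state, which is what FLoops warns could race (and it also hurts type inference). A minimal standalone example, unrelated to the ParetoSmooth code:

```julia
# A closure that reassigns a captured variable forces Julia to box it.
function make_counter()
    count = 0
    step() = (count += 1)  # `count` is reassigned inside the closure
    return step
end

c = make_counter()
c(); c()
# The closure's captured field is a Core.Box, i.e. shared mutable state:
is_boxed = first(fieldtypes(typeof(c))) === Core.Box
```

Whether a given boxed variable actually races depends on whether multiple tasks write to it concurrently; a box alone is a warning sign, not proof of a race.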

ParadaCarleton avatar Jul 15 '21 16:07 ParadaCarleton

I don't think that'd be a problem.

The first thing I'll look at is if Polyester/StrideArraysCore has a problem with views. Either with the macro, or with StrideArraysCore's custom view-handling.

It'll be at least a couple hours before I can look at this.

chriselrod avatar Jul 15 '21 16:07 chriselrod

Just as an update, I fixed a bug in StrideArraysCore so that it now gets the correct answer, but @threads is several times faster than @batch for me. I still need to look into why.
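A rough way to reproduce this kind of comparison (a sketch, assuming Polyester.jl is installed; the workload is a placeholder and timings are machine-dependent, so no particular ratio is implied):

```julia
using Polyester  # provides @batch; assumed to be installed

function exp_threads!(out, x)
    Threads.@threads for i in eachindex(out, x)
        out[i] = exp(x[i])
    end
    return out
end

function exp_batch!(out, x)
    @batch for i in eachindex(out, x)
        out[i] = exp(x[i])
    end
    return out
end

x = rand(10_000); out = similar(x)
exp_threads!(out, x); exp_batch!(out, x)   # warm up / compile first
t_threads = @elapsed exp_threads!(out, x)
t_batch   = @elapsed exp_batch!(out, x)
```

For more reliable numbers, BenchmarkTools.jl's `@benchmark` would average over many runs rather than timing a single call.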

chriselrod avatar Jul 18 '21 06:07 chriselrod

> Just as an update, I fixed a bug in StrideArraysCore so that it now gets the correct answer, but @threads is several times faster than @batch for me. I still need to look into why.

Cool, thanks! That's surprising; I would have expected it to be, at worst, the same speed.

ParadaCarleton avatar Jul 18 '21 16:07 ParadaCarleton

I changed the name of the issue to reflect the updated status.

chriselrod avatar Aug 28 '21 16:08 chriselrod