Polyester.jl icon indicating copy to clipboard operation
Polyester.jl copied to clipboard

Will `@batch` effect the `Threads.@threads`?

Open wly2014 opened this issue 2 years ago • 6 comments

When I used @batch and Threads.@threads concurrently, I find a confusing problem.

using TimerOutputs
using LinearAlgebra
using Polyester

function testmul_thread(A,B,C)
    Threads.@threads for i in 1:96
        mul!(C[i],A,B)
    end
end
function testmul_batch(A,B,C)
    @batch per=core for i in 1:96
        mul!(C[i],A,B)
    end
end

@show Threads.nthreads() # 6

function testmuls()
    A = rand(100,100)
    B = rand(100,100)
    C = [rand(100,100) for i in 1:96]
    for i in 1:20
        @timeit "testmul_th" testmul_thread(A,B,C)
        @timeit "testmul_batch" testmul_batch(A,B,C)
    end
end

# first time call
testmuls();

reset_timer!()
testmuls()
show(TimerOutputs.get_defaulttimer());

The number of threads is 6. When I only call the function testmul_thread() in testmuls(), the time consumed only 94.6ms.

───────────────────────────────────────────────────────────────────────
                               Time                    Allocations      
                      ───────────────────────   ────────────────────────
   Tot / % measured:       106ms /  89.5%           7.55MiB /   0.8%

 Section      ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────
 testmul_th       20   94.6ms  100.0%  4.73ms   64.4KiB  100.0%  3.22KiB
 ───────────────────────────────────────────────────────────────────────

But, when I call concurrently testmul_thread() and testmul_batch(), the time consumed of testmul_thread() will increase to 417ms:

──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations      
                         ───────────────────────   ────────────────────────
    Tot / % measured:         518ms /  99.4%           7.56MiB /   0.9%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20    417ms   81.2%  20.9ms   64.9KiB   97.7%  3.25KiB
 testmul_batch       20   97.0ms   18.8%  4.85ms   1.56KiB    2.3%    80.0B
 ──────────────────────────────────────────────────────────────────────────

Why the run of function testmul_batch will effect testmul_thread? Dose @batch change something about global configuration of julia threads?

Thank you very much.

wly2014 avatar Jun 22 '22 09:06 wly2014

By the way, if I call the two functions separately as follows, the increment of time consuming of testmul_thread will be very little:

for i in 1:20
    @timeit "testmul_th" testmul_thread(A,B,C)
end
for i in 1:20
    @timeit "testmul_batch" testmul_batch(A,B,C)
end

wly2014 avatar Jun 22 '22 13:06 wly2014

This is what ThreadingUtilites.sleep_all_tasks() is for, but it does not seem to help much.

julia> using Polyester, LinearAlgebra, TimerOutputs

julia> BLAS.set_num_threads(1)

julia> function testmul_thread(A,B,C)
           Threads.@threads for i in 1:96
               mul!(C[i],A,B)
           end
       end
testmul_thread (generic function with 1 method)

julia> function testmul_batch(A,B,C)
           @batch per=core for i in 1:96
               mul!(C[i],A,B)
           end
       end
testmul_batch (generic function with 1 method)

julia> @show Threads.nthreads() 
Threads.nthreads() = 10
10

julia> function testmuls()
           A = rand(100,100)
           B = rand(100,100)
           C = [rand(100,100) for i in 1:96]
           for i in 1:20
               @timeit "testmul_th" testmul_thread(A,B,C)
               @timeit "testmul_batch" testmul_batch(A,B,C)
           end
       end
testmuls (generic function with 1 method)

julia> testmuls()

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:         16.4s /   0.4%           49.9MiB /   0.2%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   52.2ms   86.1%  2.61ms    109KiB   98.6%  5.46KiB
 testmul_batch       20   8.41ms   13.9%   420μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> using PolyesterWeave

julia> using ThreadingUtilities

julia> function testmuls2()
           A = rand(100,100)
           B = rand(100,100)
           C = [rand(100,100) for i in 1:96]
           for i in 1:20
               @timeit "testmul_th" testmul_thread(A,B,C)
               @timeit "testmul_batch" testmul_batch(A,B,C)
               ThreadingUtilities.sleep_all_tasks()
           end
       end
testmuls2 (generic function with 1 method)

julia> testmuls2()

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:         3.36s /   2.9%           7.68MiB /   1.5%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_batch       20   53.3ms   53.8%  2.66ms   1.56KiB    1.3%    80.0B
 testmul_th          20   45.8ms   46.2%  2.29ms    116KiB   98.7%  5.78KiB
 ──────────────────────────────────────────────────────────────────────────

In this example, it does not seem to help much

julia> function testmulsthread()
           A = rand(100,100)
           B = rand(100,100)
           C = [rand(100,100) for i in 1:96]
           for i in 1:20
               @timeit "testmul_th" testmul_thread(A,B,C)
               #@timeit "testmul_batch" testmul_batch(A,B,C)
               #ThreadingUtilities.sleep_all_tasks()
           end
       end
testmulsthread (generic function with 1 method)

julia> testmulsthread()

julia> reset_timer!();

julia> testmulsthread()

julia> TimerOutputs.get_defaulttimer()
 ───────────────────────────────────────────────────────────────────────
                               Time                    Allocations
                      ───────────────────────   ────────────────────────
   Tot / % measured:       4.40s /   0.7%           7.68MiB /   1.5%

 Section      ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────
 testmul_th       20   30.2ms  100.0%  1.51ms    116KiB  100.0%  5.82KiB
 ───────────────────────────────────────────────────────────────────────

and it hurts the performance of @batch a lot more than it helps the performance of @threads.

chriselrod avatar Jun 22 '22 16:06 chriselrod

Thank you very much. Actually, it seems to work well for me. On my laptop,

julia> versioninfo()
Julia Version 1.6.5
Commit 9058264a69 (2021-12-19 12:30 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 7 4800U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)

When use 6 julia threads and 1 BLAS thread, the result of function:

function testmuls()
    A = rand(100,100)
    B = rand(100,100)
    C = [rand(100,100) for i in 1:96]
    for i in 1:20
        @timeit "testmul_th" testmul_thread(A,B,C)
        # @timeit "testmul_batch" testmul_batch(A,B,C)
        # ThreadingUtilities.sleep_all_tasks()
    end
end

is

 ───────────────────────────────────────────────────────────────────────
                               Time                    Allocations      
                      ───────────────────────   ────────────────────────
   Tot / % measured:      60.5ms /  82.3%           7.55MiB /   0.8%

 Section      ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────
 testmul_th       20   49.8ms  100.0%  2.49ms   63.8KiB  100.0%  3.19KiB
 ───────────────────────────────────────────────────────────────────────

and the result of function

function testmuls()
    A = rand(100,100)
    B = rand(100,100)
    C = [rand(100,100) for i in 1:96]
    for i in 1:20
        # @timeit "testmul_th" testmul_thread(A,B,C)
        @timeit "testmul_batch" testmul_batch(A,B,C)
        # ThreadingUtilities.sleep_all_tasks()
    end
end

is

 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations      
                         ───────────────────────   ────────────────────────
    Tot / % measured:        68.6ms /  64.2%           7.49MiB /   0.0%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_batch       20   44.0ms  100.0%  2.20ms   1.56KiB  100.0%    80.0B
 ──────────────────────────────────────────────────────────────────────────

Finally, the result of function

function testmuls()
    A = rand(100,100)
    B = rand(100,100)
    C = [rand(100,100) for i in 1:96]
    for i in 1:20
        @timeit "testmul_th" testmul_thread(A,B,C)
        @timeit "testmul_batch" testmul_batch(A,B,C)
        ThreadingUtilities.sleep_all_tasks()
    end
end

is about

 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations      
                         ───────────────────────   ────────────────────────
    Tot / % measured:        97.0ms /  96.7%           7.56MiB /   0.8%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_batch       20   49.2ms   52.5%  2.46ms   1.56KiB    2.4%    80.0B
 testmul_th          20   44.5ms   47.5%  2.23ms   63.3KiB   97.6%  3.17KiB
 ──────────────────────────────────────────────────────────────────────────

I wonder why your result of first time run of testmul_batch is only 8.41ms, which is very different to testmul_thread's 52.2ms, but the time of two function is similar on my laptop. Thanks again.

wly2014 avatar Jun 23 '22 02:06 wly2014

I wonder why your result of first time run of testmul_batch is only 8.41ms, which is very different to testmul_thread's 52.2ms, but the time of two function is similar on my laptop. Thanks again.

Hmm, I'm not sure what went wrong earlier.

julia> using Polyester, LinearAlgebra, TimerOutputs
[ Info: Precompiling Polyester [f517fe37-dbe3-4b94-8317-1923a5111588]
ROOT                      :  3.36 %   104349936
LOWERING                  :  5.43 %   168795352
PARSING                   :  4.35 %   135114722
INFERENCE                 :  0.67 %   20892370
CODEGEN                   :  0.66 %   20442216
METHOD_LOOKUP_SLOW        :  0.01 %   265948
METHOD_LOOKUP_FAST        :  0.08 %   2344030
LLVM_OPT                  :  0.84 %   26267292
LLVM_MODULE_FINISH        :  6.51 %   202523080
METHOD_MATCH              :  7.32 %   227515996
TYPE_CACHE_LOOKUP         :  0.46 %   14337602
TYPE_CACHE_INSERT         :  0.06 %   1908924
MACRO_INVOCATION          :  0.03 %   805162
AST_COMPRESS              :  0.11 %   3455434
AST_UNCOMPRESS            :  0.11 %   3338902
SYSIMG_LOAD               :  7.43 %   231040270
ADD_METHOD                : 51.56 %   1602959664
LOAD_MODULE               :  6.18 %   192237612
SAVE_MODULE               :  3.93 %   122197110
INIT_MODULE               :  0.91 %   28354874

julia> BLAS.set_num_threads(1)

julia> function testmul_thread(A,B,C)
           Threads.@threads for i in 1:96
               mul!(C[i],A,B)
           end
       end
testmul_thread (generic function with 1 method)

julia> function testmul_batch(A,B,C)
           @batch per=core for i in 1:96
               mul!(C[i],A,B)
           end
       end
testmul_batch (generic function with 1 method)

julia> @show Threads.nthreads()
Threads.nthreads() = 10
10

julia> function testmuls()
           A = rand(100,100)
           B = rand(100,100)
           C = [rand(100,100) for i in 1:96]
           for i in 1:20
               @timeit "testmul_th" testmul_thread(A,B,C)
               @timeit "testmul_batch" testmul_batch(A,B,C)
           end
       end
testmuls (generic function with 1 method)

julia> testmuls()

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:         654ms /   9.0%           49.2MiB /   0.2%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   52.2ms   89.0%  2.61ms    109KiB   98.6%  5.44KiB
 testmul_batch       20   6.47ms   11.0%   323μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> using PolyesterWeave

julia> using ThreadingUtilities

julia> function testmuls2()
           A = rand(100,100)
           B = rand(100,100)
           C = [rand(100,100) for i in 1:96]
           for i in 1:20
               @timeit "testmul_th" testmul_thread(A,B,C)
               @timeit "testmul_batch" testmul_batch(A,B,C)
               ThreadingUtilities.sleep_all_tasks()
           end
       end
testmuls2 (generic function with 1 method)

julia> testmuls2()

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:         4.52s /   0.4%           9.63MiB /   1.2%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   8.14ms   50.1%   407μs    120KiB   98.7%  6.01KiB
 testmul_batch       20   8.10ms   49.9%   405μs   1.56KiB    1.3%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:         10.0s /   0.6%           8.18MiB /   1.3%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   50.3ms   83.4%  2.52ms    109KiB   98.6%  5.46KiB
 testmul_batch       20   10.0ms   16.6%   500μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:         5.98s /   0.3%           11.3MiB /   1.1%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_batch       20   9.81ms   50.4%   490μs   1.56KiB    1.3%    80.0B
 testmul_th          20   9.64ms   49.6%   482μs    120KiB   98.7%  6.02KiB
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        71.6ms /  84.4%           7.63MiB /   1.4%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   50.3ms   83.2%  2.51ms    109KiB   98.6%  5.46KiB
 testmul_batch       20   10.1ms   16.8%   507μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        23.3ms /  92.0%           7.64MiB /   1.5%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   11.6ms   54.3%   581μs    119KiB   98.7%  5.95KiB
 testmul_batch       20   9.78ms   45.7%   489μs   1.56KiB    1.3%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        64.9ms /  92.7%           7.63MiB /   1.4%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   50.0ms   83.1%  2.50ms    109KiB   98.6%  5.46KiB
 testmul_batch       20   10.2ms   16.9%   509μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        25.4ms /  83.7%           7.64MiB /   1.6%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   11.5ms   54.3%   576μs    120KiB   98.7%  6.00KiB
 testmul_batch       20   9.71ms   45.7%   485μs   1.56KiB    1.3%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        65.3ms /  92.5%           7.63MiB /   1.4%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   50.3ms   83.3%  2.51ms    109KiB   98.6%  5.46KiB
 testmul_batch       20   10.1ms   16.7%   504μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        25.6ms /  83.9%           7.64MiB /   1.6%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   11.7ms   54.5%   585μs    120KiB   98.7%  6.01KiB
 testmul_batch       20   9.77ms   45.5%   489μs   1.56KiB    1.3%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        67.0ms /  90.4%           7.63MiB /   1.4%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   50.3ms   83.1%  2.52ms    109KiB   98.6%  5.46KiB
 testmul_batch       20   10.2ms   16.9%   511μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        23.7ms /  93.2%           7.64MiB /   1.6%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   12.4ms   56.0%   618μs    120KiB   98.7%  6.00KiB
 testmul_batch       20   9.70ms   44.0%   485μs   1.56KiB    1.3%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> versioninfo()
Julia Version 1.9.0-DEV.824
Commit 5a0685a8a6* (2022-06-21 07:43 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.3 (ORCJIT, skylake-avx512)
  Threads: 10 on 28 virtual cores

Now I see a good improvement from sleep_all_threads(), without much of a penalty in tastmul_batch.

My times are much faster because my CPU has >10 physical cores and AVX512 (14 physical cores, but I only used 10 threads so that we get 2 iterations/thread).

chriselrod avatar Jun 23 '22 02:06 chriselrod

so that we get 2 iterations/thread

May be not 2 iterations/thread? Because the for iterations of function testmul_thread and testmul_batch are 96, so each thread will get about 96/10 iterations?

wly2014 avatar Jun 23 '22 03:06 wly2014

so that we get 2 iterations/thread

May be not 2 iterations/thread? Because the for iterations of function testmul_thread and testmul_batch are 96, so each thread will get about 96/10 iterations?

Yes, 96. I should've looked at the code.

chriselrod avatar Jun 23 '22 17:06 chriselrod