Polyester.jl
Polyester.jl copied to clipboard
Will `@batch` effect the `Threads.@threads`?
When I used @batch
and Threads.@threads
concurrently, I find a confusing problem.
using TimerOutputs
using LinearAlgebra
using Polyester
function testmul_thread(A,B,C)
Threads.@threads for i in 1:96
mul!(C[i],A,B)
end
end
function testmul_batch(A,B,C)
@batch per=core for i in 1:96
mul!(C[i],A,B)
end
end
@show Threads.nthreads() # 6
function testmuls()
A = rand(100,100)
B = rand(100,100)
C = [rand(100,100) for i in 1:96]
for i in 1:20
@timeit "testmul_th" testmul_thread(A,B,C)
@timeit "testmul_batch" testmul_batch(A,B,C)
end
end
# first time call
testmuls();
reset_timer!()
testmuls()
show(TimerOutputs.get_defaulttimer());
The number of threads is 6. When I only call the function testmul_thread()
in testmuls()
, the time consumed only 94.6ms
.
───────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 106ms / 89.5% 7.55MiB / 0.8%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
testmul_th 20 94.6ms 100.0% 4.73ms 64.4KiB 100.0% 3.22KiB
───────────────────────────────────────────────────────────────────────
But, when I call concurrently testmul_thread()
and testmul_batch()
, the time consumed of testmul_thread()
will increase to 417ms
:
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 518ms / 99.4% 7.56MiB / 0.9%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 417ms 81.2% 20.9ms 64.9KiB 97.7% 3.25KiB
testmul_batch 20 97.0ms 18.8% 4.85ms 1.56KiB 2.3% 80.0B
──────────────────────────────────────────────────────────────────────────
Why the run of function testmul_batch
will effect testmul_thread
? Dose @batch
change something about global configuration of julia threads?
Thank you very much.
By the way, if I call the two functions separately as follows, the increment of time consuming of testmul_thread
will be very little:
for i in 1:20
@timeit "testmul_th" testmul_thread(A,B,C)
end
for i in 1:20
@timeit "testmul_batch" testmul_batch(A,B,C)
end
This is what ThreadingUtilites.sleep_all_tasks()
is for, but it does not seem to help much.
julia> using Polyester, LinearAlgebra, TimerOutputs
julia> BLAS.set_num_threads(1)
julia> function testmul_thread(A,B,C)
Threads.@threads for i in 1:96
mul!(C[i],A,B)
end
end
testmul_thread (generic function with 1 method)
julia> function testmul_batch(A,B,C)
@batch per=core for i in 1:96
mul!(C[i],A,B)
end
end
testmul_batch (generic function with 1 method)
julia> @show Threads.nthreads()
Threads.nthreads() = 10
10
julia> function testmuls()
A = rand(100,100)
B = rand(100,100)
C = [rand(100,100) for i in 1:96]
for i in 1:20
@timeit "testmul_th" testmul_thread(A,B,C)
@timeit "testmul_batch" testmul_batch(A,B,C)
end
end
testmuls (generic function with 1 method)
julia> testmuls()
julia> reset_timer!();
julia> testmuls()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 16.4s / 0.4% 49.9MiB / 0.2%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 52.2ms 86.1% 2.61ms 109KiB 98.6% 5.46KiB
testmul_batch 20 8.41ms 13.9% 420μs 1.56KiB 1.4% 80.0B
──────────────────────────────────────────────────────────────────────────
julia> using PolyesterWeave
julia> using ThreadingUtilities
julia> function testmuls2()
A = rand(100,100)
B = rand(100,100)
C = [rand(100,100) for i in 1:96]
for i in 1:20
@timeit "testmul_th" testmul_thread(A,B,C)
@timeit "testmul_batch" testmul_batch(A,B,C)
ThreadingUtilities.sleep_all_tasks()
end
end
testmuls2 (generic function with 1 method)
julia> testmuls2()
julia> reset_timer!();
julia> testmuls2()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 3.36s / 2.9% 7.68MiB / 1.5%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_batch 20 53.3ms 53.8% 2.66ms 1.56KiB 1.3% 80.0B
testmul_th 20 45.8ms 46.2% 2.29ms 116KiB 98.7% 5.78KiB
──────────────────────────────────────────────────────────────────────────
In this example, it does not seem to help much
julia> function testmulsthread()
A = rand(100,100)
B = rand(100,100)
C = [rand(100,100) for i in 1:96]
for i in 1:20
@timeit "testmul_th" testmul_thread(A,B,C)
#@timeit "testmul_batch" testmul_batch(A,B,C)
#ThreadingUtilities.sleep_all_tasks()
end
end
testmulsthread (generic function with 1 method)
julia> testmulsthread()
julia> reset_timer!();
julia> testmulsthread()
julia> TimerOutputs.get_defaulttimer()
───────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 4.40s / 0.7% 7.68MiB / 1.5%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
testmul_th 20 30.2ms 100.0% 1.51ms 116KiB 100.0% 5.82KiB
───────────────────────────────────────────────────────────────────────
and it hurts the performance of @batch
a lot more than it helps the performance of @threads
.
Thank you very much. Actually, it seems to work well for me. On my laptop,
julia> versioninfo()
Julia Version 1.6.5
Commit 9058264a69 (2021-12-19 12:30 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: AMD Ryzen 7 4800U with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, znver2)
When use 6
julia threads and 1 BLAS thread, the result of function:
function testmuls()
A = rand(100,100)
B = rand(100,100)
C = [rand(100,100) for i in 1:96]
for i in 1:20
@timeit "testmul_th" testmul_thread(A,B,C)
# @timeit "testmul_batch" testmul_batch(A,B,C)
# ThreadingUtilities.sleep_all_tasks()
end
end
is
───────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 60.5ms / 82.3% 7.55MiB / 0.8%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
testmul_th 20 49.8ms 100.0% 2.49ms 63.8KiB 100.0% 3.19KiB
───────────────────────────────────────────────────────────────────────
and the result of function
function testmuls()
A = rand(100,100)
B = rand(100,100)
C = [rand(100,100) for i in 1:96]
for i in 1:20
# @timeit "testmul_th" testmul_thread(A,B,C)
@timeit "testmul_batch" testmul_batch(A,B,C)
# ThreadingUtilities.sleep_all_tasks()
end
end
is
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 68.6ms / 64.2% 7.49MiB / 0.0%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_batch 20 44.0ms 100.0% 2.20ms 1.56KiB 100.0% 80.0B
──────────────────────────────────────────────────────────────────────────
Finally, the result of function
function testmuls()
A = rand(100,100)
B = rand(100,100)
C = [rand(100,100) for i in 1:96]
for i in 1:20
@timeit "testmul_th" testmul_thread(A,B,C)
@timeit "testmul_batch" testmul_batch(A,B,C)
ThreadingUtilities.sleep_all_tasks()
end
end
is about
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 97.0ms / 96.7% 7.56MiB / 0.8%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_batch 20 49.2ms 52.5% 2.46ms 1.56KiB 2.4% 80.0B
testmul_th 20 44.5ms 47.5% 2.23ms 63.3KiB 97.6% 3.17KiB
──────────────────────────────────────────────────────────────────────────
I wonder why your result of first time run of testmul_batch
is only 8.41ms
, which is very different to testmul_thread
's 52.2ms
, but the time of two function is similar on my laptop.
Thanks again.
I wonder why your result of first time run of
testmul_batch
is only8.41ms
, which is very different totestmul_thread
's52.2ms
, but the time of two function is similar on my laptop. Thanks again.
Hmm, I'm not sure what went wrong earlier.
julia> using Polyester, LinearAlgebra, TimerOutputs
[ Info: Precompiling Polyester [f517fe37-dbe3-4b94-8317-1923a5111588]
ROOT : 3.36 % 104349936
LOWERING : 5.43 % 168795352
PARSING : 4.35 % 135114722
INFERENCE : 0.67 % 20892370
CODEGEN : 0.66 % 20442216
METHOD_LOOKUP_SLOW : 0.01 % 265948
METHOD_LOOKUP_FAST : 0.08 % 2344030
LLVM_OPT : 0.84 % 26267292
LLVM_MODULE_FINISH : 6.51 % 202523080
METHOD_MATCH : 7.32 % 227515996
TYPE_CACHE_LOOKUP : 0.46 % 14337602
TYPE_CACHE_INSERT : 0.06 % 1908924
MACRO_INVOCATION : 0.03 % 805162
AST_COMPRESS : 0.11 % 3455434
AST_UNCOMPRESS : 0.11 % 3338902
SYSIMG_LOAD : 7.43 % 231040270
ADD_METHOD : 51.56 % 1602959664
LOAD_MODULE : 6.18 % 192237612
SAVE_MODULE : 3.93 % 122197110
INIT_MODULE : 0.91 % 28354874
julia> BLAS.set_num_threads(1)
julia> function testmul_thread(A,B,C)
Threads.@threads for i in 1:96
mul!(C[i],A,B)
end
end
testmul_thread (generic function with 1 method)
julia> function testmul_batch(A,B,C)
@batch per=core for i in 1:96
mul!(C[i],A,B)
end
end
testmul_batch (generic function with 1 method)
julia> @show Threads.nthreads()
Threads.nthreads() = 10
10
julia> function testmuls()
A = rand(100,100)
B = rand(100,100)
C = [rand(100,100) for i in 1:96]
for i in 1:20
@timeit "testmul_th" testmul_thread(A,B,C)
@timeit "testmul_batch" testmul_batch(A,B,C)
end
end
testmuls (generic function with 1 method)
julia> testmuls()
julia> reset_timer!();
julia> testmuls()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 654ms / 9.0% 49.2MiB / 0.2%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 52.2ms 89.0% 2.61ms 109KiB 98.6% 5.44KiB
testmul_batch 20 6.47ms 11.0% 323μs 1.56KiB 1.4% 80.0B
──────────────────────────────────────────────────────────────────────────
julia> using PolyesterWeave
julia> using ThreadingUtilities
julia> function testmuls2()
A = rand(100,100)
B = rand(100,100)
C = [rand(100,100) for i in 1:96]
for i in 1:20
@timeit "testmul_th" testmul_thread(A,B,C)
@timeit "testmul_batch" testmul_batch(A,B,C)
ThreadingUtilities.sleep_all_tasks()
end
end
testmuls2 (generic function with 1 method)
julia> testmuls2()
julia> reset_timer!();
julia> testmuls2()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 4.52s / 0.4% 9.63MiB / 1.2%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 8.14ms 50.1% 407μs 120KiB 98.7% 6.01KiB
testmul_batch 20 8.10ms 49.9% 405μs 1.56KiB 1.3% 80.0B
──────────────────────────────────────────────────────────────────────────
julia> reset_timer!();
julia> testmuls()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 10.0s / 0.6% 8.18MiB / 1.3%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 50.3ms 83.4% 2.52ms 109KiB 98.6% 5.46KiB
testmul_batch 20 10.0ms 16.6% 500μs 1.56KiB 1.4% 80.0B
──────────────────────────────────────────────────────────────────────────
julia> reset_timer!();
julia> testmuls2()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 5.98s / 0.3% 11.3MiB / 1.1%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_batch 20 9.81ms 50.4% 490μs 1.56KiB 1.3% 80.0B
testmul_th 20 9.64ms 49.6% 482μs 120KiB 98.7% 6.02KiB
──────────────────────────────────────────────────────────────────────────
julia> reset_timer!();
julia> testmuls()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 71.6ms / 84.4% 7.63MiB / 1.4%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 50.3ms 83.2% 2.51ms 109KiB 98.6% 5.46KiB
testmul_batch 20 10.1ms 16.8% 507μs 1.56KiB 1.4% 80.0B
──────────────────────────────────────────────────────────────────────────
julia> reset_timer!();
julia> testmuls2()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 23.3ms / 92.0% 7.64MiB / 1.5%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 11.6ms 54.3% 581μs 119KiB 98.7% 5.95KiB
testmul_batch 20 9.78ms 45.7% 489μs 1.56KiB 1.3% 80.0B
──────────────────────────────────────────────────────────────────────────
julia> reset_timer!();
julia> testmuls()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 64.9ms / 92.7% 7.63MiB / 1.4%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 50.0ms 83.1% 2.50ms 109KiB 98.6% 5.46KiB
testmul_batch 20 10.2ms 16.9% 509μs 1.56KiB 1.4% 80.0B
──────────────────────────────────────────────────────────────────────────
julia> reset_timer!();
julia> testmuls2()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 25.4ms / 83.7% 7.64MiB / 1.6%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 11.5ms 54.3% 576μs 120KiB 98.7% 6.00KiB
testmul_batch 20 9.71ms 45.7% 485μs 1.56KiB 1.3% 80.0B
──────────────────────────────────────────────────────────────────────────
julia> reset_timer!();
julia> testmuls()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 65.3ms / 92.5% 7.63MiB / 1.4%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 50.3ms 83.3% 2.51ms 109KiB 98.6% 5.46KiB
testmul_batch 20 10.1ms 16.7% 504μs 1.56KiB 1.4% 80.0B
──────────────────────────────────────────────────────────────────────────
julia> reset_timer!();
julia> testmuls2()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 25.6ms / 83.9% 7.64MiB / 1.6%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 11.7ms 54.5% 585μs 120KiB 98.7% 6.01KiB
testmul_batch 20 9.77ms 45.5% 489μs 1.56KiB 1.3% 80.0B
──────────────────────────────────────────────────────────────────────────
julia> reset_timer!();
julia> testmuls()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 67.0ms / 90.4% 7.63MiB / 1.4%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 50.3ms 83.1% 2.52ms 109KiB 98.6% 5.46KiB
testmul_batch 20 10.2ms 16.9% 511μs 1.56KiB 1.4% 80.0B
──────────────────────────────────────────────────────────────────────────
julia> reset_timer!();
julia> testmuls2()
julia> TimerOutputs.get_defaulttimer()
──────────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 23.7ms / 93.2% 7.64MiB / 1.6%
Section ncalls time %tot avg alloc %tot avg
──────────────────────────────────────────────────────────────────────────
testmul_th 20 12.4ms 56.0% 618μs 120KiB 98.7% 6.00KiB
testmul_batch 20 9.70ms 44.0% 485μs 1.56KiB 1.3% 80.0B
──────────────────────────────────────────────────────────────────────────
julia> versioninfo()
Julia Version 1.9.0-DEV.824
Commit 5a0685a8a6* (2022-06-21 07:43 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.3 (ORCJIT, skylake-avx512)
Threads: 10 on 28 virtual cores
Now I see a good improvement from sleep_all_threads()
, without much of a penalty in tastmul_batch
.
My times are much faster because my CPU has >10 physical cores and AVX512 (14 physical cores, but I only used 10 threads so that we get 2 iterations/thread).
so that we get 2 iterations/thread
May be not 2 iterations/thread? Because the for
iterations of function testmul_thread
and testmul_batch
are 96
, so each thread will get about 96/10 iterations?
so that we get 2 iterations/thread
May be not 2 iterations/thread? Because the
for
iterations of functiontestmul_thread
andtestmul_batch
are96
, so each thread will get about 96/10 iterations?
Yes, 96. I should've looked at the code.