JuLIP.jl
Redesign Multi-threading
cc @tjjarvinen
I had a request from Noam Bernstein to allow setting the number of threads in each call to forces, etc.
Adding comments on the implementation
1
From the Julia documentation:
"The value of threadid() may change even within a single iteration."
So, using it to get the address of temporary data is a possible source of bugs!
The correct way to handle temporary data is to use Channels: create a Channel that holds the temporary data instances, pull an instance from the Channel at the beginning of each iteration, and put the instance back into the Channel once the task is done.
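A minimal sketch of that pattern (buffer size, loop length, and names are illustrative): a Channel holds pre-allocated buffers, each iteration takes one, uses it, and returns it, so no two concurrently running iterations ever share a buffer, regardless of what threadid() returns.

```julia
using Base.Threads

# Pre-allocate one buffer per thread and park them in a Channel.
nbuf = Threads.nthreads()
buffers = Channel{Vector{Float64}}(nbuf)
for _ in 1:nbuf
    put!(buffers, zeros(10))   # allocation happens only here, once
end

results = zeros(100)
Threads.@threads for i in 1:100
    tmp = take!(buffers)       # acquire a private buffer
    tmp .= i                   # ... do work using tmp ...
    results[i] = sum(tmp)
    put!(buffers, tmp)         # release it for reuse
end
```

Because take!/put! are thread safe, this stays correct even if a task migrates between threads mid-iteration.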
2
Setting the number of threads for each call to forces needs a special implementation. FoldsThreads.jl has a TaskPoolEx executor that can do it.
This means using either Folds.jl or FLoops.jl. The advantage here is that several different executors are available, each with different strengths, which opens up several potential use cases. For example, we could use DistributedEx to distribute the force calculation.
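A hedged sketch of what that could look like, with Folds.sum standing in for a forces-style reduction. I'm assuming here that TaskPoolEx accepts an `ntasks` keyword as described in the FoldsThreads documentation; `pair_energy` is a hypothetical name, and this is not runnable without those packages.

```julia
using Folds, FoldsThreads

# Sketch only: the executor, not the algorithm, carries the threading
# policy, so the caller can choose the number of tasks per call.
# Assumption: TaskPoolEx takes an `ntasks` keyword argument.
function pair_energy(rs; ntasks = Threads.nthreads())
    ex = TaskPoolEx(ntasks = ntasks)          # per-call task count
    return Folds.sum(r -> 0.5 * r^2, rs, ex)  # toy stand-in reduction
end
```

Swapping `ex` for DistributedEx (or any other executor) would change the execution strategy without touching the algorithm.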
On 2 - this sounds great, I'm looking forward to learning more.
On 1 - this is a bit cryptic to me, so let me check that I understand it correctly: the problem is that a task could be paused in the middle of executing the body of a for-loop, and the same thread could then start a new iteration of the loop body, so two iterations end up writing into the same temporary arrays at the same time.
Correct? This would indeed be entirely unexpected for me and, yes, it could cause plenty of bugs. But I don't like the consequence, which seems to be that I need far more temporary arrays than threads. And we will end up allocating a lot, no?
Are the Channels you are referring to the same as what I call ObjectPools? Or is this something else entirely?
I now also think that, if I understood you correctly in 1, my temporary arrays in ObjectPools.jl have the same bug.
threadid() was not an issue with earlier Julia versions. It became an issue when the default scheduling was changed from static to dynamic.
Basically, what can happen is:
- thread id=1 starts a task
- at the beginning the task gets threadid() == 1
- the task execution is handed to another thread (kernel scheduling, some IO wait time, etc.)
- thread id=2 continues the task
- thread id=1 is given another task
- the other task also gets threadid() == 1, and now two tasks access the same data
Another problem is that the behaviour can depend on the Julia version, so there is no relying on what threadid() returns.
Channels
Channels are a Julia construct for communicating between tasks. The documentation is in the asynchronous-programming section, but Channels are completely thread safe and can be used with threads. There is also a RemoteChannel for distributed computing.
Here is an example of how to use Channels:
julia> c = Channel(4)
Channel{Any}(4) (empty)
julia> for i in 1:4
           put!(c, i)
       end

julia> Threads.@threads for i in 1:8
           t = take!(c)
           println(i, " ", Threads.threadid(), ",", t)
           put!(c, t)
           sleep(rand())
       end
1 3,3
3 2,2
7 4,4
5 1,1
8 4,3
2 3,2
6 3,4
4 2,1
So, allocations are done only once, and only as many buffers as there are (worker) threads need to be allocated.
In fact, now that I think about it, we should probably replace ObjectPools.jl with Channels too. It would be more efficient.
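A sketch of what such a Channel-backed replacement could look like (`BufferPool` and `withbuffer` are hypothetical names, not an existing ObjectPools.jl API): the pool pre-allocates its buffers once, and the helper guarantees the buffer is returned even if the work throws.

```julia
# Channel-backed pool of reusable buffers (hypothetical sketch).
struct BufferPool{T}
    ch::Channel{T}
end

# Build a pool of n buffers produced by the zero-argument `make`.
function BufferPool(make, n::Integer)
    ch = Channel{typeof(make())}(n)
    for _ in 1:n
        put!(ch, make())       # all allocation happens here, once
    end
    return BufferPool(ch)
end

# Acquire a buffer, run f on it, and always return it to the pool.
function withbuffer(f, pool::BufferPool)
    buf = take!(pool.ch)
    try
        return f(buf)
    finally
        put!(pool.ch, buf)     # released even if f throws
    end
end
```

Usage would look like `withbuffer(b -> compute!(b), pool)`, with the try/finally making the acquire/release pairing impossible to forget.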
Thank you for the explanation.
Since when has this dynamic scheduling started? And is there a way to avoid it? I'll need to send around a warning to people who are using multi-threading with ACE / JuLIP.
EDIT: I see that I simply need to tag new versions with the :static keyword inserted. Do you agree this will fix things temporarily?
Yes, it will work as a temporary fix.
Since when has this dynamic scheduling started?
I think it was either v1.6 or v1.7 that introduced dynamic load balancing.
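For reference, the temporary fix mentioned above could look like the following sketch (names illustrative): with the :static schedule, iterations are pinned to threads, so threadid()-indexed temporaries are safe again.

```julia
using Base.Threads

# One temporary buffer per thread, indexed by threadid().
tmps = [zeros(3) for _ in 1:nthreads()]
out = zeros(8)

# :static pins each iteration to one thread for its whole duration,
# so indexing temporaries by threadid() is safe here (and only here).
@threads :static for i in 1:8
    tmp = tmps[threadid()]
    tmp .= i
    out[i] = sum(tmp)
end
```

With the default dynamic schedule the same threadid()-indexing would be exactly the bug discussed above.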
I tested the Channel overhead
julia> using BenchmarkTools
julia> N = 100
100
julia> c = Channel(N)
Channel{Any}(100) (empty)
julia> for i in 1:N; put!(c, rand(3)) end
julia> b = @benchmarkable take!(c) samples=N
Benchmark(evals=1, seconds=5.0, samples=100)
julia> run(b)
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 41.000 ns … 1.070 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 83.000 ns ┊ GC (median): 0.00%
Time (mean ± σ): 10.798 μs ± 107.003 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█
█▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
41 ns Histogram: log(frequency) by time 3.5 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> p = @benchmarkable put!(c, rand(3)) samples=N
Benchmark(evals=1, seconds=5.0, samples=100)
julia> run(p)
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 83.000 ns … 14.083 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 84.000 ns ┊ GC (median): 0.00%
Time (mean ± σ): 275.830 ns ± 1.441 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█
█▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
83 ns Histogram: log(frequency) by time 3.71 μs <
Memory estimate: 80 bytes, allocs estimate: 1.
To me this sounds like an acceptable overhead.
Adding a threaded version to above
julia> function f(c, n)
           Threads.@threads for i in 1:n
               t = take!(c)
               put!(c, t)
           end
       end
f (generic function with 1 method)
julia> @benchmark f(c,N)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 15.875 μs … 1.495 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 47.417 μs ┊ GC (median): 0.00%
Time (mean ± σ): 56.133 μs ± 45.093 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▅▇▅▄▅▇█▅▄▃▂
▁▃▆▆▇████████████▇▇▅▅▄▄▄▅▆▅▅▅▄▄▄▄▃▃▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
15.9 μs Histogram: frequency by time 158 μs <
Memory estimate: 2.14 KiB, allocs estimate: 25.
- 4 threads
- 100 iterations
- So divide the time by 25 (= 100 iterations / 4 threads)
- That gives around 1-2 μs latency
- But this is an unrealistic test: no work is done, so the threads are only waiting to access the Channel
- Dividing by 100 would be more realistic -> ~400 ns latency
thanks for this, I'll think about what a realistic test might look like.
coming back to this - there are three scenarios:
- expensive, possibly batched MLIPs: that kind of overhead would cost us nothing ...
- cheap MLIPs targeting 1-10 μs latency, especially for bio: unclear to me?
- ultra-cheap potentials in the 30-100 ns / atom range: for those this overhead sounds potentially problematic
To test this, why not try multi-threading the evaluation of an EAM or SW potential? Those are already in JuLIP. This is the worst-case scenario, and if that scales well then I think we are fine.
In the meantime I'll finish a prototype of the new ACE evaluator, and you can then try it out with that as well?
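As a self-contained toy of what such a worst-case test could look like (not JuLIP's actual API; `chain_energy` and the pair potential are made up for illustration): a very cheap pair sum over a 1D chain, threaded over chunks, with per-task accumulators drawn from a Channel, so the Channel overhead is exercised the way a real forces call would exercise it.

```julia
using Base.Threads

# Toy threaded energy of a 1D chain with a cheap LJ-like pair term.
function chain_energy(x::Vector{Float64}; nchunks = nthreads())
    # Pool of per-task accumulators, acquired/released via a Channel.
    accs = Channel{Base.RefValue{Float64}}(nchunks)
    for _ in 1:nchunks
        put!(accs, Ref(0.0))
    end
    npairs = length(x) - 1
    chunks = collect(Iterators.partition(1:npairs, cld(npairs, nchunks)))
    @threads for chunk in chunks
        acc = take!(accs)              # acquire a private accumulator
        for i in chunk
            r = x[i+1] - x[i]
            acc[] += 1/r^12 - 2/r^6    # toy pair potential
        end
        put!(accs, acc)                # release it
    end
    # Drain the Channel and combine the partial sums.
    E = 0.0
    for _ in 1:nchunks
        E += take!(accs)[]
    end
    return E
end
```

Benchmarking this against a serial loop, at various chain lengths, would show whether the per-chunk take!/put! cost matters at the 30-100 ns / atom scale.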