JuLIP.jl
Redesign Multi-threading
cc @tjjarvinen
I had a request from Noam Bernstein to allow setting the number of threads in each call to forces, etc.
Adding comments on the implementation
1
From the Julia documentation:
"The value of threadid() may change even within a single iteration."
So, using it to get the address of temporary data is a possible source of bugs!
The correct way to handle temporary data is to use Channels: create a Channel that holds the temporary data instances, pull an instance from the Channel at the beginning of each iteration, and put the instance back into the Channel once the task is done.
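A minimal sketch of that pattern (buffer size, loop length, and names are illustrative): a Channel holds pre-allocated buffers, each iteration takes one, uses it, and returns it, so no two concurrently running iterations ever share a buffer, regardless of what threadid() returns.

```julia
using Base.Threads

# Pre-allocate one buffer per thread and park them in a Channel.
nbuf = Threads.nthreads()
buffers = Channel{Vector{Float64}}(nbuf)
for _ in 1:nbuf
    put!(buffers, zeros(10))   # allocation happens only here, once
end

results = zeros(100)
Threads.@threads for i in 1:100
    tmp = take!(buffers)       # acquire a private buffer
    tmp .= i                   # ... do work using tmp ...
    results[i] = sum(tmp)
    put!(buffers, tmp)         # release it for reuse
end
```

Because take!/put! are thread safe, this stays correct even if a task migrates between threads mid-iteration.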
2
Setting the number of threads for each call to forces needs a special implementation. FoldsThreads.jl has a TaskPoolEx executor that can do it.
This means using either Folds.jl or FLoops.jl. The advantage here is that several different executors are available, each with different strengths, which opens up several potential use cases. For example, we could use DistributedEx to distribute the force calculation.
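A hedged sketch of what that could look like, with Folds.sum standing in for a forces-style reduction. I'm assuming here that TaskPoolEx accepts an `ntasks` keyword as described in the FoldsThreads documentation; `pair_energy` is a hypothetical name, and this is not runnable without those packages.

```julia
using Folds, FoldsThreads

# Sketch only: the executor, not the algorithm, carries the threading
# policy, so the caller can choose the number of tasks per call.
# Assumption: TaskPoolEx takes an `ntasks` keyword argument.
function pair_energy(rs; ntasks = Threads.nthreads())
    ex = TaskPoolEx(ntasks = ntasks)          # per-call task count
    return Folds.sum(r -> 0.5 * r^2, rs, ex)  # toy stand-in reduction
end
```

Swapping `ex` for DistributedEx (or any other executor) would change the execution strategy without touching the algorithm.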
On 2 - this sounds great, I'm looking forward to learning more.
On 1 - this is a bit cryptic to me, so let me check that I understand it correctly: the problem is that a task could be paused in the middle of executing the body of a for-loop, and the same thread could then start a new iteration of the loop body, so two iterations end up writing into the same temporary arrays at the same time.
Correct? This would indeed be entirely unexpected for me and, yes, it could cause plenty of bugs. But I don't like the consequence, which seems to be that I need far more temporary arrays than threads. And we will end up allocating a lot, no?
Are the Channels you are referring to the same as what I call ObjectPools? Or is this something else entirely?
I now also think that, if I understood you correctly in 1, my temporary arrays in ObjectPools.jl have the same bug.
threadid() was not an issue with earlier Julia versions. It became an issue when the default scheduling was changed from static to dynamic.
Basically, what can happen is:
- thread id=1 starts a task
- at the beginning the task gets threadid() == 1
- the task execution is handed to another thread (kernel scheduling, some IO wait time, etc.)
- thread id=2 continues the task
- thread id=1 is given another task
- the other task also gets threadid() == 1, and now two tasks access the same data
Another problem is that the behaviour can depend on the Julia version, so there is no relying on what threadid() returns.
Channels
Channels are a Julia construct for communicating between tasks. The documentation is in the asynchronous-programming section, but Channels are completely thread safe and can be used with threads. There is also a RemoteChannel for distributed computing.
Here is an example of how to use Channels:
julia> c = Channel(4)
Channel{Any}(4) (empty)
julia> for i in 1:4
           put!(c, i)
       end

julia> Threads.@threads for i in 1:8
           t = take!(c)
           println(i, " ", Threads.threadid(), ",", t)
           put!(c, t)
           sleep(rand())
       end
1 3,3
3 2,2
7 4,4
5 1,1
8 4,3
2 3,2
6 3,4
4 2,1
So, allocations are done only once, and only as many buffers as there are (worker) threads need to be allocated.
In fact, now that I think about it, we should probably replace ObjectPools.jl with Channels too. It would be more efficient.
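A sketch of what such a Channel-backed replacement could look like (`BufferPool` and `withbuffer` are hypothetical names, not an existing ObjectPools.jl API): the pool pre-allocates its buffers once, and the helper guarantees the buffer is returned even if the work throws.

```julia
# Channel-backed pool of reusable buffers (hypothetical sketch).
struct BufferPool{T}
    ch::Channel{T}
end

# Build a pool of n buffers produced by the zero-argument `make`.
function BufferPool(make, n::Integer)
    ch = Channel{typeof(make())}(n)
    for _ in 1:n
        put!(ch, make())       # all allocation happens here, once
    end
    return BufferPool(ch)
end

# Acquire a buffer, run f on it, and always return it to the pool.
function withbuffer(f, pool::BufferPool)
    buf = take!(pool.ch)
    try
        return f(buf)
    finally
        put!(pool.ch, buf)     # released even if f throws
    end
end
```

Usage would look like `withbuffer(b -> compute!(b), pool)`, with the try/finally making the acquire/release pairing impossible to forget.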
Thank you for the explanation.
Since when has this dynamic scheduling started? And is there a way to avoid it? I'll need to send around a warning to people who are using multi-threading with ACE / JuLIP.
EDIT: I see that I simply need to tag new versions with the :static keyword inserted. Do you agree this will fix things temporarily?
Yes, it will work as a temporary fix.
Since when has this dynamic scheduling started?
I think it was either v1.6 or v1.7 that introduced dynamic load balancing.
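For reference, the temporary fix mentioned above could look like the following sketch (names illustrative): with the :static schedule, iterations are pinned to threads, so threadid()-indexed temporaries are safe again.

```julia
using Base.Threads

# One temporary buffer per thread, indexed by threadid().
tmps = [zeros(3) for _ in 1:nthreads()]
out = zeros(8)

# :static pins each iteration to one thread for its whole duration,
# so indexing temporaries by threadid() is safe here (and only here).
@threads :static for i in 1:8
    tmp = tmps[threadid()]
    tmp .= i
    out[i] = sum(tmp)
end
```

With the default dynamic schedule the same threadid()-indexing would be exactly the bug discussed above.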
I tested the Channel overhead
julia> using BenchmarkTools
julia> N = 100
100
julia> c = Channel(N)
Channel{Any}(100) (empty)
julia> for i in 1:N; put!(c, rand(3)) end
julia> b = @benchmarkable take!(c) samples=N
Benchmark(evals=1, seconds=5.0, samples=100)
julia> run(b)
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 41.000 ns … 1.070 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 83.000 ns ┊ GC (median): 0.00%
Time (mean ± σ): 10.798 μs ± 107.003 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█
█▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
41 ns Histogram: log(frequency) by time 3.5 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> p = @benchmarkable put!(c, rand(3)) samples=N
Benchmark(evals=1, seconds=5.0, samples=100)
julia> run(p)
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 83.000 ns … 14.083 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 84.000 ns ┊ GC (median): 0.00%
Time (mean ± σ): 275.830 ns ± 1.441 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█
█▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
83 ns Histogram: log(frequency) by time 3.71 μs <
Memory estimate: 80 bytes, allocs estimate: 1.
To me this sounds like an acceptable overhead.
Adding a threaded version to above
julia> function f(c, n)
           Threads.@threads for i in 1:n
               t = take!(c)
               put!(c, t)
           end
       end
f (generic function with 1 method)
julia> @benchmark f(c,N)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 15.875 μs … 1.495 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 47.417 μs ┊ GC (median): 0.00%
Time (mean ± σ): 56.133 μs ± 45.093 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▅▇▅▄▅▇█▅▄▃▂
▁▃▆▆▇████████████▇▇▅▅▄▄▄▅▆▅▅▅▄▄▄▄▃▃▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
15.9 μs Histogram: frequency by time 158 μs <
Memory estimate: 2.14 KiB, allocs estimate: 25.
- 4 threads
- 100 iterations
- So divide the time by 25 (= 100 iterations / 4 threads)
- That gives around 1-2 μs latency
- But this is an unrealistic test: no work is done, so the threads are only waiting to access the Channel
- Dividing by 100 would be more realistic -> ~400 ns latency
thanks for this, I'll think about what a realistic test might look like.
coming back to this - there are three scenarios:
- expensive, possibly batched MLIPs: that kind of overhead would cost us nothing ...
- cheap MLIPs targeting 1-10 μs latency, especially for bio: unclear to me?
- ultra-cheap potentials in the 30-100 ns / atom range: for those this overhead sounds potentially problematic
To test this, why not try multi-threading the evaluation of an EAM or SW potential? Those are already in JuLIP. This is the worst-case scenario, and if that scales well then I think we are fine.
In the meantime I'll finish a prototype of the new ACE evaluator, and you can then try it out with that as well?
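As a self-contained toy of what such a worst-case test could look like (not JuLIP's actual API; `chain_energy` and the pair potential are made up for illustration): a very cheap pair sum over a 1D chain, threaded over chunks, with per-task accumulators drawn from a Channel, so the Channel overhead is exercised the way a real forces call would exercise it.

```julia
using Base.Threads

# Toy threaded energy of a 1D chain with a cheap LJ-like pair term.
function chain_energy(x::Vector{Float64}; nchunks = nthreads())
    # Pool of per-task accumulators, acquired/released via a Channel.
    accs = Channel{Base.RefValue{Float64}}(nchunks)
    for _ in 1:nchunks
        put!(accs, Ref(0.0))
    end
    npairs = length(x) - 1
    chunks = collect(Iterators.partition(1:npairs, cld(npairs, nchunks)))
    @threads for chunk in chunks
        acc = take!(accs)              # acquire a private accumulator
        for i in chunk
            r = x[i+1] - x[i]
            acc[] += 1/r^12 - 2/r^6    # toy pair potential
        end
        put!(accs, acc)                # release it
    end
    # Drain the Channel and combine the partial sums.
    E = 0.0
    for _ in 1:nchunks
        E += take!(accs)[]
    end
    return E
end
```

Benchmarking this against a serial loop, at various chain lengths, would show whether the per-chunk take!/put! cost matters at the 30-100 ns / atom scale.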