LoopVectorization.jl

`@avxt` harms the performance of `Base.Threads`

Open SkyWorld117 opened this issue 4 years ago • 18 comments

I have tried to replace @avx in my framework with @avxt, since it should have better performance. However, the neural network became unbelievably slow. So I did a small experiment with an FCNN, which requires almost nothing but matmul and matadd.

The structure of network:

model = Sequential()
model.add_layer(model, Dense; input_size=786, layer_size=512, activation_function=ReLU)
model.add_layer(model, Dense; layer_size=256, activation_function=ReLU)
model.add_layer(model, Dense; layer_size=128, activation_function=ReLU)
model.add_layer(model, Dense; layer_size=64, activation_function=ReLU)
model.add_layer(model, Dense; layer_size=10, activation_function=Softmax_CEL)

SGD.fit(model=model, input_data=flatten(train_x, 3), output_data=One_Hot(train_y, 10, dict),
        loss_function=Categorical_Cross_Entropy_Loss, monitor=Classification, epochs=10, batch=128)

The code using @avx or @avxt:

function activate_Dense(layer::Dense, input::Array{Float32})
    @avxt for x in axes(layer.weights, 1), y in axes(input, 2)
        c = 0.0f0
        for z in axes(layer.weights, 2)
            c += layer.weights[x,z]*input[z,y]
        end
        layer.value[x,y] = c+layer.biases[x]
    end
    # layer.value = layer.weights*input .+ layer.biases
    @time layer.output = layer.activation_function.func(layer.value)
end

function update_Dense(layer::Dense, optimizer::String, Last_Layer_output::Array{Float32}, Next_Layer_propagation_units::Array{Float32}, α::Float64, parameters::Tuple, direction::Int64=1)
    @time layer.activation_function.get_∇biases!(layer.∇biases, layer.value, Next_Layer_propagation_units)
    @avxt for x in axes(layer.weights, 2), y in axes(layer.∇biases, 2)
        c = 0.0f0
        for z in axes(layer.weights, 1)
            c += layer.weights[z,x]*layer.∇biases[z,y]
        end
        layer.propagation_units[x,y] = c
    end
    # layer.propagation_units = transpose(layer.weights)*∇biases

    @time if optimizer=="SGD"
        @avxt for x in axes(layer.∇biases, 1), y in axes(Last_Layer_output, 1)
            layer.weights[x,y] -= α*layer.∇biases[x,1]*Last_Layer_output[y,1]*direction
        end
        # layer.weights -= ∇biases*transpose(Last_Layer_output).*α
        @avxt for i in 1:length(layer.biases)
            layer.biases[i] -= α*layer.∇biases[i,1]*direction
        end
        # layer.biases -= sum(∇biases, dims=2).*α
        println()
    end
end

I measured the time of the parts that use multithreading, and the result is quite interesting. Time with @avxt:

# time for forward propagation
0.039304 seconds (44 allocations: 8.969 KiB)
0.039106 seconds (44 allocations: 7.984 KiB)
0.000073 seconds (44 allocations: 7.453 KiB)
0.000015 seconds (43 allocations: 7.141 KiB)
0.000021 seconds (88 allocations: 14.203 KiB)

# backpropagation of layer 5
0.000011 seconds (48 allocations: 7.656 KiB)

# backpropagation of layer 4
0.000028 seconds (41 allocations: 6.781 KiB)

# backpropagation of layer 3
0.038863 seconds (42 allocations: 6.812 KiB)

# backpropagation of layer 2
0.038591 seconds (43 allocations: 6.844 KiB)

# backpropagation of layer 1
0.038039 seconds (43 allocations: 6.844 KiB)

Time with @avx:

# time for forward propagation
0.000013 seconds (43 allocations: 8.938 KiB)
0.000012 seconds (43 allocations: 7.953 KiB)
0.000018 seconds (43 allocations: 7.422 KiB)
0.000011 seconds (43 allocations: 7.141 KiB)
0.000020 seconds (88 allocations: 14.203 KiB)

# backpropagation of layer 5
0.000012 seconds (48 allocations: 7.656 KiB)

# backpropagation of layer 4
0.000009 seconds (41 allocations: 6.781 KiB)

# backpropagation of layer 3
0.000009 seconds (41 allocations: 6.781 KiB)

# backpropagation of layer 2
0.000010 seconds (41 allocations: 6.781 KiB)

# backpropagation of layer 1
0.000010 seconds (41 allocations: 6.781 KiB)

It seems that @avxt is competing against Base's multithreading even when it is not called.

SkyWorld117 avatar Mar 24 '21 18:03 SkyWorld117

Which libraries is this using? How can I reproduce it?

Searching JuliaHub for Categorical_Cross_Entropy_Loss shows 0 hits. EDIT: https://github.com/SkyWorld117/YisyAIFramework.jl

I was going to say that it's unfortunately expected that using Base.@threads or Base.@spawn at the same time as @avxt will cause a massive slowdown or even a deadlock. I'm not quite sure what this means:

It seems that @avxt is competing against Base's multithreading even when it is not called.

but it sounds worrisome. Even without any @threads or @spawn, you get a massive slowdown from @avxt?

That issue aside, CheapThreads.batch should work. There's also a variant of batch that can reserve some number of threads for each parallel batched operation, but I haven't tested it yet. I guess I'll work on that tonight. This will let you, for example, run 4 jobs in parallel on a 16-core machine, with 4 threads allocated to each.

chriselrod avatar Mar 24 '21 22:03 chriselrod

But I'll also try something that should at least limit the slowdowns.

chriselrod avatar Mar 24 '21 23:03 chriselrod

Switching back and forth between using CheapThreads's threads (which LoopVectorization uses) and base threads is also expected to cause performance problems, not just nesting.
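
To make that concrete, here is a minimal sketch of the two patterns in question (the array names and sizes are made up for illustration; this is not code from the framework):

using LoopVectorization, Base.Threads

# Pattern 1: nesting. Each Base thread launches an @avxt kernel, so Base's
# scheduler and CheapThreads' scheduler compete for the same cores.
function nested!(ys, xs, W)
    @threads for k in eachindex(xs)
        x, y = xs[k], ys[k]
        @avxt for i in axes(W, 1)
            c = 0.0f0
            for j in axes(W, 2)
                c += W[i, j] * x[j]
            end
            y[i] = c
        end
    end
end

# Pattern 2: switching. Even without nesting, alternating between a
# Base-threaded region and an @avxt region in the same hot loop is also
# expected to hurt.
function switching!(ys, W)
    for iter in 1:100
        @threads for k in eachindex(ys)              # Base's thread pool
            ys[k] .= max.(ys[k], 0.0f0)
        end
        @avxt for i in axes(W, 1), j in axes(W, 2)   # CheapThreads' pool
            W[i, j] = 0.999f0 * W[i, j]
        end
    end
end

W  = rand(Float32, 128, 256)
xs = [rand(Float32, 256) for _ in 1:64]
ys = [zeros(Float32, 128) for _ in 1:64]
nested!(ys, xs, W)
switching!(ys, W)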

chriselrod avatar Mar 25 '21 00:03 chriselrod

but it sounds worrisome. Even without any @threads or @spawn, you get a massive slowdown from @avxt?

Sorry for the misunderstanding; I meant that I get a massive slowdown before @avxt is even executed.

Switching back and forth between using CheapThreads's threads (which LoopVectorization uses) and base threads is also expected to cause performance problems, not just nesting.

I think you've already interpreted it correctly. However, I did not test whether it slows down the program even without @threads or @spawn, since my code mostly uses either @avxt or @threads.

SkyWorld117 avatar Mar 25 '21 09:03 SkyWorld117

There's also a variant of batch that can reserve some number of threads for each parallel batched operation, but I haven't tested it yet. I guess I'll work on that tonight. This will let you, for example, run 4 jobs in parallel on a 16-core machine, with 4 threads allocated to each.

Sounds great. I am considering switching from @threads to @avxt, since the latest version seems to support ifelse.

SkyWorld117 avatar Mar 25 '21 09:03 SkyWorld117

I've added batch with the option to reserve threads, but if you're not nesting threads, the ordinary batch method in place of @threads would be fine. If you do try it, I'd be very interested in the resulting performance.

I should write an @batch macro that works like @threads.
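
For reference, such a macro would be a drop-in replacement for @threads; a minimal usage sketch, matching the syntax the reproducer later in this thread uses (the arrays here are just illustrative):

using CheapThreads

x = rand(Float32, 10_000)
y = similar(x)

# Threads.@threads for i in 1:length(y)  # Base version
@batch for i in 1:length(y)              # CheapThreads version, same loop syntax
    y[i] = 2f0 * x[i]
end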

chriselrod avatar Mar 25 '21 14:03 chriselrod

If you do try it, I'd be very interested in the resulting performance.

I've tested @batch and it works well. Since I changed a bit too much of the framework at once, I cannot tell how many times faster it has become.

However, I also noticed a possible bug. After about 400 loops, it just stops working, with CPU usage at around 98 percent. That seems to be a deadlock.

SkyWorld117 avatar Apr 10 '21 20:04 SkyWorld117

If you Ctrl+c and it doesn't crash Julia, you could

julia> using ThreadingUtilities

julia> ThreadingUtilities.TASKS

if one of them, say the third, says it failed

julia> ThreadingUtilities.TASKS[3]

should show the stack trace.

If code throws an error, it'll deadlock. That's a fairly common cause.

julia> ThreadingUtilities.reinitialize_tasks!()

julia> CheapThreads.reset_workers!()

should then be able to reset things while debugging issues.

If there is a bug in @batch or @avxt causing deadlocks, a reproducer would be helpful.

chriselrod avatar Apr 10 '21 20:04 chriselrod

If you Ctrl+c and it doesn't crash Julia, you could

julia> using ThreadingUtilities

julia> ThreadingUtilities.TASKS

Unfortunately, it crashes. I'll provide an example later.

SkyWorld117 avatar Apr 10 '21 20:04 SkyWorld117

using MLDatasets, LoopVectorization, CheapThreads

train_x, train_y = MNIST.traindata()
# Or this one if Lecun's server is down again...
#train_x, train_y = FashionMNIST.traindata()

mutable struct Conv2D
    strides
    output
    filters
    biases
end

mutable struct Dense
    weights
    biases
    output
end

mutable struct MaxPooling2D
    strides
    pool_size
    index
    output
end

mutable struct Flatten
    output_shape
    output
end

function activate_Conv2D(layer::Conv2D, input::Array{Float32})
    x, y = layer.strides
    @avx for i in axes(layer.output, 1), j in axes(layer.output, 2), f in axes(layer.output, 3), b in axes(layer.output, 4)
        s = 0.0f0
        for k₁ in axes(layer.filters, 3), k₂ in axes(layer.filters, 4), c in axes(layer.filters, 2)
            s += input[(i-1)*x+k₁, (j-1)*y+k₂, c, b] * layer.filters[f, c, k₁, k₂]
        end
        layer.output[i, j, f, b] = s + layer.biases[f]
    end
end

function activate_Dense(layer::Dense, input::Array{Float32})
    @avxt for x in axes(layer.weights, 1), y in axes(input, 2)
        c = 0.0f0
        for z in axes(layer.weights, 2)
            c += layer.weights[x,z]*input[z,y]
        end
        layer.output[x,y] = c+layer.biases[x]
    end
end

function activate_MaxPooling2D(layer::MaxPooling2D, input::Array{Float32})
    x, y = layer.strides
    @batch for i in axes(layer.output, 1), j in axes(layer.output, 2), c in axes(layer.output, 3), b in axes(layer.output, 4)
        s = -Inf32
        for p₁ in 1:layer.pool_size[1], p₂ in 1:layer.pool_size[2]
            if input[(i-1)*x+p₁, (j-1)*y+p₂, c, b]>=s
                s = input[(i-1)*x+p₁, (j-1)*y+p₂, c, b]
                layer.index[1, i, j, c, b] = (i-1)*x+p₁
                layer.index[2, i, j, c, b] = (j-1)*y+p₂
            end
        end
        layer.output[i, j, c, b] = s
    end
end

function activate_Flatten(layer::Flatten, input::Array{Float32})
    layer.output = Array{Float32}(reshape(input, (layer.output_shape..., size(input)[end])))
end

layer₁ = Conv2D((1,1), zeros(Float32, 26,26,16,1), rand(Float32, 16,1,3,3), rand(Float32, 16))
layer₂ = Conv2D((1,1), zeros(Float32, 24,24,32,1), rand(Float32, 32,16,3,3), rand(Float32, 32))
layer₃ = MaxPooling2D((2,2), (2,2), zeros(Int64, 2,12,12,32,1), zeros(Float32, 12,12,32,1))
layer₄ = Flatten((12*12*32,), nothing)
layer₅ = Dense(rand(Float32, 128,12*12*32), rand(Float32, 128), zeros(Float32, 128,1))
layer₆ = Dense(rand(Float32, 10,128), rand(Float32, 10), zeros(Float32, 10,1))

for i in 1:1000
    println("Loop ", i)
    current_input_data = Array{Float32}(reshape(selectdim(train_x, 3, rand(1:60000)), 28,28,1,1))
    activate_Conv2D(layer₁, current_input_data)
    activate_Conv2D(layer₂, layer₁.output)
    activate_MaxPooling2D(layer₃, layer₂.output)
    activate_Flatten(layer₄, layer₃.output)
    activate_Dense(layer₅, layer₄.output)
    activate_Dense(layer₆, layer₅.output)
end
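
For reference, the reproducer assumes Julia is started with multiple threads, for example (the file name here is just illustrative; the versioninfo later in this thread shows JULIA_NUM_THREADS = 8):

julia --threads=8 mwe.jl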

SkyWorld117 avatar Apr 11 '21 13:04 SkyWorld117

Just to be a broken record, you should avoid indices like this in general whenever you can, especially if x is likely to be 1:

(i-1)*x

SIMD means "Single Instruction, Multiple Data": a single CPU instruction operating on multiple data elements. LoopVectorization tries to use SIMD instructions to speed up loops. @simd/@fastmath will do this too, and often just @inbounds is enough; LoopVectorization does a few extra optimizations to get better performance than these. My point here is that using SIMD instructions is one of the best ways to speed up code, for the obvious reason that if you do 2, 4, 8, or 16x the work per instruction, you finish all the work you need to do 2, 4, 8, or 16x faster.

However, not all SIMD instructions are equivalent. Let's look at a really simple example and the assembly:

julia> using VectorizationBase

julia> x = rand(Float32,16);

julia> vload(stridedpointer(x), (MM(pick_vector_width(eltype(x)),1),))
Vec{16, Float32}<0.8831433f0, 0.1710602f0, 0.13737059f0, 0.87197435f0, 0.9179288f0, 0.6723312f0, 0.72404635f0, 0.74528587f0, 0.5618619f0, 0.6402745f0, 0.5597935f0, 0.5871539f0, 0.23035526f0, 0.33642197f0, 0.57830274f0, 0.5619949f0>

julia> @code_native debuginfo=:none syntax=:intel vload(stridedpointer(x), (MM(pick_vector_width(eltype(x)),1),))
        .text
        mov     rax, rdi
        mov     rcx, qword ptr [rdx]
        mov     rdx, qword ptr [rsi]
        vmovups zmm0, zmmword ptr [rdx + 4*rcx - 4]
        vmovaps zmmword ptr [rdi], zmm0
        vzeroupper
        ret
        nop

The only line that actually matters here -- every other line will be deleted by the compiler if we actually used this inside a function like activate_Conv2D -- is this:

vmovups zmm0, zmmword ptr [rdx + 4*rcx - 4]

rdx is a pointer to x, and rcx is the index (i). A Float32 is 4 bytes:

julia> sizeof(Float32)
4

which is why we multiply the index by 4 to calculate which address to load from. The -4 is there because we're using 1-based indexing, so it needs to subtract 1*sizeof(Float32) to get the offset. Finally, vmovups means it is moving memory from that address ([rdx + 4*rcx - 4]) into the register zmm0. A zmm register is 512 bits, so it can hold 16 x Float32, meaning this one instruction loads 16 numbers from memory that we can then use. (In normal code, zmm0 would then be used to do something, but in this example it gets stored to some other place in memory so the function can return it.)
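
The same offset arithmetic is visible from the Julia side with Base's pointer function (a small aside for illustration, not part of the original example):

julia> x = rand(Float32, 16);

julia> Int(UInt(pointer(x, 5)) - UInt(pointer(x, 1)))  # (5 - 1) * sizeof(Float32)
16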

Now let's say we multiply our index by 1:

julia> vload(stridedpointer(x), (MM(pick_vector_width(eltype(x)),1)*1,))
Vec{16, Float32}<0.8831433f0, 0.1710602f0, 0.13737059f0, 0.87197435f0, 0.9179288f0, 0.6723312f0, 0.72404635f0, 0.74528587f0, 0.5618619f0, 0.6402745f0, 0.5597935f0, 0.5871539f0, 0.23035526f0, 0.33642197f0, 0.57830274f0, 0.5619949f0>

julia> @code_native debuginfo=:none syntax=:intel vload(stridedpointer(x), (MM(pick_vector_width(eltype(x)),1)*1,))
        .text
        vpternlogd      zmm0, zmm0, zmm0, 255
        vpaddd  zmm0, zmm0, zmmword ptr [rdx]
        mov     rax, qword ptr [rsi]
        kxnorw  k1, k0, k0
        vgatherdps      zmm1 {k1}, zmmword ptr [rax + 4*zmm0]
        mov     rax, rdi
        vmovaps zmmword ptr [rdi], zmm1
        vzeroupper
        ret
        nop     dword ptr [rax + rax]

Note that we loaded the exact same 16 numbers as before, but now the assembly is different. Our index is now a vector, occupying a 512 bit zmm register (zmm0) so that our address is zmmword ptr [rax + 4*zmm0]. As we aren't loading one big chunk of memory, but gathering lots of individual indices, the instruction to load the memory is now vgatherdps instead of vmovups. To give you an idea of the performance impact:

julia> function multiload(x, i)
           p = stridedpointer(x)
           v1 = vload(p, (i,))
           v2 = vload(p, (i+16,))
           v3 = vload(p, (i+32,))
           v4 = vload(p, (i+48,))
           VecUnroll((v1,v2,v3,v4))
       end
multiload (generic function with 1 method)

julia> @btime multiload($x, MM(pick_vector_width(eltype($x)),1))
  1.649 ns (0 allocations: 0 bytes)
4 x Vec{16, Float32}
Vec{16, Float32}<0.4373597f0, 0.67004335f0, 0.07945454f0, 0.66608286f0, 0.8388102f0, 0.7189814f0, 0.53091013f0, 0.18715942f0, 0.9468211f0, 0.65182185f0, 0.14534855f0, 0.28283846f0, 0.92296314f0, 0.4218129f0, 0.873912f0, 0.43823516f0>
Vec{16, Float32}<0.90634537f0, 0.7699486f0, 0.48260927f0, 0.9546652f0, 0.7533071f0, 0.9449183f0, 0.764109f0, 0.6522809f0, 0.52228475f0, 0.8121561f0, 0.076630116f0, 0.6374614f0, 0.35767484f0, 0.015168905f0, 0.81803787f0, 0.07889664f0>
Vec{16, Float32}<0.09138608f0, 0.2104485f0, 0.29355538f0, 0.5822371f0, 0.11811316f0, 0.4566834f0, 0.4941249f0, 0.53982425f0, 0.69500875f0, 0.66140866f0, 0.7891264f0, 0.87181854f0, 0.8146242f0, 0.379063f0, 0.14777255f0, 0.8047527f0>
Vec{16, Float32}<0.60902274f0, 0.3732028f0, 0.058312535f0, 0.22658408f0, 0.70087886f0, 0.4336536f0, 0.1425432f0, 0.78658426f0, 0.8696759f0, 0.044864297f0, 0.0072788f0, 0.52247953f0, 0.47554064f0, 0.6806071f0, 0.83885777f0, 0.60636187f0>

julia> @btime multiload($x, MM(pick_vector_width(eltype($x)),1)*1)
  14.437 ns (0 allocations: 0 bytes)
4 x Vec{16, Float32}
Vec{16, Float32}<0.4373597f0, 0.67004335f0, 0.07945454f0, 0.66608286f0, 0.8388102f0, 0.7189814f0, 0.53091013f0, 0.18715942f0, 0.9468211f0, 0.65182185f0, 0.14534855f0, 0.28283846f0, 0.92296314f0, 0.4218129f0, 0.873912f0, 0.43823516f0>
Vec{16, Float32}<0.90634537f0, 0.7699486f0, 0.48260927f0, 0.9546652f0, 0.7533071f0, 0.9449183f0, 0.764109f0, 0.6522809f0, 0.52228475f0, 0.8121561f0, 0.076630116f0, 0.6374614f0, 0.35767484f0, 0.015168905f0, 0.81803787f0, 0.07889664f0>
Vec{16, Float32}<0.09138608f0, 0.2104485f0, 0.29355538f0, 0.5822371f0, 0.11811316f0, 0.4566834f0, 0.4941249f0, 0.53982425f0, 0.69500875f0, 0.66140866f0, 0.7891264f0, 0.87181854f0, 0.8146242f0, 0.379063f0, 0.14777255f0, 0.8047527f0>
Vec{16, Float32}<0.60902274f0, 0.3732028f0, 0.058312535f0, 0.22658408f0, 0.70087886f0, 0.4336536f0, 0.1425432f0, 0.78658426f0, 0.8696759f0, 0.044864297f0, 0.0072788f0, 0.52247953f0, 0.47554064f0, 0.6806071f0, 0.83885777f0, 0.60636187f0>

The innocuous-looking *1 is a performance killer, because gathers are much slower than the vmovup* family. Scatters, the storing equivalent of gathers, are even slower. And the computer I benchmarked on has relatively good gather/scatter performance compared to most.

So what I say here about avoiding *1 isn't only about convolutions.
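
One way to limit the damage (a sketch of the general idea, not code from this issue) is to branch on the stride outside the loop, so the common stride-1 case gets contiguous indexing:

using LoopVectorization

# Strided 1-D "convolution-like" reduction; names are illustrative.
function strided_reduce!(out::Vector{Float32}, input::Vector{Float32},
                         kernel::Vector{Float32}, x::Int)
    K = length(kernel)
    if x == 1
        # contiguous in i: loads along i become vmovups-style vector loads
        @avx for i in eachindex(out)
            s = 0.0f0
            for k in 1:K
                s += input[i + k - 1] * kernel[k]
            end
            out[i] = s
        end
    else
        # general stride: the (i-1)*x index forces gathers, but this branch
        # only runs when the stride really isn't 1
        @avx for i in eachindex(out)
            s = 0.0f0
            for k in 1:K
                s += input[(i - 1) * x + k] * kernel[k]
            end
            out[i] = s
        end
    end
    return out
end

The same effect could also be had at compile time by dispatching on a static stride type (e.g. Static.StaticInt).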

chriselrod avatar Apr 11 '21 22:04 chriselrod

Anyway, your example compiled and ran for me for all 1000 iterations. I then reran it, and this time it hung on iteration 731.

Loop 731

^C^C^C^C^C^CWARNING: Force throwing a SIGINT
^C^C^CERROR: ^C^CInterruptException:
Stacktrace:
 [1] _atomic_load
   @ ~/.julia/packages/ThreadingUtilities/F2ye8/src/atomics.jl:8 [inlined]
 [2] wait
   @ ~/.julia/packages/ThreadingUtilities/F2ye8/src/threadtasks.jl:62 [inlined]
 [3] wait
   @ ~/.julia/packages/ThreadingUtilities/F2ye8/src/threadtasks.jl:57 [inlined]
 [4] macro expansion
   @ ~/.julia/dev/CheapThreads/src/batch.jl:88 [inlined]
 [5] _batch_no_reserve
   @ ~/.julia/dev/CheapThreads/src/batch.jl:53 [inlined]
 [6] batch(::var"#1#2", ::Tuple{Int64, Int64}, ::Static.StaticInt{1}, ::Static.StaticInt{1}, ::MaxPooling2D, ::Float32, ::Array{Float32, 4}, ::Int64, ::Int64)
   @ CheapThreads ~/.julia/dev/CheapThreads/src/batch.jl:182
 [7] macro expansion
   @ ~/.julia/dev/CheapThreads/src/closure.jl:125 [inlined]
 [8] activate_MaxPooling2D(layer::MaxPooling2D, input::Array{Float32, 4})
   @ Main ./REPL[12]:3
 [9] top-level scope
   @ ./REPL[22]:6

So I can reproduce the hang, thanks. Thankfully no crash. I actually ran all the examples from the above post in the same Julia session after Ctrl+Cing the hang. None of the tasks failed, and all of them are apparently ready, waiting for their next job:

julia> for t ∈ 1:Threads.nthreads()-1
           state = ThreadingUtilities.load(ThreadingUtilities.taskpointer(t), ThreadingUtilities.ThreadState)
           @show state
       end
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT
state = ThreadingUtilities.WAIT

Note that the hang happened in ThreadingUtilities.wait, which is defined as:

@inline function wait(p::Ptr{UInt})
    # note: based on relative values (SPIN = 0, WAIT = 1)
    # thus it should spin for as long as the task is doing anything else
    while _atomic_load(p) > reinterpret(UInt, WAIT)
        pause()
    end
end

So if the state is WAIT, the while loop should stop repeating (because 1 > 1 is false) and the function should return. But apparently it got stuck here, even though every thread says WAIT. =/

chriselrod avatar Apr 11 '21 22:04 chriselrod

So what I say here about avoiding *1 isn't only about convolutions.

Alright, that is a really detailed explanation, thank you very much.

I actually noticed the performance difference during a test, but I still decided to keep them for now, because it is important to be able to control the 2-dimensional output size with strides, especially in GANs. Until I can figure out a better way to avoid that, I would rather have tunable strides with low speed than restricted strides with high speed.

An idea to avoid that is to use functions like view and reshape to create a dynamic connection (pointers?) between the input data and reshaped arrays, so I only have to run the inefficient dynamic indexing once, during initialization. Roughly something like the sketch below (illustrative names only): build the index map once, so the hot loop only does lookups.
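
# Precompute, for every output pixel of a pooling layer, the linear indices
# of its input window into the (H*W) plane. Done once at initialization.
function pooling_indices(insize::NTuple{2,Int}, pool::NTuple{2,Int}, strides::NTuple{2,Int})
    h = (insize[1] - pool[1]) ÷ strides[1] + 1
    w = (insize[2] - pool[2]) ÷ strides[2] + 1
    idx = Array{Int}(undef, pool[1] * pool[2], h, w)
    lin = LinearIndices(insize)
    for j in 1:w, i in 1:h, p₂ in 1:pool[2], p₁ in 1:pool[1]
        idx[(p₂ - 1) * pool[1] + p₁, i, j] =
            lin[(i - 1) * strides[1] + p₁, (j - 1) * strides[2] + p₂]
    end
    return idx
end

# In the hot loop, with the input reshaped to (H*W, channels, batch), the
# strided arithmetic is gone and only the precomputed lookup remains:
#     s = max(s, input[idx[p, i, j], c, b])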

So if the state is WAIT, the while loop should stop repeating (because 1 > 1 is false) and the function should return. But apparently it got stuck here, even though every thread says WAIT. =/

That's weird.

Anyway, your example compiled and ran for me for all 1000 iterations. I then reran it, and this time it hung on iteration 731.

Does it relate to the number of threads? I noticed that on my laptop it always gets stuck before finishing the loop, but that could also be random. Just for comparison, my CPU is an i7-1165G7 (4c8t).

SkyWorld117 avatar Apr 12 '21 09:04 SkyWorld117

That was on an 18 core CPU, but I also have a laptop with an i7-1165G7 so I'll try on that, too.

I'll hopefully have time to look at this next week.

chriselrod avatar Apr 28 '21 03:04 chriselrod

I thought it might be related to GC, but it turns out that is not the case.

I ran the program again on a laptop with 8c16t and it did not hang at all. I suppose there is somehow an upper bound on the number of iterations each thread can handle, and the fewer threads we use, the faster we hit that bound.

SkyWorld117 avatar May 24 '21 15:05 SkyWorld117

I also have an i7 1165G7:

julia> versioninfo()
Julia Version 1.7.0-DEV.1150
Commit a08a3ff1f9* (2021-05-22 21:10 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.0 (ORCJIT, tigerlake)
Environment:
  JULIA_NUM_THREADS = 8

Anyway, I still don't really know exactly why this was hanging. When it hung, the wait loop would keep checking the status of the other threads and find they hadn't finished yet, while those other threads were running at 100% CPU. Once I cancelled with Ctrl+C, the other threads would immediately stop running, having apparently completed successfully and changed their status to finished.

So, somehow, those other threads' state got set to running, and they did begin running, but without actually making progress, because they were somehow blocked by the status checks of the first thread? I don't really get it.

By letting the waiting thread eventually start yielding when checks fail, those other threads are for some reason no longer blocked (they shouldn't have been to begin with...), and can start executing.
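
In other words, the change amounts to something like this (a sketch of the idea, reusing the internals from the wait shown above; not the actual ThreadingUtilities code):

@inline function wait_with_yield(p::Ptr{UInt})
    spins = 0
    while _atomic_load(p) > reinterpret(UInt, WAIT)
        pause()
        spins += 1
        if spins > 2^15      # arbitrary threshold for the sketch
            yield()          # hand control back to Julia's scheduler
            spins = 0
        end
    end
end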

chriselrod avatar May 25 '21 03:05 chriselrod

Anyway, try upgrading to ThreadingUtilities version >= v"0.4.3". It works/no longer hangs for me now on that version (v"0.4.3" should be released within a few minutes of me making this comment).

chriselrod avatar May 25 '21 03:05 chriselrod

Anyways, thanks for the fix, it is indeed working now.

SkyWorld117 avatar May 25 '21 10:05 SkyWorld117