oneTBB icon indicating copy to clipboard operation
oneTBB copied to clipboard

Stall with high number of threads and CPU load on task_arena::enqueue

Open ggris opened this issue 3 years ago • 5 comments

Hi, From time to time, we experience stalls on task_arena::enqueue on a thread intensive application. I spent some time diagnosing the issue.

It seems to come from how task_stream::push and task_stream::pop work. Essentially, they cycle through all lanes trying to find one which is not locked and not empty.

During my stalls, I observe that:

  • The main thread is not running
  • Some of the worker threads are spinning through the lane list trying to pop a task without success

Since the non empty state of the queue is accessible through the atomic population variable before the lock is released, when all queues are empty and a thread is pushing a new item, when the atomic is updated but the lock not yet released, the pop behaves like a spin_lock. My guess is that it results in the typical spin_lock issue where the thread holding the lock is not running because the system scheduler is running computation intensive spinner threads instead.

This scenario sounds unlikely but over time we are enqueuing many times and it's often when the lanes are empty. Also, it matches perfectly my debugging and VTune profiling of the issue.

ggris avatar Aug 24 '21 14:08 ggris

Good catch! Thank you for the investigation.

Can you try the workaround? Try to add backoff strategy to the pop method (task_stream.h:196-199):

-        do {
-            lane = next_lane( /*out_of=*/N );
-            __TBB_ASSERT( lane < N, "Incorrect lane index." );
-        } while( !empty() && !(popped = try_pop( lane )) );
+        for (atomic_backoff b; !empty() && !popped; b.pause()) {
+            lane = next_lane( /*out_of=*/N);
+            __TBB_ASSERT(lane < N, "Incorrect lane index.");
+            popped = try_pop(lane);
+        }

alexey-katranov avatar Aug 24 '21 15:08 alexey-katranov

I'll try that now. I tried to add a simple counter to break out of the loop after a certain number of iteration and it fixes the issue. I'll be back quickly with the backoff change.

ggris avatar Aug 24 '21 15:08 ggris

Yep, adding the backoff fixes the issue too

ggris avatar Aug 24 '21 15:08 ggris

@alexey-katranov, does it make sense to make PR with these changes ?

anton-potapov avatar Mar 04 '22 06:03 anton-potapov

It seems yes.

alexey-katranov avatar Mar 04 '22 06:03 alexey-katranov