
Deadlock if fewer threads (<args.nthrds) started

Open mjaggi-cavium opened this issue 7 years ago • 12 comments

This is similar to an earlier issue I posted some time back.

After a run of about 20 minutes, a deadlock is observed when main() could not start all of the 'n' threads (args.nthrds). All cores on which threads were started are at 100%. The first child thread is waiting for ready_lock, while the others are waiting for sync_lock.

This behaviour is observed when the number of cores is 200+ (4 hardware threads per core).

I'm not sure why not all nthrds threads start; it could be an RT throttling issue. Comments or suggestions?

mjaggi-cavium avatar Jun 07 '18 07:06 mjaggi-cavium

Hi, does the issue still happen with the "-s" flag which disables the SCHED_FIFO setting?
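For context, roughly what the non-"-s" path amounts to is putting each worker thread into SCHED_FIFO; the snippet below is a generic sketch of that idea, not lockhammer's exact code.

```c
/* Generic sketch (not lockhammer's actual code): putting the calling thread
 * into SCHED_FIFO.  Under FIFO a spinning thread is never preempted by a
 * normal-priority thread on the same CPU, which is how the main thread can
 * get starved; "-s" simply leaves the threads as SCHED_OTHER. */
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

static int make_fifo(void)
{
    struct sched_param sp = { .sched_priority = 1 };    /* any valid RT priority */

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {  /* 0 = calling thread */
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}
```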

geoffreyblake avatar Jun 07 '18 20:06 geoffreyblake

Hi Manish, which workload did you run for this deadlock case? Thanks

zoybai avatar Jun 08 '18 16:06 zoybai

I am running ./runall.sh. The issue is not seen when running a single instance of a workload.

mjaggi-cavium avatar Jun 08 '18 17:06 mjaggi-cavium

"Does the issue still happen with the '-s' flag which disables the SCHED_FIFO setting?" No, the test completes without hanging.

mjaggi-cavium avatar Jun 11 '18 17:06 mjaggi-cavium

When running with -s, what sort of effective parallelism do you see? It should be close to the number of requested cores. If it's significantly lower, then the improvement may be due to -s mode not being able to recreate the high-contention case, and not directly a problem with FIFO mode itself.

lucasclucasdo avatar Jun 11 '18 17:06 lucasclucasdo

"It should be close to the number of requested cores." Yes.

With -s, I haven't seen the number of threads created fall below nthrds at any point, so the main thread is not starved.

mjaggi-cavium avatar Jun 11 '18 18:06 mjaggi-cavium

The number of threads created will be the same, but "effective parallelism" (output by the tool) tells you how many of those threads are actually running at the same time. So you could have 200 cores and 200 threads, but if each one runs to completion on one core before the next one starts, you can theoretically have an effective parallelism of only 1 even though thread creation equals the requested thread count.
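A rough way to picture the metric (an assumed formula, not necessarily the one the tool uses): total per-thread run time divided by wall-clock time of the measurement.

```c
/* Illustrative only: one plausible way to compute "effective parallelism",
 * not necessarily how the tool does it. */
double effective_parallelism(const double *thread_secs, int nthreads,
                             double wall_secs)
{
    double total = 0.0;
    for (int i = 0; i < nthreads; i++)
        total += thread_secs[i];
    /* 200 threads running strictly one after another -> ~1.0;
       all 200 running concurrently -> ~200.0 */
    return total / wall_secs;
}
```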

lucasclucasdo avatar Jun 11 '18 18:06 lucasclucasdo

AFAIK,

  • The main thread and child thread 0 always run on hw thread 0.
  • All child threads start running on hw thread 0, then each sets its appropriate affinity and later gets scheduled on its specific affined core (see the sketch after this list).
  • If all child threads run on hw thread 0 first under SCHED_FIFO, and child thread 0 always runs on hw thread 0, wouldn't there be a point at which the main thread is starved?
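A minimal sketch of that start-anywhere-then-pin pattern, assuming the pthread affinity API; the struct and function names are made up for illustration, not taken from lockhammer.

```c
/* Illustrative only: a child thread created "anywhere" pins itself to its
 * target core before entering the benchmark loop. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

struct child_arg {
    int target_cpu;      /* hypothetical per-thread argument */
};

static void *child_main(void *p)
{
    struct child_arg *arg = p;
    cpu_set_t set;
    int rc;

    CPU_ZERO(&set);
    CPU_SET(arg->target_cpu, &set);

    /* Move this thread from wherever it started onto its own core. */
    rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));

    /* ... signal "ready", spin for the start flag, run the workload ... */
    return NULL;
}
```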

mjaggi-cavium avatar Jun 11 '18 18:06 mjaggi-cavium

It's more likely that the child threads get starved, but the scheduler should be waking up cores to steal and run the child threads, since there will be balance problems otherwise (one core with two runnable FIFO processes and one core with nothing). One thing I've been thinking about trying is spawning a bunch of threads to make the balance issue look worse and cause the scheduler to step in sooner, and then affining threads to whichever unloaded core they end up on first (or exiting if the core they end up on already has a waiting lockhammer process).
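A purely hypothetical sketch of that over-spawn-and-claim idea; nothing like this exists in the repository, and all names here are made up.

```c
/* Hypothetical: each probe thread claims the CPU the scheduler first puts
 * it on; threads landing on an already-claimed CPU just exit. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_CPUS 512
static atomic_int cpu_claimed[MAX_CPUS];   /* 0 = free, 1 = claimed */

static void *probe_thread(void *unused)
{
    (void)unused;
    int cpu = sched_getcpu();              /* where did we end up? */
    int expected = 0;

    if (cpu < 0 || cpu >= MAX_CPUS ||
        !atomic_compare_exchange_strong(&cpu_claimed[cpu], &expected, 1))
        return NULL;                       /* core already taken: exit */

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* ... become a normal worker pinned to this core ... */
    return NULL;
}
```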

Anyway, that's not relevant to the question I'm asking, which is "does safe mode successfully achieve the requested contention level?" I'm guessing not, since FIFO mode was added in to avoid this exact problem in the first place, which is why I'm asking. In other words, safe mode might "solve" the issue you're seeing, but it probably does so by making the test a useless measure of performance in the high-core-count contention case (because it likely fails to achieve it). How does the "effective parallelism" metric compare to the requested thread count for the high thread counts where you were previously seeing the scheduling issue?

Edit: slight change, the main thread should be free to run anywhere, not just hw thread 0 (if that's not the case, it's a bug).

lucasclucasdo avatar Jun 11 '18 18:06 lucasclucasdo

I created a test branch which sched_yields the thread on core 0 if all child threads are not ready yet. Unfortunately I cannot replicate this issue on systems to which I have access so please try this branch and see if it helps:

https://github.com/codeauroraforum/synchronization-benchmarks/tree/lh-yieldwait

lucasclucasdo avatar Jun 11 '18 19:06 lucasclucasdo

Tried this, and also replaced the wait below:

/* Spin until the "marshal" sets the appropriate bit */
wait64_yield(&sync_lock, (nthrds * 2) | 1);

I think I missed one point: the affinity of the main thread is all cores, so wherever it gets rescheduled there is contention and not all threads will start. So I believe we need to put sched_yield in all the atomic functions.
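For reference, a minimal sketch of what a yielding 64-bit wait like wait64_yield could look like; this is a guess for illustration, and the real helper in the lh-yieldwait branch may differ.

```c
#include <sched.h>
#include <stdint.h>

static inline void wait64_yield(volatile uint64_t *addr, uint64_t expected)
{
    /* Instead of busy-spinning, give the CPU back on every iteration so a
     * starved thread (e.g. the main thread) can make progress. */
    while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected)
        sched_yield();
}
```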

mjaggi-cavium avatar Jun 15 '18 07:06 mjaggi-cavium

If we yield the other threads then we need to add another sync step without a yield to make sure everyone is actually both started and running. E.g., the current scheme is:

  1. Startup threads
  2. Wait for all threads to startup
  3. Threads are FIFO and unyielding so if they've reported started then they must be running still
  4. Send a start signal since we know threads are all started up (because they told us) and currently running (because they must be by definition)

If we yield the startup threads, it should be (a rough sketch in code follows this list):

  1. Startup threads
  2. Wait with yielding for all threads to startup
  3. Threads have all started up but may not currently be running, due to yielding while startup was ongoing
  4. Wait without yielding for all started up threads to get rescheduled and report back in
  5. Send a start signal since we've confirmed all threads are started up (because they told us) and currently running (because they also told us)
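A rough sketch of that two-phase handshake; the variable and function names are assumed for illustration, not lockhammer's actual ones.

```c
#include <stdatomic.h>
#include <sched.h>

static atomic_ulong ready_count;    /* threads that have finished setup        */
static atomic_ulong running_count;  /* threads confirmed running after phase 1 */
static atomic_int   go;             /* start signal from the marshal           */
static unsigned long nthrds;        /* number of worker threads                */

static void worker_startup(void)
{
    /* Phase 1: report started, then yield while the rest start up. */
    atomic_fetch_add(&ready_count, 1);
    while (atomic_load(&ready_count) < nthrds)
        sched_yield();

    /* Phase 2: report back in *without* yielding, which proves this thread
     * is on a CPU right now, then spin for the start signal. */
    atomic_fetch_add(&running_count, 1);
    while (!atomic_load(&go))
        ;   /* busy-spin */
}

static void marshal_release(void)
{
    /* Everyone has both started up (phase 1) and is currently running
     * (phase 2), so it is safe to fire the start signal. */
    while (atomic_load(&running_count) < nthrds)
        sched_yield();
    atomic_store(&go, 1);
}
```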

That said, I still think this is more of a scheduler balance problem: at high core counts, a single core with an extra runnable-but-not-running process (i.e., the main thread) doesn't look like too bad of an imbalance, so sleeping hardware threads are not woken up to execute the main software thread for a long time, in the hope that one of the many low-utilization hardware threads already running can take care of it in a short amount of time (but of course they can't, because they're all running FIFO threads that are busy spinning).

lucasclucasdo avatar Jun 15 '18 15:06 lucasclucasdo