[BUG]: semaphores are not fair
Is this a duplicate?
- [X] I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct
Type of Bug
Something else
Component
libcu++
Describe the bug
libcu++ semaphores are not fair, causing the following reproducer to hang (https://cuda.godbolt.org/z/7vGxhaocj):
#include <cuda.h>
#include <cstdio>
#include <cuda/semaphore>

struct semaphore_lock {
  cuda::binary_semaphore<cuda::thread_scope_block> s{1};
  __device__ void lock() {
    s.acquire();
  }
  __device__ void unlock() {
    s.release();
  }
};

__device__ semaphore_lock l{};
__device__ int mask = 0;

__global__ void reproducer() {
  l.lock();
  bool cont = false;
  do {
    l.unlock();
    // A thread with threadIdx.x == 0 adds nothing, so it keeps looping until
    // it observes another thread's contribution to mask, which that thread
    // can only make after acquiring the semaphore itself.
    cont = atomicAdd_block(&mask, threadIdx.x) == 0;
    l.lock();
  } while (cont);
  l.unlock();
}
The standard does not require them to be fair, but that's pretty low QoI, particularly on our platform, where high latency differences between threads accessing different memories can penalize some threads' CAS attempts forever.
How to Reproduce
.
Expected behavior
The expected behavior is for the above program to never hang.
We could achieve this by making the semaphores fair, such that the threads acquire the semaphore in FIFO order.
This can be implemented by adding a ticket, and having the threads take a ticket (atomic add) and only acquire the semaphore if there are resources available and it is their turn.
This would turn the lock implemented on top of the semaphore into a ticket lock, which will never hang.
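For illustration, a minimal sketch of that ticket idea for the binary case, written directly against cuda::atomic rather than the existing semaphore internals (the type and member names here are hypothetical, not the proposed libcu++ change):

#include <cuda/atomic>

// Hypothetical block-scoped ticket lock: threads are served strictly in the
// order in which they take a ticket, so a thread that just released cannot
// barge ahead of an earlier waiter and starve it.
struct ticket_lock {
  cuda::atomic<unsigned, cuda::thread_scope_block> next_ticket{0};
  cuda::atomic<unsigned, cuda::thread_scope_block> now_serving{0};

  __device__ void lock() {
    // Taking a ticket fixes this thread's position in the FIFO order.
    unsigned const my_ticket = next_ticket.fetch_add(1, cuda::std::memory_order_relaxed);
    // Wait until it is this thread's turn.
    while (now_serving.load(cuda::std::memory_order_acquire) != my_ticket) {
      // spin
    }
  }

  __device__ void unlock() {
    // Pass the lock to the next ticket holder, if any.
    now_serving.fetch_add(1, cuda::std::memory_order_release);
  }
};

A counting version would additionally track how many releases have happened and let a waiter proceed once its ticket falls within the released count plus the initial resource count, which is the "resources available and it is their turn" condition above.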
Reproduction link
No response
Operating System
No response
nvidia-smi output
No response
NVCC version
No response
I believe that might also be the reason why the semaphore-based mutex implementation is potentially hanging in the tests.
However, I am a bit terrified that we would essentially have to double the size of a semaphore from 64 bits to 128 bits, because we would need another int / size_t for the ticket.
What are the ABI guarantees again?
Making the semaphores fair will be an ABI-breaking change even if we keep the size and alignment the same, because we would at least be changing the semantics of the bits within the layout, and the value returned by the `::max()` API.
@griwes knows the ABI guarantees we need to uphold; we could plan the change for whenever the ABI can be broken.
> However, I am a bit terrified that we would essentially have to double the size of a semaphore from 64 bits to 128 bits, because we would need another int / size_t for the ticket.
Huge disclaimer, I have never looked at libcu++'s semaphore implementation, and I see there are the following memory layouts:
- `__atomic_semaphore_base<Sco, LMV>` has a single `atomic<ptrdiff_t>` count (here)
- `__atomic_semaphore_base<Sco, 1>` also (here)
- `__sem_semaphore_base` has a couple of counts. I haven't attempted to understand it yet, but maybe it's already close to what we need. It's only picked if `_LIBCUDACXX_USE_NATIVE_SEMAPHORES` is defined, which is never? (here)
I'm not familiar with any of these yet.
If the storage for the count needs to be at least large enough to hold `LeastMaxValue` (I don't know if this is the case, just a hypothesis), we could maybe even shrink the current implementation from 64 bits down to 32, 16, or 8 bits depending on `LeastMaxValue`. Making it fair may require doubling that, but in practice we could maybe fit all practical use cases within a single 64-bit or 96-bit wide type (a 32-bit ticket supports 4 billion threads blocked on the `acquire`, which may be enough for the next few years).
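As a sketch of the size question only (not of any existing libcu++ layout, and with purely hypothetical names), a 32-bit count and a 32-bit ticket can share a single 64-bit atomic word:

#include <cuda/atomic>
#include <cuda/std/cstdint>

// Hypothetical packed state for a fair semaphore: the low 32 bits hold the
// available count and the high 32 bits hold the ticket counter, so the whole
// state still fits in one 64-bit atomic.
struct packed_fair_sem_state {
  cuda::atomic<cuda::std::uint64_t, cuda::thread_scope_device> state{0};

  __device__ static cuda::std::uint32_t count_of(cuda::std::uint64_t s) {
    return static_cast<cuda::std::uint32_t>(s);
  }
  __device__ static cuda::std::uint32_t ticket_of(cuda::std::uint64_t s) {
    return static_cast<cuda::std::uint32_t>(s >> 32);
  }
  __device__ static cuda::std::uint64_t pack(cuda::std::uint32_t count, cuda::std::uint32_t ticket) {
    return (static_cast<cuda::std::uint64_t>(ticket) << 32) | count;
  }
};

An acquire path could then use a single compare-exchange on state to update the count and the ticket together, at the cost of capping both fields at 32 bits.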
Sorry for barging into this thread; I've just been looking for ways of doing a mutex when I stumbled upon it. Could you explain why that piece of code would be expected to not hang if the semaphore was fair? I can't reason out why that would be the case.
Also, why make a `__device__` variable and then do `atomicAdd_block` on it? That would certainly result in data races.
> Sorry for barging into this thread; I've just been looking for ways of doing a mutex when I stumbled upon it. Could you explain why that piece of code would be expected to not hang if the semaphore was fair? I can't reason out why that would be the case.
> Also, why make a `__device__` variable and then do `atomicAdd_block` on it? That would certainly result in data races.
Why? If you're referring to different blocks accessing the mask, then you can just pretend that it says `atomicAdd`. I wouldn't study the example too deeply though.
> Sorry for barging into this thread; I've just been looking for ways of doing a mutex when I stumbled upon it. Could you explain why that piece of code would be expected to not hang if the semaphore was fair? I can't reason out why that would be the case.
I believe that it comes from the first thread always being the "faster" one when it comes to acquiring the lock. The ticket lock approach suggested here ensures that the second thread has a chance at continuing despite having slept for some duration.
Ah, I think I understand where I went wrong in my reasoning. I thought the exit condition was for mask to be 0, but that is actually the continuation condition. So, unless the semaphore was fair, the first thread would keep adding 0 to the mask, never exiting.
Right. This program has a live-lock.
The GPU thread scheduler is starvation-free; that is, it guarantees that if a thread enters a critical section, it will eventually get to run again and exit it. But it is possible to build synchronization primitives that are not starvation-free (fair) on top of a starvation-free thread scheduler.
With an unfair semaphore, the following execution is valid, and introduces a live-lock:
1. Thread 0 runs and acquires the semaphore.
2. Thread 1 runs, tries to acquire the semaphore, and fails.
3. Thread 0 completes its critical section (guaranteed by the thread scheduler), releases the semaphore, and acquires it again.
4. Go back to step 2 (live-lock).
Notice that the thread scheduler is starvation-free, because it always eventually schedules Thread 0 to complete its critical section. However, this program only terminates if the semaphore is also starvation-free.
A starvation-free semaphore prevents the live-lock by not allowing Thread 0 to immediately re-acquire the semaphore if another thread was already waiting on it. That is, with a fair semaphore, Thread 0 cannot acquire the semaphore in step 3. Thread 1 eventually runs again, and its attempt to acquire the semaphore succeeds. The starvation-free scheduler then guarantees that Thread 1 will eventually complete its critical section, and since Thread 0 is waiting on the semaphore, Thread 1 cannot re-acquire it immediately. Eventually Thread 0 re-acquires the semaphore, and the program completes.
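To make the barging concrete: an unfair acquire is essentially a compare-exchange retry loop like the hypothetical sketch below, and nothing in it stops the thread that just released from winning again before a waiter is rescheduled (which is exactly step 3):

#include <cuda/atomic>
#include <cuda/std/cstddef>

// Hypothetical unfair binary-semaphore acquire: whichever thread's CAS lands
// first wins, regardless of how long other threads have been waiting, so a
// thread that just released can immediately re-acquire.
__device__ void unfair_acquire(
    cuda::atomic<cuda::std::ptrdiff_t, cuda::thread_scope_block>& count) {
  for (;;) {
    cuda::std::ptrdiff_t expected = 1;
    if (count.compare_exchange_weak(expected, 0,
                                    cuda::std::memory_order_acquire,
                                    cuda::std::memory_order_relaxed)) {
      return;  // acquired, possibly barging past an older waiter
    }
  }
}

A fair variant would take a ticket before this loop and only attempt the exchange once its ticket is being served, which is what rules out the immediate re-acquisition in step 3.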
Hope this helps.
@miscco @gonzalobg
Question: let's say we have 5 threads. The semaphore is currently being used, so all calls to `acquire` will block.
Do we reward the first thread to get `try_acquire` to return true by making it first in line, or should it be the first one who called `acquire`? The latter is easier to implement, but if semantically the former is correct, it needs to be done differently.
So if `acquire` gets called in this order: 2, 1, 3, 4, will that be our queue? Because it is possible for `try_acquire` to return true for thread 4, for example, before the other threads. In that sense, thread 4 should be first in line.