DeepEP

Some questions about IBGDA

Open Thunderbrook opened this issue 8 months ago • 2 comments

Hi, I have some questions about the following code in ibgda

void ibgda_submit_requests(nvshmemi_ibgda_device_qp_t *qp, uint64_t base_wqe_idx,
                           uint32_t num_wqes, int message_idx = 0) {
    nvshmemi_ibgda_device_qp_management_t *mvars = &qp->mvars;
    uint64_t new_wqe_idx = base_wqe_idx + num_wqes;

    // WQE writes must be finished first
    __threadfence();    // (1)

    // Wait for prior WQE slots to be filled first
    auto *ready_idx = reinterpret_cast<unsigned long long int*>(&mvars->tx_wq.ready_head);
    while (atomicCAS(ready_idx, base_wqe_idx, new_wqe_idx) != base_wqe_idx);     // (2)

    // Always post, not in batch
    constexpr int kNumRequestInBatch = 4;
    if (kAlwaysDoPostSend or (message_idx + 1) % kNumRequestInBatch == 0)
        ibgda_post_send(qp, new_wqe_idx);
}

(1) My understanding is that the __threadfence() here ensures that the writes to the WQE and the write to the DB do not become visible out of order. From the NIC's point of view, should __threadfence_system() be used instead?

(2) My understanding is that every thread executing atomicCAS has different compare/swap values, so is it necessary to use an "atomic" operation in this case?

Thanks

Thunderbrook avatar Apr 08 '25 05:04 Thunderbrook

The ibgda_submit_requests code here is simplified from NVSHMEM, and I'm also somewhat unclear about the synchronization semantics needed when GPUs and NICs interact with each other. However, NVSHMEM implements it this way, and I believe that as an internal NVIDIA team, they have more insight into these details and can ensure the safety of this approach. The following is just my personal understanding:

(1) If we consider the NIC as another GPU device, the threadfence here is indeed not strong enough to guarantee that the written WQE is visible to other GPUs. However, I think there might be special mechanisms when the NIC reads WQEs, such as always bypassing the cache, in which case threadfence would be sufficient.

(2) I think you are right: the atomic operation is not necessary here. We can try removing it.
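
To make point (2) concrete, here is a minimal host-side sketch in standard C++, not the DeepEP or NVSHMEM code. It compares the compare-and-swap loop from the issue with a plain "wait for my turn, then store" variant. `SendQueue`, `submit_cas`, `submit_load_store`, and `run` are hypothetical names made up for this illustration, and `std::atomic` on a CPU does not model GPU memory scopes or how a NIC reads WQEs, so it says nothing about whether __threadfence() is strong enough in (1).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for the queue state; not the NVSHMEM/DeepEP types.
struct SendQueue {
    std::vector<uint64_t> slots;          // plays the role of the WQE buffer
    std::atomic<uint64_t> ready_head{0};  // plays the role of tx_wq.ready_head
    explicit SendQueue(size_t n) : slots(n, 0) {}
};

// Variant A: mirrors the loop in the issue. Retry the compare-and-swap until
// ready_head equals this producer's base index, then advance it.
void submit_cas(SendQueue& q, uint64_t base, uint64_t num) {
    // Slot writes must be ordered before the index update.
    std::atomic_thread_fence(std::memory_order_release);
    uint64_t expected = base;
    while (!q.ready_head.compare_exchange_weak(expected, base + num,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire)) {
        expected = base;  // a failed exchange overwrites `expected`; reset it
        std::this_thread::yield();
    }
}

// Variant B: no read-modify-write. Only the producer that owns `base` can
// leave the wait loop, so there is a single writer at any moment and a plain
// (still atomic-typed) store is enough.
void submit_load_store(SendQueue& q, uint64_t base, uint64_t num) {
    std::atomic_thread_fence(std::memory_order_release);
    while (q.ready_head.load(std::memory_order_acquire) != base) {
        std::this_thread::yield();
    }
    q.ready_head.store(base + num, std::memory_order_release);
}

// Each thread fills its own range of slots, then publishes it in order.
// Returns the final value of ready_head.
template <typename Submit>
uint64_t run(Submit submit, int num_threads, uint64_t per_thread) {
    SendQueue q(num_threads * per_thread);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&q, submit, t, per_thread] {
            uint64_t base = t * per_thread;
            for (uint64_t i = 0; i < per_thread; ++i) {
                q.slots[base + i] = base + i + 1;  // "write the WQE"
            }
            submit(q, base, per_thread);
        });
    }
    for (auto& w : workers) w.join();
    uint64_t head = q.ready_head.load(std::memory_order_acquire);
    // Every slot below the published head must already hold its value.
    for (uint64_t i = 0; i < head; ++i) assert(q.slots[i] == i + 1);
    return head;
}
```

Calling `run(submit_cas, 4, 8)` or `run(submit_load_store, 4, 8)` exercises each variant with four producers. Note that variant B still needs the load and the store to be individually well-defined and ordered with respect to the slot writes; what it drops is only the read-modify-write. Whether the equivalent change is safe in the real kernel would need to be checked against the GPU memory model.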

sphish avatar Apr 08 '25 06:04 sphish

Got it, thanks very much for the reply!

Thunderbrook avatar Apr 08 '25 07:04 Thunderbrook