`cub::ThreadLoadAsync` and friends, abstractions for asynchronous data movement
This exposes Ampere's asynchronous copy mechanism (`cp.async`, a.k.a. LDGSTS), based on CUTLASS's implementation.
These primitives are useful for people writing their own kernels, but we could also potentially use them transparently in existing CUB block mechanisms, like `BlockLoad` and `BlockStore`.
Essentially, any time we have a repeated series of copies, we could use this. For example, this code:
```cpp
for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
{
    if ((linear_tid * ITEMS_PER_THREAD) + ITEM < valid_items)
    {
        items[ITEM] = block_itr[(linear_tid * ITEMS_PER_THREAD) + ITEM];
    }
}
```
The above code performs a series of copies which are not contiguous in memory. You couldn't replace the whole loop with a single memcpy; the destination is contiguous, but the source is not.
We don't have any compute work to overlap with the copies here, but it is still beneficial to replace them with asynchronous copies (@ogiroux and @griwes can explain why; briefly, for global-to-shared copies the asynchronous path avoids staging the data through the register file).
So that code could become:
```cpp
for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
{
    if ((linear_tid * ITEMS_PER_THREAD) + ITEM < valid_items)
    {
        cub::ThreadLoadAsync<cub::LOAD_DEFAULT>(
            items + ITEM,
            block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM);
    }
}
cub::ThreadLoadWait();
```
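For context, here is a rough sketch of the kind of lowering such an abstraction could perform on sm_80, written against CUDA 11's `__pipeline_memcpy_async` primitives from `<cuda_pipeline.h>` rather than the proposed CUB API. The kernel, the tile size, and the names are illustrative assumptions, not part of this PR.

```cpp
// Illustrative sketch only: an asynchronous global -> shared copy expressed
// with CUDA 11's pipeline primitives, which cp.async/LDGSTS backs on sm_80.
// cub::ThreadLoadAsync would hide this behind a ThreadLoad-style interface.
#include <cuda_pipeline.h>  // __pipeline_memcpy_async, __pipeline_commit, __pipeline_wait_prior

// Hypothetical kernel; launch with a block of up to 256 threads.
__global__ void async_copy_sketch(const int* gmem_in, int* gmem_out, int valid_items)
{
    __shared__ int smem[256];  // assumed tile size for illustration

    int tid = threadIdx.x;
    if (tid < valid_items)
    {
        // Issue the copy; data flows global -> shared without passing
        // through the register file. Size must be 4, 8, or 16 bytes.
        __pipeline_memcpy_async(smem + tid, gmem_in + tid, sizeof(int));
    }

    __pipeline_commit();      // close the current batch of async copies
    __pipeline_wait_prior(0); // wait until that batch has landed in smem
    __syncthreads();

    if (tid < valid_items)
    {
        gmem_out[tid] = smem[tid];  // consume the staged data
    }
}
```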
TODO:
- [ ] Add tests.
- [ ] Deploy this in `BlockLoad`, `BlockStore`, `BlockExchange`, and specific algorithms - basically anywhere there's a series of copies.
- [ ] Add peeling, widening, and remaindering, possibly unifying this with the vectorization machinery.
@brycelelbach If a pointer into a SMEM array is passed, will `cub::ThreadLoadAsync<cub::LOAD_DEFAULT>` be smart enough to convert the instructions to LDGSTS?
That's the idea.
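One plausible way to do that detection is sketched below using the `__isGlobal`/`__isShared` address-space predicate intrinsics, which are real CUDA device functions; the `thread_load_async_sketch` helper and its fallback behavior are assumptions for illustration, not the PR's actual implementation.

```cpp
// Hypothetical dispatch sketch: only a global -> shared copy can map to
// LDGSTS, so inspect the address spaces at run time (the compiler can often
// resolve __isShared/__isGlobal statically) and fall back to an ordinary
// synchronous copy otherwise.
#include <cuda_pipeline.h>

template <typename T>
__device__ void thread_load_async_sketch(T* dst, const T* src)
{
    // Assumes sizeof(T) is 4, 8, or 16, as __pipeline_memcpy_async requires.
    if (__isShared(dst) && __isGlobal(src))
    {
        // Eligible for cp.async/LDGSTS on sm_80.
        __pipeline_memcpy_async(dst, src, sizeof(T));
    }
    else
    {
        *dst = *src;  // fallback: plain synchronous copy
    }
}
```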
I'm not sure if we want to expose this in CUB. Closing for now.