gonzalobg issues

Results: 26 issues of gonzalobg

Soundness bugfix for barrier<thread_scope_block> on sm_70

For sm_70, barrier arrive has an optimization to "coalesce" all arrives with the same update to the same barrier into a single update performed by a "leader" thread. This optimization...
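The idea of coalescing arrives can be illustrated on the host with a plain atomic counter. This is only a sketch under stated assumptions, not the libcudacxx implementation: the helper names `arrive_individually` and `arrive_coalesced` are hypothetical, the barrier state is reduced to a single pending-arrival counter, and how the group of threads and its leader are chosen on the GPU is not modeled at all.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical baseline: each of `group_size` arrivals performs its own
// decrement on the barrier's pending-arrival counter.
// Returns the number of arrivals still pending afterwards.
inline int arrive_individually(std::atomic<int>& pending, int group_size) {
    assert(group_size > 0);
    int remaining = pending.load(std::memory_order_relaxed);
    for (int i = 0; i < group_size; ++i) {
        remaining = pending.fetch_sub(1, std::memory_order_acq_rel) - 1;
    }
    return remaining;
}

// Hypothetical coalesced variant: a single "leader" applies the combined
// update for the whole group with one atomic operation.
// Returns the number of arrivals still pending afterwards.
inline int arrive_coalesced(std::atomic<int>& pending, int group_size) {
    assert(group_size > 0);
    return pending.fetch_sub(group_size, std::memory_order_acq_rel) - group_size;
}
```

Both helpers leave the counter in the same final state; the coalesced variant just issues one atomic update instead of `group_size` of them. Any soundness concern with such an optimization lies in how the group is formed and synchronized, which this sketch deliberately leaves out.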

Remove membar generation from the code generator

The atomic code generation still generates membars. We should stop doing that and generate fence.sc instead.

Support `cuda::std::barrier<thread_scope_system>` on IPC

Supporting `cuda::std::barrier` on IPC would allow applications synchronizing threads across multiple processes / programs to re-use this barrier. Interaction of threads from multiple programs is out-of-scope of C++, so this...

Add CodeGen tests

We should find a way to leverage FileCheck to perform CodeGen tests to verify the PTX that is actually generated by certain operations from `atomic`, `barrier`, etc.

Missing atomic_signal_fence

See https://en.cppreference.com/w/cpp/atomic/atomic_signal_fence. An API like this, lowering to `fence syncscope("singlethread") seq_cst` in NVVM-IR, would be useful for implementing RMA (remote memory access), e.g. in the context of NVSHMEM.
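For reference, this is roughly how the standard host-side `std::atomic_signal_fence` is used. It is a minimal sketch in standard C++ only: the `sketch` namespace and the names `payload`, `ready`, `publish`, and `consume_or` are hypothetical, and nothing here is an existing libcudacxx device API.

```cpp
#include <atomic>
#include <cassert>

namespace sketch {

int payload = 0;                  // plain data published to a handler
std::atomic<bool> ready{false};   // flag checked by the handler

// Writes the payload, then sets the flag. The signal fence only constrains
// compiler reordering; it orders these accesses with respect to a signal
// handler running on the same thread, not with respect to other threads.
inline void publish(int value) {
    assert(!ready.load(std::memory_order_relaxed)); // sketch publishes once
    payload = value;
    std::atomic_signal_fence(std::memory_order_seq_cst);
    ready.store(true, std::memory_order_relaxed);
}

// What a handler on the same thread might do: read the flag, then the data.
inline int consume_or(int fallback) {
    if (!ready.load(std::memory_order_relaxed)) {
        return fallback;
    }
    std::atomic_signal_fence(std::memory_order_seq_cst);
    return payload;
}

} // namespace sketch
```

Calling `sketch::publish(42)` and then `sketch::consume_or(-1)` on the same thread yields the published value, while `consume_or` returns the fallback if nothing has been published yet.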

Selectors are not used in barrier_init.pass test

See https://github.com/NVIDIA/libcudacxx/blob/main/.upstream-tests/test/cuda/barrier_init.pass.cpp#L35-L38. The barrier is always stored on the stack, independently of the selector used.

Document interaction of asynchronous operations with thread scopes

Closes #137.

P2: nice to have

only: docs

Document why combining pipeline<thread_scope_thread>/barrier<thread_scope_thread> with memcpy_async is legal

According to the thread scope documentation:

> A thread scope specifies the kind of threads that can synchronize with each other using a primitive such as an atomic or a...

cuda::barrier<thread_scope_thread> uses exponential backoff

Example (https://cuda.godbolt.org/z/W541eW): ```c++ #include #include #include #include #include using T = float4; __global__ void test_pipe_intr(T* input, T volatile* out) { __shared__ T smem[32]; __pipeline_memcpy_async(smem + threadIdx.x, input + threadIdx.x, sizeof(T));...

Add constructor from mdspan to mdarray

It would be really nice to have a conversion from mdspan to an mdarray with suitable extents and layouts. For one dimensional arrays and spans, layout_left vs layout_right should not...