gonzalobg

Results 26 issues of gonzalobg

For sm_70, barrier arrive has an optimization to "coalesce" all arrives with the same update to the same barrier into a single update performed by a "leader" thread. This optimization...

The atomic code generation still generates membars. We should stop doing that and generate fence.sc instead.
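As a rough host-side illustration of what a sequentially consistent fence guarantees, the store-buffering litmus test below uses `std::atomic_thread_fence(std::memory_order_seq_cst)` from standard C++. It is only a sketch: the function names `store_buffering_round` and `count_forbidden` are hypothetical, and it does not show the PTX that libcu++ emits.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs one store-buffering round and reports whether
// the outcome forbidden by seq_cst fences (both loads observe 0) occurred.
bool store_buffering_round() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    });

    t1.join();
    t2.join();
    // With a seq_cst fence between the store and the load in both threads,
    // at least one thread must observe the other's store.
    return r1 == 0 && r2 == 0;
}

// Hypothetical helper: counts forbidden outcomes over several rounds.
int count_forbidden(int rounds) {
    int forbidden = 0;
    for (int i = 0; i < rounds; ++i) {
        if (store_buffering_round()) ++forbidden;
    }
    return forbidden;
}
```

Calling `count_forbidden(200)` should always return 0. Weakening either fence to `std::memory_order_acq_rel` removes that guarantee.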

Supporting `cuda::std::barrier` over IPC would let applications that synchronize threads across multiple processes or programs reuse this barrier. Interaction between threads from multiple programs is out of scope for C++, so this...

We should find a way to leverage FileCheck to perform CodeGen tests to verify the PTX that is actually generated by certain operations from `atomic`, `barrier`, etc.

See https://en.cppreference.com/w/cpp/atomic/atomic_signal_fence. An API like this, lowering to `fence syncscope("singlethread") seq_cst` in NVVM-IR, would be useful for implementing RMA (e.g. in the context of NVSHMEM).
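For reference, a minimal host-side sketch of the standard C++ API the link describes. `std::atomic_signal_fence` orders operations only between a thread and a signal handler running on that same thread, so it constrains compiler reordering and typically emits no hardware fence instruction. The names `payload`, `ready`, `publish`, and `consume` are hypothetical, and the same-thread call to `consume` stands in for a signal handler.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared state between a thread and a handler on that thread.
int payload = 0;
std::atomic<bool> ready{false};

// Writes the payload, then publishes it. The signal fence keeps the
// compiler from moving the payload write below the flag store.
void publish(int value) {
    payload = value;
    std::atomic_signal_fence(std::memory_order_seq_cst);
    ready.store(true, std::memory_order_relaxed);
}

// Stand-in for a signal handler on the same thread: returns the payload
// if it has been published, or -1 otherwise.
int consume() {
    if (!ready.load(std::memory_order_relaxed)) {
        return -1;
    }
    // Keeps the compiler from hoisting the payload read above the flag load.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    return payload;
}
```

Before `publish` is called, `consume()` returns -1. After `publish(42)` it returns 42.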

See https://github.com/NVIDIA/libcudacxx/blob/main/.upstream-tests/test/cuda/barrier_init.pass.cpp#L35-L38. The barrier is always stored on the stack, regardless of the selector used.

According to the thread scope documentation:

> A thread scope specifies the kind of threads that can synchronize with each other using a primitive such as an atomic or a...

Example (https://cuda.godbolt.org/z/W541eW):

```c++
#include
#include
#include
#include
#include

using T = float4;

__global__ void test_pipe_intr(T* input, T volatile* out) {
    __shared__ T smem[32];
    __pipeline_memcpy_async(smem + threadIdx.x, input + threadIdx.x, sizeof(T));
    ...
```

It would be really nice to have a conversion from mdspan to an mdarray with suitable extents and layouts. For one dimensional arrays and spans, layout_left vs layout_right should not...