Jake Hemstad comments

Results: 209 comments by Jake Hemstad

Support for more generic Cartesian product of type lists in type-parameterized tests

For reference, in our work [switching to Catch2](https://github.com/NVIDIA/cub/pull/365), we have found that the [`metal::cartesian`](https://github.com/brunocodutra/metal/blob/e90187ffb37a602f3036750dd00a94042a226000/example/src/list.cpp#L548-L568) implementation in [`metal`](https://github.com/brunocodutra/metal/) is excellent and has much better compile-time performance than...
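To make the request concrete, here is a minimal C++17 sketch of a Cartesian product over type lists. It is a naive recursive version written only for illustration: `type_list`, `concat`, `prepend_each`, and `cartesian` are hypothetical names, this is not how `metal::cartesian` is implemented, and it makes no claim about compile-time performance.

```cpp
#include <type_traits>

// Hypothetical minimal type list; not taken from metal or Catch2.
template <typename... Ts>
struct type_list {};

// concat<L1, L2, ...>::type joins any number of type_lists into one.
template <typename... Lists>
struct concat;

template <>
struct concat<> { using type = type_list<>; };

template <typename... Ts>
struct concat<type_list<Ts...>> { using type = type_list<Ts...>; };

template <typename... Ts, typename... Us, typename... Rest>
struct concat<type_list<Ts...>, type_list<Us...>, Rest...>
    : concat<type_list<Ts..., Us...>, Rest...> {};

// prepend_each<T, list-of-tuples>::type puts T at the front of every tuple.
template <typename T, typename ListOfTuples>
struct prepend_each;

template <typename T, typename... Tuples>
struct prepend_each<T, type_list<Tuples...>> {
  using type = type_list<typename concat<type_list<T>, Tuples>::type...>;
};

// cartesian<L1, L2, ...>::type is a type_list of type_lists, one per combination.
template <typename... Lists>
struct cartesian;

// Base case: the product of zero lists is a single empty tuple.
template <>
struct cartesian<> { using type = type_list<type_list<>>; };

template <typename... Ts, typename... Rest>
struct cartesian<type_list<Ts...>, Rest...> {
  using tail = typename cartesian<Rest...>::type;
  // For each T in the first list, prepend it to every tuple of the tail product.
  using type = typename concat<typename prepend_each<Ts, tail>::type...>::type;
};

using product = cartesian<type_list<int, char>, type_list<float, double>>::type;

static_assert(
    std::is_same<product,
                 type_list<type_list<int, float>, type_list<int, double>,
                           type_list<char, float>, type_list<char, double>>>::value,
    "the product enumerates every (first, second) combination in order");
```

Each inner `type_list` could then serve as the parameter set for one instantiation of a type-parameterized test.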

Adding `mdspan` reference implementation

Can we defer any significant refactoring discussion to future work/PR? I don't want to delay @youyu3's PR any more than necessary.

Adding `mdspan` reference implementation

Hey @youyu3, sorry for the last-minute change, but based on some internal conversation, I think we can drop the `experimental` namespace for `mdspan`.

Incorrect semantics in grid_barrier.cuh

@senior-zero could you create a new issue to describe migrating to libcu++ atomics and then close this issue in favor of that one?

CachingDeviceAllocator::DeviceFree problematic handling of non-live block pointer

Double freeing a pointer is undefined behavior, which includes the behavior seen here.

[FEA] Multi-buffer copy algorithm

I like `BatchMemcpy`.

> Use different template types for the begin/in/output size iterators

Done.

> What does output_start_sizes represent?

That was a mistake.

> ideally the outer dimension could be...
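As a rough illustration of the "different template types" point, the sketch below is a sequential host-side reference in which the input-buffer, output-buffer, and size iterators each get their own template parameter. The name `batch_memcpy_reference` and its signature are assumptions made for this example; it is not the CUB API proposed in the issue, and it copies element by element rather than launching any device work.

```cpp
#include <cstddef>

// Hypothetical host-side reference (an assumption for illustration, not CUB's API).
// Each iterator kind has its own template parameter, so callers can mix, for
// example, pointer-to-const inputs, mutable outputs, and a narrower size type.
template <typename InputBufferIt, typename OutputBufferIt, typename BufferSizeIt>
constexpr void batch_memcpy_reference(InputBufferIt in_buffers,
                                      OutputBufferIt out_buffers,
                                      BufferSizeIt buffer_sizes,
                                      std::size_t num_buffers)
{
  for (std::size_t i = 0; i < num_buffers; ++i) {
    auto src = in_buffers[i];   // start of the i-th source buffer
    auto dst = out_buffers[i];  // start of the i-th destination buffer
    const std::size_t n = static_cast<std::size_t>(buffer_sizes[i]);
    for (std::size_t j = 0; j < n; ++j) {
      dst[j] = src[j];
    }
  }
}

// Small compile-time check that uses three distinct iterator types.
constexpr bool demo_batch_copy()
{
  const unsigned char a[3] = {1, 2, 3};
  const unsigned char b[2] = {4, 5};
  unsigned char out_a[3] = {};
  unsigned char out_b[2] = {};

  const unsigned char* ins[2] = {a, b};     // decays to const unsigned char**
  unsigned char* outs[2] = {out_a, out_b};  // decays to unsigned char**
  const unsigned short sizes[2] = {3, 2};   // decays to const unsigned short*

  batch_memcpy_reference(ins, outs, sizes, 2);

  return out_a[0] == 1 && out_a[1] == 2 && out_a[2] == 3 &&
         out_b[0] == 4 && out_b[1] == 5;
}

static_assert(demo_batch_copy(), "every buffer is copied to its destination");
```

With a single shared iterator type, the call in `demo_batch_copy` would not compile, because the three arguments deduce to three different pointer types.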

[FEA] Multi-buffer copy algorithm

> all of the alignments would need to be specified as template parameters. This would be quite a burden, and would require a unique template instantiation of the entire algorithm...

[FEA] Multi-buffer copy algorithm

> Here we have iterators that can be dereferenced on the device only.

Actually, when I first envisioned this API, I was thinking the size iterator would be host accessible....

Optimize cub::DeviceSegmented[Radix]Sort for small number of segments

See if https://github.com/NVIDIA/cub/pull/357 resolves your issue.

[RFE] Use cudaLaunchKernel instead of <<<>>>

Doesn't `cudaLaunchKernel` have the exact same problem? It can return error codes from previous async calls:

> Note that this function may also return error codes from previous, asynchronous launches.