Jake Hemstad

209 comments by Jake Hemstad

For reference, in our work on [switching to Catch2](https://github.com/NVIDIA/cub/pull/365), we have found that the [`metal::cartesian`](https://github.com/brunocodutra/metal/blob/e90187ffb37a602f3036750dd00a94042a226000/example/src/list.cpp#L548-L568) implementation in [`metal`](https://github.com/brunocodutra/metal/) is excellent and has much better compile time performance than...
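For readers unfamiliar with the idea, a cartesian product over type lists can be sketched in plain C++ as below. This is not how `metal::cartesian` is implemented and makes no claim about its compile-time cost; `type_list`, `concat`, `prepend_all`, and `cartesian_product` are hypothetical names used only for illustration.

```cpp
#include <type_traits>

// Hypothetical minimal type list; not part of metal or CUB.
template <class... Ts>
struct type_list {};

// Concatenate any number of type lists into one.
template <class... Lists>
struct concat;

template <>
struct concat<> { using type = type_list<>; };

template <class... Ts>
struct concat<type_list<Ts...>> { using type = type_list<Ts...>; };

template <class... As, class... Bs, class... Rest>
struct concat<type_list<As...>, type_list<Bs...>, Rest...>
    : concat<type_list<As..., Bs...>, Rest...> {};

// Prepend T to every inner list of a list of lists.
template <class T, class ListOfLists>
struct prepend_all;

template <class T, class... Lists>
struct prepend_all<T, type_list<Lists...>> {
  using type = type_list<typename concat<type_list<T>, Lists>::type...>;
};

// Cartesian product: every combination that picks one type from each list.
template <class... Lists>
struct cartesian_product;

// Base case: the product of zero lists is a single empty combination.
template <>
struct cartesian_product<> { using type = type_list<type_list<>>; };

template <class... Ts, class... Rest>
struct cartesian_product<type_list<Ts...>, Rest...> {
  using tail = typename cartesian_product<Rest...>::type;
  using type = typename concat<typename prepend_all<Ts, tail>::type...>::type;
};

// Usage: two lists of two types each give four combinations.
using combos = cartesian_product<type_list<int, float>, type_list<char, bool>>::type;
static_assert(
    std::is_same<combos,
                 type_list<type_list<int, char>, type_list<int, bool>,
                           type_list<float, char>, type_list<float, bool>>>::value,
    "expected all four (first, second) combinations in order");
```

A test framework would typically instantiate one test case per inner list. The recursive approach here is simple to read but is likely one of the slower ways to do this at compile time.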

Can we defer any significant refactoring discussion to future work/PR? I don't want to delay @youyu3's PR any more than necessary.

Hey @youyu3, sorry for the last-minute change, but based on some internal conversation, I think we can drop the `experimental` namespace for `mdspan`.

@senior-zero could you create a new issue to describe migrating to libcu++ atomics and then close this issue in favor of that one?

Double freeing a pointer is undefined behavior, which includes the behavior seen here.
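As a general illustration (not taken from the issue being discussed), one common way to rule out a double free in host C++ code is to give the allocation a single owner. The sketch below uses `std::unique_ptr` with a hypothetical counting deleter, `counting_delete`, purely to make the "freed exactly once" property observable.

```cpp
#include <memory>
#include <utility>

// Hypothetical deleter that counts how many times it runs; for illustration only.
struct counting_delete {
  int* count;
  void operator()(int* p) const {
    ++*count;
    delete p;
  }
};

// Transfers ownership between two smart pointers and reports how many
// times the underlying allocation was actually freed.
inline int frees_after_ownership_transfer() {
  int frees = 0;
  {
    std::unique_ptr<int, counting_delete> a(new int(42), counting_delete{&frees});
    std::unique_ptr<int, counting_delete> b = std::move(a);  // a is now empty
    // Both a and b are destroyed at the end of this scope,
    // but only b still owns the allocation.
  }
  return frees;
}
```

With raw pointers, the equivalent code would call `delete` on the same address twice, and anything could happen, including appearing to work.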

I like `BatchMemcpy`.

> Use different template types for the begin/in/output size iterators

Done.

> What does output_start_sizes represent?

That was a mistake.

> ideally the outer dimension could be...

> all of the alignments would need to be specified as template parameters. This would be quite a burden, and would require a unique template instantiation of the entire algorithm...

> Here we have iterators that can be dereferenced on the device only.

Actually, when I first envisioned this API, I was thinking the size iterator would be host accessible....

See if https://github.com/NVIDIA/cub/pull/357 resolves your issue.

Doesn't `cudaLaunchKernel` have the exact same problem? It can return error codes from previous async calls:
> Note that this function may also return error codes from previous, asynchronous launches.