Jake Hemstad

209 comments of Jake Hemstad

> If there's multiple choices for a kernel, the CUDA runtime seems to choose any qualifying kernel candidate "at random". Let me make sure I'm following what's going on here....

This piqued my curiosity and I went far down a rabbit hole. TL;DR: There is something extremely odd going on here that I don't understand and just making the kernel...

Yep, we ran into this in RMM a while back: https://github.com/rapidsai/rmm/issues/410 You might consider using `cudaMallocAsync` instead.
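For context, a minimal sketch of what switching to stream-ordered allocation looks like (requires CUDA 11.2+; the sizes and stream usage here are illustrative, not from the linked issue):

```cuda
#include <cuda_runtime.h>

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  void* ptr = nullptr;
  // Allocation is ordered on `stream` and drawn from the device's default
  // memory pool, avoiding the device-wide synchronization that
  // cudaMalloc/cudaFree can trigger.
  cudaMallocAsync(&ptr, 1 << 20, stream);

  // ... enqueue kernels using `ptr` on `stream` ...

  // The free is also stream-ordered; memory returns to the pool once
  // previously enqueued work on `stream` completes.
  cudaFreeAsync(ptr, stream);
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}
```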

> It is certainly true that adding int64_t instantiations increases compile time, and that they come with a non-trivial performance penalty. In pytorch land we are working around both these...

Couldn't the index type also be inferred from `std::iterator_traits::difference_type`? That might annoy existing users passing in raw pointers, since it would then default to 64-bit indices (`ptrdiff_t`). CUB could...

> Use unsigned offsets instead of signed -- since we'll be porting these algorithms incrementally, we can afford to spend some time fixing any issues that arise from the change...

> I'm all in for using existing building blocks. The problem is that I didn't assume the pointers to be aligned and so had to devise special treatment to be...

Huge +1 from me. I experimented with using Catch2 in [cuCollections](https://github.com/NVIDIA/cuCollections/tree/dev/tests) and I have loved it. I know CUB doesn't currently use GTest, but many of my favorite features...

> Have you encountered an issue related to Catch2 usage in `.cu` files? All the test files in cuCollections are `.cu` files: https://github.com/NVIDIA/cuCollections/tree/dev/tests There is one warning that's generated when...

Indeed, I believe the nvcc frontend has special handling for that attribute expansion. clang would need to emulate that "special" handling :slightly_smiling_face: