Georgii Evtushenko
An issue was reported regarding the internal accumulator type in `cub::DeviceScan::InclusiveSum`: the input data type is used as the accumulator type. Here's the reproducer: ```cuda #include #include int...
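
To illustrate the accumulator concern described above, here is a minimal host-only C++ sketch. It is not the CUB implementation and not the elided reproducer; the helper name `inclusive_sum` and its template parameters are hypothetical, introduced only for illustration. It shows how an inclusive sum can wrap around when the running total is kept in the (narrow) input type, and how a wider accumulator avoids that.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a CUB API): inclusive prefix sum whose running
// total is stored in AccumT. Choosing AccumT equal to the input type mimics
// "input type as accumulator type"; a wider AccumT avoids the wrap-around.
template <typename AccumT, typename InputT>
std::vector<AccumT> inclusive_sum(const std::vector<InputT>& in)
{
  std::vector<AccumT> out;
  out.reserve(in.size());

  AccumT acc{}; // running total, value-initialized to zero
  for (const InputT& x : in)
  {
    // The addition happens after the usual arithmetic conversions; the
    // result is then narrowed back to AccumT, which is where an unsigned
    // 8-bit accumulator wraps modulo 256.
    acc = static_cast<AccumT>(acc + x);
    out.push_back(acc);
  }
  return out;
}
```

For example, summing four `std::uint8_t` values of 100 with `inclusive_sum<std::uint8_t>` wraps once the total exceeds 255, while `inclusive_sum<int>` on the same input yields the full running totals.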
1. There is no difference in performance and compilation time for the reduce with simple operators. On complex operators (256 sqrt calls), the compilation time is up to 2.4 times...
I've tried to request proper addresses from MSI tech support, to no avail. ``` EC dump 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D...
This PR adds visualization to the comparison script. Here's an example for: ```cpp // ... .add_int64_power_of_two_axis("elements", nvbench::range(12, 28, 4)) .add_int64_axis("ratio", {0, 10}); ``` When the script is executed, one can specify...
Recent results show that noise can increase by up to 50% when an X server is running on the device. To warn users about the noisy environment, we could check if GPU...
The issue is related to the following [one](https://github.com/NVIDIA/cub/issues/545). It was recently addressed in [CUB](https://github.com/NVIDIA/cub/pull/547). Thrust has to inherit the namespace solution for the CUDA backend.
Before the port to CUB, the Thrust implementation of merge sort did not have a `*copy` version. When introducing the `Copy` overload, I followed the CUB generic scheme of selecting output iterator value...
This PR addresses the following [issue](https://github.com/NVIDIA/cccl/issues/902) by replacing `__launch_bounds__` usages with `CUB_DETAIL_LAUNCH_BOUNDS`. `CUB_DETAIL_LAUNCH_BOUNDS` leads to `__launch_bounds__` usage only when RDC is **not** specified. Builds without RDC are not affected by...