cccl
cccl copied to clipboard
[FEA]: CUB large input support
As a lower-level interface, CUB should optimize for flexibility and performance. As a result, CUB will not guarantee a large input will work by default. However, it should enable users to specify their desired offset type.
This means CUB should not perform any dynamic dispatch based on the input size. Instead, users should have a way to statically specify the offset type. In previous discussion we favored making the type of num_items
a template and infer the offset type from the type of num_items
.
### Design-related research
- [ ] https://github.com/NVIDIA/cccl/issues/1408
- [ ] https://github.com/NVIDIA/cccl/issues/1454
### Testing large number of items
- [ ] https://github.com/NVIDIA/cccl/issues/2139
- [ ] https://github.com/NVIDIA/cccl/issues/2140
- [x] Add tests for large number of items in `DeviceScan::*ByKey`
### Enable large `num_items` in CUB algorithms that are not sensitive to the choice of `offset_t`
- [ ] https://github.com/NVIDIA/cccl/issues/2062
### Enable large `num_items` in CUB algorithms that are sensitive to the choice of `offset_t`
- [ ] https://github.com/NVIDIA/cccl/issues/1422
- [ ] https://github.com/NVIDIA/cccl/issues/1437
- [ ] https://github.com/NVIDIA/cccl/issues/2458
- [ ] https://github.com/NVIDIA/cccl/issues/2442
- [ ] Add support for large `num_items` to `device_run_length_encode.cuh`
- [ ] Add support for large `num_items` to `device_merge.cuh`
- [ ] Add support for large `num_items` to `device_transform.cuh`
- [ ] Add support for large `num_items` to `device_segmented_radix_sort.cuh`
- [ ] Add support for large `num_items` to `device_segmented_sort.cuh`
- [ ] Add support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}`
- [ ] Add support for large `num_items` to `DeviceSegmentedReduce::{ArgMin,ArgMax}`
- [ ] ~~Add support for large `num_items` to `device_spmv.cuh`~~
### Clean up interim testing infrastructure
- [x] Switch tests for large number of items to use `Device*` interface in `DeviceSelect`
- [x] Switch tests for large number of items to use `Device*` interface in `DeviceScan`
### Documentation
- [ ] We want to be explicit about supported and tested offset types in the `Dispatch` interface (see https://github.com/NVIDIA/cccl/pull/1817#discussion_r1670650317)
legend for offset type: ✅ considered done | 🟡 considered lower priority | 🟠 considered higher priority, as it prevents usage for larger-than-INT_MAX
number of items | ⏳ in progress
legend for testing columns: ✅ considered done | 🟡 to be done | 🟠 needs to support wider offset types first
algorithm | offset type | tests larger-thanINT_MAX |
tests close to [U]INT_MAX |
---|---|---|---|
device_adjacent_difference.cuh |
✅ choose_offset_t |
✅ 2^33, sanity check, iterators | 🟡 |
device_radix_sort.cuh |
✅ choose_offset_t |
✅ extensive check | ✅ extensive check |
device_reduce.cuh |
✅ choose_offset_t : Reduce , Sum , Min , Max , ReduceByKey , TransformReduce ⚠️ (note) int : ArgMin , ArgMax |
✅ sanity, 2^{30,31,33) | ✅ sanity, 2^32-1 |
device_scan.cuh |
✅ choose_offset_t: DeviceScan ✅ choose_offset_t: DeviceScanByKey |
✅ | ✅ |
device_select.cuh |
✅ choose_offset_t : UniqueByKey ✅ streaming (for any user-provided offset type): Flagged , If , |
✅ | ✅ |
device_partition.cuh |
✅ streaming (for larger-than-int user-provided offset types): Flagged , If ⏳ int : ThreeWayPartition |
✅ | ✅ |
device_copy.cuh |
🟡num_ranges : uint32_t 🟡 buffer sizes : iterator_traits<SizeIteratorT>::value_type |
🟡 | 🟡 |
device_memcpy.cuh |
🟡 num_ranges : uint32_t 🟡 buffer sizes : iterator_traits<SizeIteratorT>::value_type |
🟡 | 🟡 |
device_for.cuh |
🟡NumItemsT : ForEachN, ForEachCopyN, Bulk difference_type : ForEach, ForEachCopy |
🟡 | 🟡 |
device_histogram.cuh |
🟡 dynamic dispatch: int for (num_rows * row_stride_bytes)<INT_MAX; OffsetT otherwise |
🟡 | 🟡 |
device_merge_sort.cuh |
🟡 NumItemsT |
✅ extensive check | ✅ extensive check |
device_segmented_reduce.cuh |
🟡 common_iterator_value_t({Begin,End}OffsetIteratorT) : Reduce , Sum , Min , Max ⚠️ (note) int : ArgMin , ArgMax num_segments : int |
✅ sanity, rnd [2^31; 2^33] | 🟡 |
device_run_length_encode.cuh |
🟠 int |
🟠 | 🟠 |
device_segmented_sort.cuh |
🟠 num_items & num_segments :int |
🟠 | 🟠 |
device_segmented_radix_sort.cuh |
🟠 num_items & num_segments :int |
🟠 | 🟠 |
device_merge.cuh |
🟠 int |
🟡 | 🟡 |
device_transform.cuh |
🟠 int |
🟡 | 🟡 |
~~device_spmv.cuh ~~ |
🟠 ~~int ~~ (not planned) |
🟠 | 🟠 |