cccl icon indicating copy to clipboard operation
cccl copied to clipboard

[FEA]: CUB large input support

Open jrhemstad opened this issue 1 year ago • 1 comments

As a lower-level interface, CUB should optimize for flexibility and performance. As a result, CUB will not guarantee a large input will work by default. However, it should enable users to specify their desired offset type.

This means CUB should not perform any dynamic dispatch based on the input size. Instead, users should have a way to statically specify the offset type. In previous discussion we favored making the type of num_items a template and infer the offset type from the type of num_items.

### Design-related research
- [ ] https://github.com/NVIDIA/cccl/issues/1408
- [ ] https://github.com/NVIDIA/cccl/issues/1454
### Testing large number of items
- [ ] https://github.com/NVIDIA/cccl/issues/2139
- [ ] https://github.com/NVIDIA/cccl/issues/2140
- [x] Add tests for large number of items in `DeviceScan::*ByKey`
### Enable large `num_items` in CUB algorithms that are not sensitive to the choice of `offset_t`
- [ ] https://github.com/NVIDIA/cccl/issues/2062
### Enable large `num_items` in CUB algorithms that are sensitive to the choice of `offset_t`
- [ ] https://github.com/NVIDIA/cccl/issues/1422
- [ ] https://github.com/NVIDIA/cccl/issues/1437
- [ ] https://github.com/NVIDIA/cccl/issues/2458
- [ ] https://github.com/NVIDIA/cccl/issues/2442
- [ ] Add support for large `num_items` to `device_run_length_encode.cuh`
- [ ] Add support for large `num_items` to `device_merge.cuh`
- [ ] Add support for large `num_items` to `device_transform.cuh`
- [ ] Add support for large `num_items` to `device_segmented_radix_sort.cuh`
- [ ] Add support for large `num_items` to `device_segmented_sort.cuh`
- [ ] Add support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}`
- [ ] Add support for large `num_items` to `DeviceSegmentedReduce::{ArgMin,ArgMax}`
- [ ] ~~Add support for large `num_items` to `device_spmv.cuh`~~
### Clean up interim testing infrastructure
- [x] Switch tests for large number of items to use `Device*` interface in `DeviceSelect`
- [x] Switch tests for large number of items to use `Device*` interface in `DeviceScan`
### Documentation
- [ ] We want to be explicit about supported and tested offset types in the `Dispatch` interface (see https://github.com/NVIDIA/cccl/pull/1817#discussion_r1670650317)

jrhemstad avatar Apr 21 '23 17:04 jrhemstad

legend for offset type: ✅ considered done | 🟡 considered lower priority | 🟠 considered higher priority, as it prevents usage for larger-than-INT_MAX number of items | ⏳ in progress

legend for testing columns: ✅ considered done | 🟡 to be done | 🟠 needs to support wider offset types first

algorithm offset type tests larger-thanINT_MAX tests close to [U]INT_MAX
device_adjacent_difference.cuh choose_offset_t ✅ 2^33, sanity check, iterators 🟡
device_radix_sort.cuh choose_offset_t ✅ extensive check ✅ extensive check
device_reduce.cuh choose_offset_t: Reduce, Sum, Min, Max, ReduceByKey, TransformReduce
⚠️ (note) int: ArgMin, ArgMax
✅ sanity, 2^{30,31,33) ✅ sanity, 2^32-1
device_scan.cuh ✅ choose_offset_t: DeviceScan
✅ choose_offset_t: DeviceScanByKey
device_select.cuh choose_offset_t: UniqueByKey
streaming (for any user-provided offset type): Flagged, If,
device_partition.cuh streaming (for larger-than-int user-provided offset types): Flagged, If
int: ThreeWayPartition
device_copy.cuh 🟡num_ranges: uint32_t
🟡buffer sizes: iterator_traits<SizeIteratorT>::value_type
🟡 🟡
device_memcpy.cuh 🟡 num_ranges: uint32_t
🟡 buffer sizes: iterator_traits<SizeIteratorT>::value_type
🟡 🟡
device_for.cuh 🟡NumItemsT: ForEachN, ForEachCopyN, Bulk
difference_type: ForEach, ForEachCopy
🟡 🟡
device_histogram.cuh 🟡 dynamic dispatch: int for (num_rows * row_stride_bytes)<INT_MAX;
OffsetT otherwise
🟡 🟡
device_merge_sort.cuh 🟡 NumItemsT ✅ extensive check ✅ extensive check
device_segmented_reduce.cuh 🟡 common_iterator_value_t({Begin,End}OffsetIteratorT): Reduce, Sum, Min, Max
⚠️ (note) int: ArgMin, ArgMax
num_segments: int
✅ sanity, rnd [2^31; 2^33] 🟡
device_run_length_encode.cuh 🟠 int 🟠 🟠
device_segmented_sort.cuh 🟠 num_items & num_segments:int 🟠 🟠
device_segmented_radix_sort.cuh 🟠 num_items & num_segments:int 🟠 🟠
device_merge.cuh 🟠 int 🟡 🟡
device_transform.cuh 🟠 int 🟡 🟡
~~device_spmv.cuh~~ 🟠 ~~int~~ (not planned) 🟠 🟠

elstehle avatar Feb 21 '24 10:02 elstehle