cccl [FEA]: CUB large input support

[FEA]: CUB large input support

Open jrhemstad opened this issue 1 year ago • 1 comments

As a lower-level interface, CUB should optimize for flexibility and performance. As a result, CUB will not guarantee a large input will work by default. However, it should enable users to specify their desired offset type.

This means CUB should not perform any dynamic dispatch based on the input size. Instead, users should have a way to statically specify the offset type. In previous discussion we favored making the type of num_items a template and infer the offset type from the type of num_items.

### Design-related research
- [ ] https://github.com/NVIDIA/cccl/issues/1408
- [ ] https://github.com/NVIDIA/cccl/issues/1454

### Testing large number of items
- [ ] https://github.com/NVIDIA/cccl/issues/2139
- [ ] https://github.com/NVIDIA/cccl/issues/2140
- [x] Add tests for large number of items in `DeviceScan::*ByKey`

### Enable large `num_items` in CUB algorithms that are not sensitive to the choice of `offset_t`
- [ ] https://github.com/NVIDIA/cccl/issues/2062

### Enable large `num_items` in CUB algorithms that are sensitive to the choice of `offset_t`
- [ ] https://github.com/NVIDIA/cccl/issues/1422
- [ ] https://github.com/NVIDIA/cccl/issues/1437
- [ ] https://github.com/NVIDIA/cccl/issues/2458
- [ ] https://github.com/NVIDIA/cccl/issues/2442
- [ ] Add support for large `num_items` to `device_run_length_encode.cuh`
- [ ] Add support for large `num_items` to `device_merge.cuh`
- [ ] Add support for large `num_items` to `device_transform.cuh`
- [ ] Add support for large `num_items` to `device_segmented_radix_sort.cuh`
- [ ] Add support for large `num_items` to `device_segmented_sort.cuh`
- [ ] Add support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}`
- [ ] Add support for large `num_items` to `DeviceSegmentedReduce::{ArgMin,ArgMax}`
- [ ] ~~Add support for large `num_items` to `device_spmv.cuh`~~

### Clean up interim testing infrastructure
- [x] Switch tests for large number of items to use `Device*` interface in `DeviceSelect`
- [x] Switch tests for large number of items to use `Device*` interface in `DeviceScan`

### Documentation
- [ ] We want to be explicit about supported and tested offset types in the `Dispatch` interface (see https://github.com/NVIDIA/cccl/pull/1817#discussion_r1670650317)

Apr 21 '23 17:04 jrhemstad

legend for offset type: ✅ considered done | 🟡 considered lower priority | 🟠 considered higher priority, as it prevents usage for larger-than-INT_MAX number of items | ⏳ in progress

legend for testing columns: ✅ considered done | 🟡 to be done | 🟠 needs to support wider offset types first

algorithm	offset type	tests larger-than`INT_MAX`	tests close to `[U]INT_MAX`
`device_adjacent_difference.cuh`	✅ `choose_offset_t`	✅ 2^33, sanity check, iterators	🟡
`device_radix_sort.cuh`	✅ `choose_offset_t`	✅ extensive check	✅ extensive check
`device_reduce.cuh`	✅ `choose_offset_t`: `Reduce`, `Sum`, `Min`, `Max`, `ReduceByKey`, `TransformReduce` ⚠️ (note) `int`: `ArgMin`, `ArgMax`	✅ sanity, 2^{30,31,33)	✅ sanity, 2^32-1
`device_scan.cuh`	✅ choose_offset_t: DeviceScan ✅ choose_offset_t: DeviceScanByKey	✅	✅
`device_select.cuh`	✅ `choose_offset_t`: `UniqueByKey` ✅ `streaming` (for any user-provided offset type): `Flagged`, `If`,	✅	✅
`device_partition.cuh`	✅ `streaming` (for larger-than-int user-provided offset types): `Flagged`, `If` ⏳ `int`: `ThreeWayPartition`	✅	✅
`device_copy.cuh`	🟡`num_ranges`: `uint32_t` 🟡`buffer sizes`: `iterator_traits<SizeIteratorT>::value_type`	🟡	🟡
`device_memcpy.cuh`	🟡 `num_ranges`: `uint32_t` 🟡 `buffer sizes`: `iterator_traits<SizeIteratorT>::value_type`	🟡	🟡
`device_for.cuh`	🟡`NumItemsT`: ForEachN, ForEachCopyN, Bulk `difference_type`: ForEach, ForEachCopy	🟡	🟡
`device_histogram.cuh`	🟡 dynamic dispatch: int for (num_rows * row_stride_bytes)<INT_MAX; OffsetT otherwise	🟡	🟡
`device_merge_sort.cuh`	🟡 `NumItemsT`	✅ extensive check	✅ extensive check
`device_segmented_reduce.cuh`	🟡 `common_iterator_value_t({Begin,End}OffsetIteratorT)`: `Reduce`, `Sum`, `Min`, `Max` ⚠️ (note) `int`: `ArgMin`, `ArgMax` `num_segments`: `int`	✅ sanity, rnd [2^31; 2^33]	🟡
`device_run_length_encode.cuh`	🟠 `int`	🟠	🟠
`device_segmented_sort.cuh`	🟠 `num_items` & `num_segments`:`int`	🟠	🟠
`device_segmented_radix_sort.cuh`	🟠 `num_items` & `num_segments`:`int`	🟠	🟠
`device_merge.cuh`	🟠 `int`	🟡	🟡
`device_transform.cuh`	🟠 `int`	🟡	🟡
~~`device_spmv.cuh`~~	🟠 ~~`int`~~ (not planned)	🟠	🟠

Feb 21 '24 10:02 elstehle

cccl cccl copied to clipboard

[FEA]: CUB large input support

cccl
cccl copied to clipboard