[FEA] Add 64-bit size type option at build-time for libcudf
Many libcudf users have expressed interest in using a 64-bit size type (see #3958 for reference). `cudf::size_type` is currently an `int32_t`, which limits libcudf columns to INT_MAX (~2.1 billion) elements. For string columns this imposes a ~2 GB limit on the character data, for int32 columns it imposes a ~8 GB limit, and for list columns it caps the leaf element count at ~2.1 billion. Downstream libraries must partition their data to stay under these limits.
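For concreteness, here is a minimal, self-contained sketch of where those limits come from, using a local alias as a stand-in for `cudf::size_type` rather than the cudf headers:

```cpp
#include <cstdint>
#include <iostream>
#include <limits>

using size_type = std::int32_t;  // stand-in for cudf::size_type (currently int32_t)

int main()
{
  constexpr auto max_rows = std::numeric_limits<size_type>::max();  // 2,147,483,647

  // String columns address their character data with 32-bit offsets,
  // so the child chars column is capped at ~2 GB.
  std::cout << "max string chars: " << max_rows << " bytes (~2 GB)\n";

  // A fixed-width int32 column with max_rows elements occupies ~8 GB.
  std::cout << "max int32 column: " << max_rows * sizeof(std::int32_t)
            << " bytes (~8 GB)\n";
  return 0;
}
```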
We expect that using a 64-bit size type will incur significant penalties in memory footprint and data throughput. Memory footprint will double for all offset vectors, and the runtime of most functions will increase due to the larger data sizes. Kernel performance may degrade further due to increased register pressure and shared memory usage that has not been tuned for 8-byte indices.
As GPU memory capacities increase, a 32-bit cudf::size_type forces data partitions to become ever-smaller fractions of device memory. Excessive data partitioning also carries performance penalties, so libcudf should enable its community to start experimenting with a 64-bit size type. Scoping the 64-bit size type work in the cuDF Python layer will be tracked in a separate issue (#TBD).
- [ ] Consult with thrust/cub experts about outstanding issues with 64-bit indexing. Some libcudf functions may depend on upstream changes in CCCL; see cccl/47, thrust/1271, and cub/212. `copy_if`, `reduce`, `parallel_for`, `merge`, and `sort` may have unresolved issues.
- [ ] Consult with thrust/cub experts about making 32-bit kernels optional. Currently the 64-bit kernels are disabled in libcudf builds. Disabling the 32-bit kernels would avoid large increases in compile time and binary size when we enable 64-bit thrust/cub kernels.
- [ ] Verify compatibility of 64-bit size type with cuco data structures (needs additional scoping)
- [ ] Audit custom kernels in libcudf for the impact of a 64-bit size type. Introduce conditional logic to adjust shared memory allocations and threads per block as needed based on the size type (see the shared-memory sketch after this list). Identify implementation details that take a 32-bit size type for granted.
- [ ] Audit cuIO size types and their interaction with cudf::size_type
- [ ] Resolve compilation errors from using a 64-bit size type
- [ ] Resolve test failures from using a 64-bit size type
- [ ] Review performance impact of a 64-bit size type using libcudf microbenchmark results
- [ ] Add a build-time option for advanced users to select a 64-bit size type instead of the default 32-bit size type (see the build-option sketch after this list).
- [ ] Add a CI step to build and test the 64-bit size type option.
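For the kernel-audit item above, a hedged sketch of the kind of conditional logic that audit might introduce. The kernel and names below are hypothetical, not existing libcudf code:

```cpp
#include <cudf/types.hpp>

#include <cstddef>

// When cudf::size_type grows to 8 bytes, a tile of offsets staged in shared
// memory doubles in size, so the tile (or the threads per block) may need to
// shrink to stay within the same shared memory budget.
constexpr int tile_size = (sizeof(cudf::size_type) == 8) ? 128 : 256;

__global__ void stage_offsets_kernel(cudf::size_type const* offsets,
                                     cudf::size_type num_offsets)
{
  __shared__ cudf::size_type tile[tile_size];  // 1 KB in either configuration

  auto const idx = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < static_cast<std::size_t>(num_offsets) &&
      threadIdx.x < static_cast<unsigned>(tile_size)) {
    tile[threadIdx.x] = offsets[idx];
  }
  __syncthreads();
  // ... per-block work on the staged offsets would go here ...
}
```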
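And for the build-time option itself, a rough sketch of one possible shape. The macro name CUDF_SIZE_TYPE_INT64 and the CMake option that would define it are hypothetical, not an existing libcudf interface:

```cpp
// Sketch of a conditional alias in cudf/types.hpp. A CMake option such as
// -DCUDF_SIZE_TYPE_INT64=ON (hypothetical) would add this macro as a PUBLIC
// compile definition so that downstream consumers see the same size type.
#include <cstdint>

namespace cudf {
#ifdef CUDF_SIZE_TYPE_INT64
using size_type = int64_t;  // opt-in 64-bit build
#else
using size_type = int32_t;  // current default, unchanged
#endif
}  // namespace cudf
```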
From this stage we will have a better sense of the impact and value of using a 64-bit size type with libcudf.
On branch-23.10 at commit ad9fa501192, I ran `build.sh libcudf` with a 64-bit size type and collected the unique lines that produced compilation errors.
Dictionary errors
cudf/cpp/src/groupby/sort/group_single_pass_reduction_util.cuh(64): error: no suitable conversion function from "cudf::dictionary32" to "cudf::size_type" exists
cudf/cpp/include/cudf/dictionary/detail/iterator.cuh(88): error: no suitable conversion function from "cudf::dictionary32" to "cudf::size_type" exists
cudf/cpp/include/cudf/detail/aggregation/aggregation.cuh(346): error: no suitable conversion function from "cudf::dictionary32" to "cudf::size_type" exists
cudf/cpp/src/groupby/sort/group_correlation.cu(87): error: no suitable conversion function from "cudf::dictionary32" to "cudf::size_type" exists
cudf/cpp/src/groupby/hash/multi_pass_kernels.cuh(107): error: no suitable conversion function from "cudf::dictionary32" to "cudf::size_type" exists
cudf/cpp/include/cudf/dictionary/detail/iterator.cuh(41): error: no suitable conversion function from "cudf::dictionary32" to "cudf::size_type" exists
AtomicAdd errors
cudf/cpp/include/cudf/detail/copy_if_else.cuh(97): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/include/cudf/detail/null_mask.cuh(108): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/src/groupby/sort/group_std.cu(153): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/src/bitmask/null_mask.cu(303): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/include/cudf/detail/valid_if.cuh(68): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/src/io/csv/csv_gpu.cu(274): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/src/io/csv/csv_gpu.cu(375): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/src/io/csv/csv_gpu.cu(207): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/src/io/csv/csv_gpu.cu(204): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/src/io/csv/csv_gpu.cu(264): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/src/io/csv/csv_gpu.cu(266): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/src/io/csv/csv_gpu.cu(281): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/src/io/csv/csv_gpu.cu(209): error: no instance of overloaded function "atomicAdd" matches the argument list
cudf/cpp/src/io/csv/csv_gpu.cu(283): error: no instance of overloaded function "atomicAdd" matches the argument list
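These all share the same root cause: `atomicAdd` has no overload for a signed 64-bit integer, so call sites that atomically accumulate a `cudf::size_type` count stop compiling once the type becomes `int64_t`. A hedged sketch of the usual workarounds (the kernel below is illustrative, not an actual libcudf call site):

```cpp
#include <cudf/types.hpp>

#include <cuda/atomic>  // libcu++ (CCCL) atomics

#include <cstddef>

// Illustrative kernel that accumulates a count of valid elements.
__global__ void count_valid(bool const* valid, cudf::size_type n, cudf::size_type* total)
{
  auto const idx = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx >= static_cast<std::size_t>(n) || !valid[idx]) { return; }

  // Option 1: cuda::atomic_ref works for 4- and 8-byte integer types alike,
  // so the code stays agnostic to the width of cudf::size_type.
  cuda::atomic_ref<cudf::size_type, cuda::thread_scope_device> ref{*total};
  ref.fetch_add(cudf::size_type{1}, cuda::std::memory_order_relaxed);

  // Option 2 (alternative): reinterpret the counter as unsigned long long,
  // for which an atomicAdd overload does exist.
  // atomicAdd(reinterpret_cast<unsigned long long*>(total), 1ull);
}
```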
Thrust errors
cudf/cpp/src/groupby/groupby.cu(269): error: no instance of overloaded function "std::transform" matches the argument list
cudf/cpp/src/copying/contiguous_split.cu(631): error: no instance of overloaded function "std::transform" matches the argument list
cudf/cpp/src/groupby/groupby.cu(304): error: no instance of overloaded function "std::all_of" matches the argument list
cudf/cpp/src/hash/md5_hash.cu(343): error: no instance of overloaded function "thrust::for_each" matches the argument list
cudf/cpp/include/cudf/lists/detail/scatter.cuh(245): error: no instance of overloaded function "thrust::sequence" matches the argument list
cudf/cpp/src/groupby/groupby.cu(313): error: no instance of overloaded function "std::transform" matches the argument list
cudf/cpp/src/filling/repeat.cu(124): error: no instance of overloaded function "thrust::upper_bound" matches the argument list
Device span errors
cudf/cpp/include/cudf/table/experimental/row_operators.cuh(848): error: no instance of constructor "std::optional<_Tp>::optional [with _Tp=cudf::device_span<const int, 18446744073709551615UL>]" matches the argument list
cudf/cpp/include/cudf/table/experimental/row_operators.cuh(848): error: no instance of constructor "std::optional<_Tp>::optional [with _Tp=cudf::device_span<const int32_t, 18446744073709551615UL>]" matches the argument list
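These come from declared types that hard-code `int`/`int32_t` as the span element type while the wrapped values are now `cudf::size_type` (i.e. `int64_t`). The class of fix is to spell the element type in terms of `cudf::size_type` so the declaration tracks the build option; a small illustration with hypothetical member and function names:

```cpp
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>

#include <cstddef>
#include <optional>

struct comparator_state {
  // Before: std::optional<cudf::device_span<int const>> offsets;  // breaks for int64_t data
  // After: the element type tracks whatever cudf::size_type is at build time.
  std::optional<cudf::device_span<cudf::size_type const>> offsets;
};

void set_offsets(comparator_state& state, cudf::size_type const* data, std::size_t size)
{
  state.offsets = cudf::device_span<cudf::size_type const>{data, size};
}
```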
int typing errors
cudf/cpp/src/binaryop/compiled/binary_ops.cuh(272): error: no instance of function template "cudf::util::div_rounding_up_safe" matches the argument list
cudf/cpp/include/cudf/detail/utilities/cuda.cuh(169): error: no instance of overloaded function "std::clamp" matches the argument list
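Both of these are template argument deduction failures: `std::clamp` and `cudf::util::div_rounding_up_safe` deduce a single type from all of their arguments, so mixing an `int` literal or local with a now-64-bit `cudf::size_type` no longer compiles. A small illustration of the failure mode and the usual fix (variable names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>

using size_type = std::int64_t;  // stand-in for a 64-bit cudf::size_type

int main()
{
  size_type const num_items   = 1'000'000;
  size_type const max_blocks  = 65'535;
  size_type const grid_needed = (num_items + 255) / 256;

  // error: deduction conflict between int (the literal 1) and int64_t
  // auto const grid = std::clamp(grid_needed, 1, max_blocks);

  // Fix: make all three arguments the same type, or pass the template
  // argument explicitly.
  auto const grid = std::clamp(grid_needed, size_type{1}, max_blocks);
  // auto const grid = std::clamp<size_type>(grid_needed, 1, max_blocks);
  return grid > 0 ? 0 : 1;
}
```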
Assorted errors
cudf/cpp/src/hash/spark_murmurhash3_x86_32.cu(230): error: no instance of overloaded function "std::max" matches the argument list
cudf/cpp/src/copying/purge_nonempty_nulls.cu(93): error: no instance of function template "cudf::detail::gather" matches the argument list
cudf/cpp/include/cudf/detail/copy_if.cuh(166): error: more than one instance of overloaded function "min" matches the argument list:
The Java code currently hard-codes a signed 32-bit size type in many places. We can switch it to 64 bits everywhere, along with a dynamic check depending on how the code is compiled. But just so you are aware, Spark has a top-level limitation of a signed 32-bit int for the number of rows in a table. We can work around this in some places, but moving the Spark plugin over to a 64-bit index is not going to be super simple.