Allow RDC builds without CDP support
The following code leads to excessive memory footprint when compiled with -rdc=true:
#include <iostream>
#include <cub/device/device_segmented_sort.cuh>
template class cub::DispatchSegmentedSort<false, int, int, int,int*, int*>;
int main() {
size_t free_byte{}, total_byte{};
double used{};
cudaMemGetInfo(&free_byte, &total_byte);
used = static_cast<double>(total_byte - free_byte) / 1024.0 / 1024.0;
std::cout << "CUB " << CUB_VERSION << ": " << used << " MB" << std::endl;
return 0;
}
:nvcc -O3 -rdc=true testSort.cu
:./a.out
CUB 200000: 1410.75 MB
:nvcc -O3 testSort.cu
:./a.out
CUB 200000: 616.75 MB
The issue can be reduced to the following code:
#include <iostream>
__global__ void continuation() { }
__global__ void kernel()
{
continuation<<<1, 256>>>();
__cudaDeviceSynchronizeDeprecationAvoidance();
}
int main() {
size_t free_byte{}, total_byte{};
double used{};
cudaMemGetInfo(&free_byte, &total_byte);
used = static_cast<double>(total_byte - free_byte) / 1024.0 / 1024.0;
std::cout << used << " MB" << std::endl;
}
The __cudaDeviceSynchronizeDeprecationAvoidance is used on device only when debug_synchronous is set. I suggest we restrict "synchronous" part if the option in the case of device-side usage.
The second issue here is that we take -rdc=true as a desire to launch device-scope algorithms from device. Although -rdc=true is a requirement for device-side launches, it doesn't necessarily used for that purpose. The option might indicate that device code is distributed in a library, or that some device code is compiled in a different TU. I suggest we take a closer look at the overheads associated with device-side launches support and in case ones are found, separate rdc concept from device-side launch support.
Agreed, this does sound like a reasonable option. We'd need to default to using the RDC state, but provide an opt-out for CDP support.
Bumping priority after taking a closer look at this:
:nvcc -O3 -rdc=true testSort.cu
:./a.out
CUB 200000: 1410.75 MB
:nvcc -O3 testSort.cu
:./a.out
CUB 200000: 616.75 MB
😬