[CUDA] Implement urKernelSuggestMaxCooperativeGroupCountExp for Cuda
This commit implements the experimental urKernelSuggestMaxCooperativeGroupCountExp entry point for the CUDA adapter. The entry point retrieves the maximum number of cooperative groups that can be launched on the device.
The changes also cache the result of the CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT CUDA driver query. The multiprocessor count is needed to calculate the device-wide maximum number of cooperative groups, because the CUDA occupancy query used has per-SM (multiprocessor) semantics.
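The scaling described above can be sketched roughly as follows. This is only an illustration, not the adapter code: `DeviceSketch`, `getNumSMs`, and the free function `suggestMaxCooperativeGroupCount` are hypothetical names, the driver query is replaced by an injected callback so the snippet builds without the CUDA toolkit, and caching lazily on first use is an assumption (the commit may cache at a different point).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical stand-in for the adapter's device object; not the real UR type.
class DeviceSketch {
public:
  // `QuerySMCount` stands in for the driver query of
  // CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT.
  explicit DeviceSketch(std::function<int()> QuerySMCount)
      : QuerySMCount(std::move(QuerySMCount)) {}

  // Runs the query on first use and returns the cached value afterwards.
  int getNumSMs() {
    if (!CachedNumSMs)
      CachedNumSMs = QuerySMCount();
    return *CachedNumSMs;
  }

private:
  std::function<int()> QuerySMCount;
  std::optional<int> CachedNumSMs;
};

// Scales a per-SM occupancy result to a device-wide group count.
// `MaxGroupsPerSM` is assumed to come from a per-multiprocessor occupancy
// query for the kernel at the requested work-group size.
inline uint32_t suggestMaxCooperativeGroupCount(DeviceSketch &Device,
                                                int MaxGroupsPerSM) {
  assert(MaxGroupsPerSM >= 0 && "occupancy result cannot be negative");
  const int NumSMs = Device.getNumSMs();
  assert(NumSMs > 0 && "a device has at least one multiprocessor");
  return static_cast<uint32_t>(MaxGroupsPerSM) * static_cast<uint32_t>(NumSMs);
}
```

With a device that reports 108 multiprocessors and an occupancy result of 2 groups per SM, the sketch suggests 216 groups, and repeated calls reuse the cached multiprocessor count instead of querying again.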
Testing and related changes enabling querying this from SYCL: https://github.com/intel/llvm/pull/14333
2024-06-27T14:32:19.4840797Z Failed Tests (1):
2024-06-27T14:32:19.4849324Z SYCL :: GroupAlgorithm/root_group.cpp
@pbalcer Yes, I'm aware, thanks! The root group barrier is not yet supported correctly for cooperative-group kernels in the CUDA backend, so the corresponding intel/llvm PR will XFAIL this test until it is implemented.
The test previously passed because the query returned a single group, so the barrier being called was a work-group-level one rather than a device-wide (cross-work-group) one.
After the last rebase, there is an e2e failure that seems unrelated:
SYCL :: Regression/device_num.cpp