unified-runtime [CUDA] Implement urKernelSuggestMaxCooperativeGroupCountExp for Cuda

This commit implements the experimental urKernelSuggestMaxCooperativeGroupCountExp for the CUDA adapter. It retrieves the maximum number of cooperative groups that can be launched on the device.

The changes also cache the result of the CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT CUDA driver query. The cached count is used to calculate the device-wide maximum number of cooperative groups, because the CUDA occupancy query used has per-SM (multiprocessor) semantics.
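For reviewers, the device-wide calculation can be sketched roughly as follows. This is only an illustration in plain C++, not the adapter code: `DeviceSketch`, `QueryMultiprocessorCount`, and `suggestMaxCooperativeGroupCount` are hypothetical names, and the lambda-style query stands in for the real driver call so the sketch compiles without the CUDA toolkit. It assumes the per-SM occupancy result is simply multiplied by the cached multiprocessor count.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical stand-in for a device handle; not the adapter's real type.
struct DeviceSketch {
  // Stand-in for the CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT driver query.
  std::function<int()> QueryMultiprocessorCount;
  // Cached result, so the driver is queried at most once per device.
  std::optional<int> CachedMultiprocessorCount;

  int getMultiprocessorCount() {
    if (!CachedMultiprocessorCount)
      CachedMultiprocessorCount = QueryMultiprocessorCount();
    return *CachedMultiprocessorCount;
  }
};

// MaxGroupsPerSM is assumed to come from a per-SM occupancy query.
// The device-wide maximum is the per-SM result scaled by the SM count.
inline uint32_t suggestMaxCooperativeGroupCount(DeviceSketch &Device,
                                                int MaxGroupsPerSM) {
  assert(MaxGroupsPerSM >= 0 && "occupancy result must be non-negative");
  return static_cast<uint32_t>(MaxGroupsPerSM) *
         static_cast<uint32_t>(Device.getMultiprocessorCount());
}
```

With a device that reports 80 multiprocessors and an occupancy result of 2 groups per SM, this sketch would suggest 160 groups, and repeated calls reuse the cached count instead of querying the driver again.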

Testing and related changes enabling querying this from SYCL: https://github.com/intel/llvm/pull/14333

Jun 27 '24 13:06 GeorgeWeb

2024-06-27T14:32:19.4840797Z Failed Tests (1):
2024-06-27T14:32:19.4849324Z   SYCL :: GroupAlgorithm/root_group.cpp

Jun 27 '24 16:06 pbalcer

2024-06-27T14:32:19.4840797Z Failed Tests (1):
2024-06-27T14:32:19.4849324Z   SYCL :: GroupAlgorithm/root_group.cpp

@pbalcer Yeah, aware, thanks! The root group barrier is currently not supported correctly for cooperative-group kernels in the CUDA backend, so the corresponding intel/llvm PR will XFAIL this test until it is implemented.

It previously passed because the query was returning a single group and it was calling a work-group-level barrier rather than a device-wide (cross-work-group) one.

Jun 27 '24 16:06 GeorgeWeb

After the last rebase, there's a:

SYCL :: Regression/device_num.cpp

e2e failure that seems unrelated.

Sep 06 '24 11:09 GeorgeWeb