
[SYCL][Ext] Query kernel maximum active work-groups based on occupancy

Open GeorgeWeb opened this issue 1 year ago • 1 comment

The currently proposed and implemented query is max_num_work_group_occupancy_per_cu, which retrieves the maximum number of actively executing work-groups at compute-unit occupancy granularity.

This commit also overloads the max_num_work_group_sync query to take extra parameters for the local work-group size and the dynamic local memory size (in bytes), allowing users to pass these important resource-usage factors to the query so they are taken into account in the final group-count suggestion. This overload is currently only usable when targeting CUDA.
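For illustration, here is a minimal sketch of how a caller might use the overloaded query shape described above, i.e. (queue, range<3>, dynamic local-memory bytes). The kernel name is hypothetical and the descriptor spelling is illustrative, since the query names were still being discussed in this PR:

```cpp
#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

class MyKernel; // hypothetical kernel name

int main() {
  sycl::queue q;

  // Obtain a sycl::kernel object for the query.
  auto bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(
      q.get_context(), {sycl::get_kernel_id<MyKernel>()});
  sycl::kernel krn = bundle.get_kernel(sycl::get_kernel_id<MyKernel>());

  // The resource-usage factors the overload adds: local work-group size
  // and dynamic work-group local memory in bytes.
  sycl::range<3> wg_size{1, 1, 256};
  size_t dyn_local_mem_bytes = 1024;

  // Illustrative descriptor path; the exact name was under discussion.
  auto group_count = krn.ext_oneapi_get_info<
      syclex::info::kernel_queue_specific::max_num_work_groups>(
      q, wg_size, dyn_local_mem_bytes);
  (void)group_count;

  // The kernel must actually exist in the application for the bundle
  // lookup above to succeed.
  q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for<MyKernel>(sycl::range<1>{256}, [=](sycl::id<1>) {});
  }).wait();
}
```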

GeorgeWeb avatar Jun 27 '24 16:06 GeorgeWeb

This is required for CUTLASS. I am converting this to DRAFT for now, as there is a unified-runtime dependency: https://github.com/oneapi-src/unified-runtime/pull/1796.

Update: Ready for review.

GeorgeWeb avatar Jun 27 '24 16:06 GeorgeWeb

@gmlueck

Can you update the wording in the PR description? It talks about max_num_work_group_sync, which I guess is an old name for the query?

I will update it. The part of the PR description about updating max_num_work_group_sync, which is a different query from the extension query I added, just notes that I overloaded ext_oneapi_get_info for the former with extra arguments (range<3>, size_t), because that is how it is used in the mapping to the CUDA implementation in UR. This overload could be moved to a separate PR if having it here is confusing.

This overload is currently only usable when targeting CUDA.

What is preventing us from implementing the API on Level Zero?

Good question. Nothing is really stopping Level Zero from implementing this; in fact, it is already implemented. I think the reason I said that is that the ext_oneapi_get_info(queue, range<3>, size_t) overload doesn't really change anything for Level Zero compared to the ext_oneapi_get_info(queue) one, because the zeKernelSuggestMaxCooperativeGroupCount API (which has per-device semantics) does not take limiting factors such as work-group size or local memory size. So I only meant that the overload is not useful for Level Zero at the moment: the extra arguments are ignored in the UR Level Zero adapter implementation, but the query still works. That was poor phrasing on my part.
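For context, a rough sketch of the adapter behavior described here (not the actual UR code; the function name is hypothetical): the extra arguments are accepted but ignored, because the underlying Level Zero API has per-device semantics.

```cpp
#include <level_zero/ze_api.h>

// Hypothetical shape of the UR Level Zero path: the work-group size and
// dynamic local-memory size are ignored, since
// zeKernelSuggestMaxCooperativeGroupCount takes no such limiting factors.
uint32_t suggestGroupCountL0(ze_kernel_handle_t kernel,
                             const size_t /*wgSize*/[3],
                             size_t /*dynLocalMemBytes*/) {
  uint32_t totalGroupCount = 0;
  if (zeKernelSuggestMaxCooperativeGroupCount(kernel, &totalGroupCount) !=
      ZE_RESULT_SUCCESS)
    return 0; // simplified error handling for the sketch
  return totalGroupCount;
}
```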

If we cannot implement it on Level Zero, we should add a section "Backend support status" to the spec indicating what is supported. However, it would be better to implement it universally from the start.

I've added a "Backend support status" section. The HIP adapter currently returns 1, but we'll add proper support very soon in a separate [HIP] PR in unified-runtime.

GeorgeWeb avatar Jul 04 '24 10:07 GeorgeWeb

Thanks for clarifying the spec, @GeorgeWeb. Now that I understand the intent a little better, I wonder if we should just merge these queries into sycl_ext_oneapi_launch_queries. I think that wouldn't be a huge change, since you already depend on that extension. The sycl_ext_oneapi_launch_queries extension already has queries named max_work_group_size and max_num_work_groups, so it would be very natural to add queries named recommended_work_group_size and recommended_num_work_groups. Tagging @Pennycook here also for his opinion.

I'd suggest changing the description to be more generic, though, so that it can be implemented on any backend. For example, recommended_num_work_groups might be described like this:

Returns the recommended number of work-groups, when the kernel is submitted to the specified queue with the specified work-group size and the specified amount of dynamic work-group local memory (in bytes). This value is always less than or equal to the value returned by max_num_work_groups (when given the same query arguments).

[Note: On CUDA backends, this query returns the number of work-groups that will maximize the occupancy of the device's compute units. -- end note]

(Feel free to tweak the note about CUDA if what I have isn't right.)

I think we can implement this on any backend because we can always fall back and return the same value as max_num_work_groups.
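A minimal sketch of that fallback idea, with hypothetical helper names standing in for the real backend queries:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-ins for the real backend queries; a CUDA-like
// backend would compute these from the kernel, the work-group size, and
// the dynamic local-memory usage.
size_t query_max_num_work_groups() { return 1024; }
bool backend_has_occupancy_query() { return false; }
size_t query_occupancy_based_group_count() { return 432; }

// recommended_num_work_groups is always <= max_num_work_groups; a backend
// without an occupancy query simply falls back to the maximum, which is
// why the query can be implemented universally.
size_t recommended_num_work_groups() {
  size_t max_groups = query_max_num_work_groups();
  if (!backend_has_occupancy_query())
    return max_groups; // universal fallback
  return std::min(query_occupancy_based_group_count(), max_groups);
}
```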

gmlueck avatar Jul 09 '24 20:07 gmlueck

The sycl_ext_oneapi_launch_queries extension already has queries named max_work_group_size and max_num_work_groups, so it would be very natural to add queries named recommended_work_group_size and recommended_num_work_groups. Tagging @Pennycook here also for his opinion.

This makes sense to me.

I especially like the suggestion for this to return the total recommended number of work-groups instead of the recommended number of work-groups per compute unit. I don't think the definition of "compute unit" is strict enough in OpenCL/SYCL to be useful, so the less we rely on it the better. Additionally, requiring developers to multiply the return value of a query by the return value of a different query seems unwieldy compared to us just returning the result of that multiplication directly.

Pennycook avatar Jul 09 '24 21:07 Pennycook

I think we can implement this on any backend because we can always fall back and return the same value as max_num_work_groups.

@gmlueck I see your point. Thanks for the good observations; as always, I appreciate your experience in making extension docs clearer and more concise. I am going to use the suggested wording.

One question I have: would that mean implementing max_num_work_groups from the sycl_ext_oneapi_launch_queries extension as well, in order to fall back appropriately? Is my understanding correct?

Additionally, the CUTLASS port in our codeplaysoftware/cutlass-fork does require a query with semantics returning the number of work-groups per compute unit, so while recommended_num_work_groups can have device/cross-work-group semantics, we'd still need a separate recommended_num_work_groups_per_cu that uses the above and divides by the number of compute units, etc. (implementation details). There are a number of uses for this; a generic one, for example, is updating an sm_occupancy (as in the number of work-groups per SM) value that is passed to GEMM kernels and exposed to users for their own use (see the sm_occupancy usage in the CUTLASS code).

I especially like the suggestion for this to return the total recommended number of work-groups instead of the recommended number of work-groups per compute unit.

@Pennycook I was likely unclear on the need for those semantics. Also, while the definition of "compute unit" may not be strict in OpenCL/SYCL, it can still be useful and understood appropriately in this case, because we return the number of CUs according to the underlying hardware for each target in our backends. An example is the info::device::max_compute_units device query. Considering the need for it, would you be willing to accept adding an extra recommended_num_work_groups_per_cu in addition to recommended_num_work_groups (which would have device semantics), or would you suggest we make a more Codeplay-specific extension rather than a oneAPI one?
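For illustration, the per-compute-unit figure CUTLASS wants could be derived from a device-wide count roughly like this (a sketch that assumes work-groups distribute evenly across compute units; the function name is hypothetical):

```cpp
#include <sycl/sycl.hpp>

// Derive a per-compute-unit occupancy value (sm_occupancy in CUTLASS
// terms) from a device-wide work-group count. This assumes work-groups
// spread evenly over compute units, which matches CUDA SMs.
size_t work_groups_per_cu(size_t device_wide_groups,
                          const sycl::device &dev) {
  size_t num_cus = dev.get_info<sycl::info::device::max_compute_units>();
  return num_cus ? device_wide_groups / num_cus : 0;
}
```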

GeorgeWeb avatar Jul 15 '24 14:07 GeorgeWeb

Additionally, the CUTLASS port in our codeplaysoftware/cutlass-fork does require a query with semantics returning the number of work-groups per compute unit, so while recommended_num_work_groups can have device/cross-work-group semantics, we'd still need a separate recommended_num_work_groups_per_cu that uses the above and divides by the number of compute units, etc. (implementation details). There are a number of uses for this; a generic one, for example, is updating an sm_occupancy (as in the number of work-groups per SM) value that is passed to GEMM kernels and exposed to users for their own use (see the sm_occupancy usage in the CUTLASS code).

@Pennycook I was likely unclear on the need for those semantics. Also, while the definition of "compute unit" may not be strict in OpenCL/SYCL, it can still be useful and understood appropriately in this case, because we return the number of CUs according to the underlying hardware for each target in our backends. An example is the info::device::max_compute_units device query. Considering the need for it, would you be willing to accept adding an extra recommended_num_work_groups_per_cu in addition to recommended_num_work_groups (which would have device semantics), or would you suggest we make a more Codeplay-specific extension rather than a oneAPI one?

If the only way to interpret info::device::max_compute_units meaningfully is to know what (device, backend) combination you're running on, what do we gain by exposing it as a general query? If you're going to have to write code with separate branches for NVIDIA and Intel GPUs to convert between CUs, SMs and Xe Cores, wouldn't it be easier (and cleaner) to just call the underlying backend directly?

Pennycook avatar Jul 15 '24 14:07 Pennycook

If the only way to interpret info::device::max_compute_units meaningfully is to know what (device, backend) combination you're running on, what do we gain by exposing it as a general query? If you're going to have to write code with separate branches for NVIDIA and Intel GPUs to convert between CUs, SMs and Xe Cores, wouldn't it be easier (and cleaner) to just call the underlying backend directly?

@Pennycook info::device::max_compute_units is already defined in the spec: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_device_information_descriptors. I don't see any need for branching, as the info::device::max_compute_units query already returns SMs on NVIDIA CUDA devices, EUs/Xe Cores on Intel GPUs, and CUs on AMD GPUs. It is a core query, implemented on all adapters/backends. I am not trying to argue the point, but if we are wary of compute units, shouldn't we have avoided any use of them in the first place, especially in generic queries in the core spec? I can see that "compute unit" maps to the high-level definition of "multiprocessor" in CUDA and HIP, or at least that's how it is interpreted in the implementation. Respectfully, I just want to understand the meat of the issue better if possible.

GeorgeWeb avatar Jul 15 '24 15:07 GeorgeWeb

@Pennycook info::device::max_compute_units is already defined in the spec: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_device_information_descriptors. I don't see any need for branching, as the info::device::max_compute_units query already returns SMs on NVIDIA CUDA devices, EUs/Xe Cores on Intel GPUs, and CUs on AMD GPUs. It is a core query, implemented on all adapters/backends. I am not trying to argue the point, but if we are wary of compute units, shouldn't we have avoided any use of them in the first place, especially in generic queries in the core spec? I can see that "compute unit" maps to the high-level definition of "multiprocessor" in CUDA and HIP, or at least that's how it is interpreted in the implementation. Respectfully, I just want to understand the meat of the issue better if possible.

I generally discourage people from using info::device::max_compute_units because of the issues that I perceive here. I typically argue against adding new usages of this query (and the "compute unit" concept in general) because any expansion in its usage would make it harder to deprecate/remove in the future.

I don't understand how you can avoid branching here, because I think the units are fundamentally different across backends. I may not completely understand what you're trying to do, so I'll expand below on what I think the differences are.

On NVIDIA GPUs a compute unit is a Streaming Multiprocessor (SM), which is a collection of CUDA cores, registers, shared memory, etc. The number of warps you can schedule concurrently is limited by SM resources, the number of registers per thread, the amount of shared memory, etc.

On Intel GPUs a compute unit is an Execution Unit (EU) or Xe Vector Engine (XVE), which has its own register file. The maximum number of sub-groups you can schedule concurrently on an EU is 8. But the shared memory is shared between the 8 EUs within the same Xe Core. So the maximum number of sub-groups you can schedule concurrently is limited by Xe Core resources, rather than by EU resources.

It feels to me like at some point, a user of your extension would be required to multiply the value returned by Intel GPUs by 8, or otherwise account for the fact that any shared memory limit is not "per compute unit".

Does that make sense?

If we think that users need a value between 0 and 1 that represents how efficiently a given configuration will use the hardware, why not just expose a query that returns that value directly?
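To make the concern concrete, here is a sketch of the kind of backend-specific rescaling a user could be forced into; the factor of 8 and the backend check are purely illustrative, not a real rule:

```cpp
#include <sycl/sycl.hpp>

// Naively dividing a device-wide group count by max_compute_units yields
// a per-SM figure on NVIDIA, but on Intel GPUs a "compute unit" is an
// EU/XVE while shared local memory is pooled per Xe Core (8 EUs in the
// configuration described above), so the value must be rescaled per
// backend.
size_t groups_per_scheduling_unit(size_t device_wide_groups,
                                  const sycl::device &dev) {
  size_t cus = dev.get_info<sycl::info::device::max_compute_units>();
  size_t per_cu = cus ? device_wide_groups / cus : 0;
  if (dev.get_backend() == sycl::backend::ext_oneapi_level_zero)
    return per_cu * 8; // illustrative Xe-Core adjustment only
  return per_cu; // CUDA/HIP: a compute unit is an SM/CU, no adjustment
}
```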

Pennycook avatar Jul 15 '24 15:07 Pennycook

On Intel GPUs a compute unit is an Execution Unit (EU) or Xe Vector Engine (XVE), which has its own register file. The maximum number of sub-groups you can schedule concurrently on an EU is 8. But the shared memory is shared between the 8 EUs within the same Xe Core. So the maximum number of sub-groups you can schedule concurrently is limited by Xe Core resources, rather than by EU resources.

It feels to me like at some point, a user of your extension would be required to multiply the value returned by Intel GPUs by 8, or otherwise account for the fact that any shared memory limit is not "per compute unit".

Does that make sense?

Very insightful, thank you @Pennycook! I was a little muddy on that part. I can see how messy it can get, because we cannot assume the mapping, and I can see why you discourage relying on compute-unit queries for that matter. Reliable portability is hard in this case. I will pause here for a moment and have a think. I appreciate this explanation a lot.

GeorgeWeb avatar Jul 15 '24 15:07 GeorgeWeb

@Pennycook and @gmlueck Circling back to this. We are in agreement with your argument above and have decided to just implement max_num_work_groups with device semantics, and to let SYCLcompat handle the CUDA-specific SM-semantics conversion, which maps well to what the CUTLASS SYCL fork needs at the moment. Future work on recommended_num_work_groups and recommended_work_group_size (linear) can be a follow-up, but it is not required as of now. Thanks a bunch for your input on all this!

Additionally, I've left max_num_work_groups_sync in this PR for now; if it is planned to be effectively replaced by max_num_work_groups, that can be done in a follow-up.

GeorgeWeb avatar Sep 03 '24 10:09 GeorgeWeb

@GeorgeWeb The UR PR is merged; please update the UR tag and fix the conflict so this can be merged. Thanks!

omarahmed1111 avatar Sep 10 '24 11:09 omarahmed1111

An unrelated unexpectedly passing XFAIL: https://github.com/intel/llvm/actions/runs/10809701571/job/29985720662?pr=14333#step:22:2286

********************
Unexpectedly Passed Tests (1):
  SYCL :: OneapiDeviceSelector/no_duplicate_devices.cpp

see #15341 and #15288

GeorgeWeb avatar Sep 11 '24 11:09 GeorgeWeb

Regarding the Doxygen docs build failure: see issue https://github.com/intel/llvm/issues/15355 for the failing "Generate Doxygen documentation" build action. The failure is caused by a stale link pointing to the now-removed kernel-fusion extension.

GeorgeWeb avatar Sep 11 '24 12:09 GeorgeWeb

@intel/llvm-gatekeepers Please merge; the failures here are unrelated.

omarahmed1111 avatar Sep 11 '24 12:09 omarahmed1111

Note for other gatekeepers who may come here from a failed post-commit: those failures were noticed, we had already synced with @GeorgeWeb about them, and a follow-up PR with slight tweaks to the tests is expected to fix the failure.

AlexeySachkov avatar Sep 11 '24 14:09 AlexeySachkov

As @AlexeySachkov already noted (thanks), the follow-up PR that should fix the failure is https://github.com/intel/llvm/pull/15359.

GeorgeWeb avatar Sep 11 '24 15:09 GeorgeWeb