API for # of cores/multiprocessors
This may not be the best way to approach this, but to improve the heuristic that decides whether to reduce with blocks or with threads, I'm thinking there should be a way to expose the number of cores.
See https://github.com/JuliaGPU/CUDA.jl/blob/e561e7a106684f8e4be59cad98a51cc304c671d2/src/mapreduce.jl#L163-L167 and https://github.com/JuliaGPU/Metal.jl/pull/626
I guess we would also need a way to access the max threads per block/group. Maybe we expose an API specifically for reductions that is essentially an interface for what CUDA has defined in big_mapreduce_threshold?
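For context, a minimal sketch of what a backend-agnostic analogue could look like. `KA.compute_unit_count` and `KA.max_threads_per_workgroup` are hypothetical names (used again below), and note that CUDA's actual threshold uses threads per multiprocessor rather than per block, so this is only an approximation:

```julia
import KernelAbstractions as KA

# Rough, backend-agnostic analogue of CUDA.jl's big_mapreduce_threshold:
# estimate the device's maximum concurrency from the two proposed queries.
function big_mapreduce_threshold(backend)
    return KA.compute_unit_count(backend) * KA.max_threads_per_workgroup(backend)
end
```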
@vchuravy @maleadt @anicusan
Should probably update https://discourse.julialang.org/t/how-to-get-the-device-name-and-the-number-of-compute-units-when-using-oneapi-jl-or-amdgpu-jl/128361 once resolved
I think it does make sense to query the maximum workgroup size and how many SMs/cores are available.
We just need to be defensive and allow for a system to say "unknown" or "infinity"
Are you thinking both a value for "unknown" and another value for "infinite", or one value for both?
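If we do end up distinguishing them, one possible encoding, purely as a sketch with hypothetical names: `nothing` for "unknown" and `typemax(Int)` for "effectively unlimited".

```julia
using KernelAbstractions: CPU

# Hypothetical query returning Union{Int,Nothing}: `nothing` means "unknown",
# typemax(Int) means "effectively unlimited".
compute_unit_count(::CPU) = Sys.CPU_THREADS   # host count is known exactly

# Callers that need a concrete number can treat both sentinels the same way:
function with_fallback(query, backend, default)
    n = query(backend)
    return (n === nothing || n == typemax(Int)) ? default : n
end
```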
Backend state for max threads per group/block. All of the below are queried via the device:
- CUDA returns an `Int32` (`Cint`); see mapreduce.jl for the implementation
- Metal returns an `MTLSize`, a struct containing 3 `UInt` values, typically (1024, 1024, 1024). In reality the limit is 1024 across the 3 dimensions combined, not >1 billion, so maybe it can just be hardcoded as that until such a time when/if it changes (see the sketch after this list)
- oneAPI returns an `Int` (`maxTotalGroupSize` from `compute_properties`)
- AMDGPU returns an `Int32` (`Cint`) through a similar interface to CUDA (`attribute(device, hipDeviceAttributeMaxThreadsPerBlock)`)
- OpenCL returns `device.max_work_group_size::UInt32`
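A normalized `Int`-returning API would therefore need per-backend conversion. A Metal sketch, assuming Metal.jl wraps the ObjC `maxThreadsPerThreadgroup` device property:

```julia
# Sketch; assumes Metal.jl exposes the MTLDevice maxThreadsPerThreadgroup
# property (an MTLSize with width/height/depth fields).
function KA.max_threads_per_workgroup(::MetalBackend)
    sz = Metal.device().maxThreadsPerThreadgroup
    # The three dimensions share a single budget (1024 total today),
    # so report the per-dimension maximum rather than the product.
    return Int(sz.width)
end
```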
Backend state for compute unit count:
- CUDA returns an `Int32` (`Cint`); see mapreduce.jl for the implementation
- Metal would at the moment have to use a somewhat hacky solution, and I don't think there's a more official way; see https://github.com/JuliaGPU/Metal.jl/pull/626. Edit: see also a better way (https://github.com/JuliaGPU/Metal.jl/pull/652)
- oneAPI: not sure
- AMDGPU returns an `Int32` (`Cint`) through a similar interface to CUDA (`attribute(device, hipDeviceAttributeMultiprocessorCount)`; see the sketch after this list)
- OpenCL returns `device.max_compute_units::Csize_t`
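For AMDGPU that would presumably be a near copy of the CUDA version. A sketch based on the attribute call above; the exact AMDGPU.jl module paths here are assumptions:

```julia
# Sketch; hipDeviceAttributeMultiprocessorCount is the HIP attribute
# mentioned above, but the module paths are assumptions on my part.
function KA.compute_unit_count(::ROCBackend)
    dev = AMDGPU.device()
    return Int(AMDGPU.HIP.attribute(dev, AMDGPU.HIP.hipDeviceAttributeMultiprocessorCount))
end
```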
Would it be feasible to require backends to have a device property (and be associated with a device) so that such properties can be queried properly? In Metal's case there is only one device, so that wouldn't be a problem (for the foreseeable future), but it could be for other backends.
Oof, I have been avoiding adding a notion of device (besides the device switching API https://github.com/JuliaGPU/KernelAbstractions.jl/blob/1ac546fc59cc611d749fa7a50e4a1efa3393851b/src/KernelAbstractions.jl#L601-L627)
Would it be enough to just have the backend implementation call the active device? For example the CUDA implementations would look something like:
```julia
function KA.max_threads_per_workgroup(::CUDABackend)
    attribute(active_state().device, DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
end

function KA.compute_unit_count(::CUDABackend)
    attribute(active_state().device, DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT)
end
```
Also, the docstring should make it clear that the max workgroup size is a theoretical limit and shouldn't be used to decide a kernel's thread count, because that theoretical limit is all Metal can guarantee until the kernel is defined and the compute pipeline instantiated.
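For instance, in Metal.jl the binding per-kernel limit only becomes available once the kernel is compiled; `f`, `args`, and `n` below are placeholders:

```julia
# The device-level query is only a theoretical ceiling; the per-kernel limit
# comes from the compiled compute pipeline state.
kernel = @metal launch=false f(args...)
threads = min(n, kernel.pipeline.maxTotalThreadsPerThreadgroup)
kernel(args...; threads, groups = cld(n, threads))
```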
I would say yes. But maybe @maleadt has some thoughts.
> I have been avoiding adding a notion of device
I'm not sure that's tenable, e.g., in the case of the OpenCLBackend there are going to be large differences between the different devices. Punting that onto the currently active device is one option, but it also makes the API much more hand-wavey. What's the reason you want to avoid that notion?
While this issue is more specific, I just noticed that this is essentially #617. Maybe we could have a KA.attribute function to query a subset of device features that are desired and available in the major backends? We'd have to make sure to be very specific about return types.
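As a rough sketch of what that could look like (the enum and all KA-side names are hypothetical; only the CUDA.jl calls are real):

```julia
# Hypothetical attribute-query API; everything KA-side here is illustrative.
@enum DeviceAttribute MaxThreadsPerWorkgroup ComputeUnitCount

# A Union{Int,Nothing} return lets a backend answer "unknown" with `nothing`.
function KA.attribute(::CUDABackend, attr::DeviceAttribute)
    dev = CUDA.device()
    attr == MaxThreadsPerWorkgroup &&
        return Int(CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK))
    attr == ComputeUnitCount &&
        return Int(CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT))
    return nothing
end
```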
> it also makes the API much more hand-wavey
Not familiar enough to comment on whether this should be the final API or not, but maybe the feature query function could take a device index as an argument, like below? This would be much less hand-wavey than my suggestion above.
From https://github.com/JuliaGPU/KernelAbstractions.jl/issues/617#issuecomment-3027396411:
```julia
function NNop._shared_memory(::CUDABackend, device_id::Integer)
    dev = collect(CUDA.devices())[device_id]
    return UInt64(CUDA.attribute(dev, CUDA.CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK))
end
```
> a device index
OpenCL devices are not identified by a number (but by a handle), and the order of querying them is unspecified, so we cannot use simple integer indices.
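So the query would have to accept the device handle itself. A sketch, using the property listed above (whether `cl.devices()` takes zero arguments is an assumption):

```julia
# Sketch: accept the opaque cl.Device handle rather than an integer index.
function KA.compute_unit_count(::OpenCLBackend, dev::cl.Device)
    return Int(dev.max_compute_units)
end

# Callers enumerate handles; the iteration order carries no meaning.
for dev in cl.devices()
    @show KA.compute_unit_count(OpenCLBackend(), dev)
end
```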