
API for # of cores/multiprocessors

Open christiangnrd opened this issue 5 months ago • 9 comments

This may not be the best way to approach this, but to improve the heuristic that decides whether to reduce with blocks or with threads, I think there should be a way to expose the number of cores.

See https://github.com/JuliaGPU/CUDA.jl/blob/e561e7a106684f8e4be59cad98a51cc304c671d2/src/mapreduce.jl#L163-L167 and https://github.com/JuliaGPU/Metal.jl/pull/626

I guess we would also need a way to access the maximum threads per block/group. Maybe we could expose an API specifically for reductions that is essentially an interface to what CUDA has defined in big_mapreduce_threshold?
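For illustration, a rough sketch of how such queries could feed the heuristic, loosely modeled on CUDA's big_mapreduce_threshold (the function names are hypothetical, not an existing KA API):

# Hypothetical KA-level queries (placeholder names):
#   compute_unit_count(backend)        -> number of SMs / GPU cores
#   max_threads_per_workgroup(backend) -> device-wide workgroup size limit
# A reduction could then compare the problem size against a device-dependent threshold:
function big_reduction_threshold(backend)
    return compute_unit_count(backend) * max_threads_per_workgroup(backend)
end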

@vchuravy @maleadt @anicusan

Should probably update https://discourse.julialang.org/t/how-to-get-the-device-name-and-the-number-of-compute-units-when-using-oneapi-jl-or-amdgpu-jl/128361 once resolved

christiangnrd avatar Jul 20 '25 21:07 christiangnrd

I think it does make sense to query the maximum workgroup size, and how many SMs/cores are available.

We just need to be defensive and allow for a system to say "unknown" or "infinity"
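(A minimal sketch of what that defensive fallback could look like, with a hypothetical name; which sentinel value to use is exactly what's discussed below:)

# Sketch only: backends that cannot answer return a sentinel instead of guessing;
# backends with real information add their own method returning an Int.
compute_unit_count(::KernelAbstractions.Backend) = nothing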

vchuravy avatar Jul 21 '25 09:07 vchuravy

Are you thinking both a value for "unknown" and another value for "infinite", or one value for both?

Backend state for max threads per group/block (all of the below are queried via the device; a sketch of how these could be normalized follows the second list):

  • CUDA returns an Int32 (Cint); see mapreduce.jl for the implementation
  • Metal returns an MTLSize, a struct containing 3 UInt values, typically (1024, 1024, 1024). In reality that means 1024 across the 3 dimensions, not >1 billion, so maybe it can just be hardcoded as 1024 until/unless it changes
  • oneAPI returns an Int maxTotalGroupSize from compute_properties
  • AMDGPU returns an Int32 (Cint) through a similar interface to CUDA (attribute(device, hipDeviceAttributeMaxThreadsPerBlock))
  • OpenCL device.max_work_group_size::UInt32

Backend state for compute unit count:

  • CUDA returns an Int32 (Cint) see mapreduce.jl for implementation
  • Metal at the moment would have to use a somewhat hacky solution, but I don't think there's a more official way. See https://github.com/JuliaGPU/Metal.jl/pull/626 Edit: see also a better way (https://github.com/JuliaGPU/Metal.jl/pull/652)
  • oneAPI: not sure
  • AMDGPU returns an Int32 (Cint) through a similar interface to CUDA (attribute(device, hipDeviceAttributeMultiprocessorCount))
  • OpenCL device.max_compute_units::Csize_t
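Purely as a sketch of how those differing native types could be normalized (hypothetical names; OpenCL is used as the example since its property names are listed above, and the device is passed in explicitly just for illustration):

# Sketch only: each backend converts its native value (Cint, UInt32, Csize_t,
# MTLSize, ...) to a plain Int so callers never have to special-case backends.
compute_unit_count(::OpenCLBackend, dev) = Int(dev.max_compute_units)
max_threads_per_workgroup(::OpenCLBackend, dev) = Int(dev.max_work_group_size)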

christiangnrd avatar Jul 21 '25 17:07 christiangnrd

Would it be feasible to require backends to have a device property (and be associated with a device) so that such properties can be queried properly? In Metal's case there's only one device, so that wouldn't be a problem (for the foreseeable future), but it could be for other backends.

christiangnrd avatar Jul 21 '25 17:07 christiangnrd

Oof, I have been avoiding adding a notion of device (besides the device switching API https://github.com/JuliaGPU/KernelAbstractions.jl/blob/1ac546fc59cc611d749fa7a50e4a1efa3393851b/src/KernelAbstractions.jl#L601-L627)

vchuravy avatar Jul 21 '25 20:07 vchuravy

Would it be enough to just have the backend implementation call the active device? For example the CUDA implementations would look something like:

function KA.max_threads_per_workgroup(::CUDABackend)
    attribute(active_state().device, DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
end

function KA.compute_unit_count(::CUDABackend)
    attribute(active_state().device, DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT)
end

Also, the docstring should make it clear that the max workgroup size is a theoretical limit and shouldn't be used to decide a kernel's thread count, because (on Metal at least) the actual limit is only guaranteed once the kernel is defined and the compute pipeline instantiated.
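For example, the docstring could carry a caveat along these lines (wording is only a suggestion, and the name is the hypothetical one from the snippet above):

"""
    max_threads_per_workgroup(backend)

Theoretical upper bound on the workgroup size supported by the device behind
`backend`, or `nothing` if the backend cannot tell. The limit achievable by a
particular kernel may be lower (on Metal it is only known once the kernel is
compiled and its compute pipeline instantiated), so don't use this value
directly to pick a launch configuration.
"""
function max_threads_per_workgroup end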

christiangnrd avatar Jul 21 '25 22:07 christiangnrd

I would say yes. But maybe @maleadt has some thoughts.

vchuravy avatar Jul 22 '25 10:07 vchuravy

I have been avoiding adding a notion of device

I'm not sure that's tenable, e.g., in the case of the OpenCLBackend there are going to be large differences between the different devices. Punting that onto the currently active device is one option, but it also makes the API much more hand-wavey. What's the reason you want to avoid that notion?

maleadt avatar Jul 29 '25 07:07 maleadt

While this issue is more specific, I just noticed that this is essentially #617. Maybe we could have a KA.attribute function to query a subset of device features that are desired and available in the major backends? We'd have to make sure to be very specific with regard to return types.
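Something like the following, purely as a strawman (the function and the keys are made up), with each key having a documented, concrete return type:

# Strawman only, as it would live inside KernelAbstractions:
#   :compute_units             -> Int (or nothing if unknown)
#   :max_threads_per_workgroup -> Int (or nothing if unknown)
#   :max_shared_memory         -> Int, in bytes (or nothing if unknown)
attribute(backend::Backend, key::Symbol) = nothing  # defensive default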

it also makes the API much more hand-wavey

Not familiar enough to comment on whether this should be the final API or not, but maybe the feature query function could take a device index as an argument, like below? This would be much less hand-wavey than my suggestion above.

From https://github.com/JuliaGPU/KernelAbstractions.jl/issues/617#issuecomment-3027396411:

function NNop._shared_memory(::CUDABackend, device_id::Integer)
    dev = collect(CUDA.devices())[device_id]
    return UInt64(CUDA.attribute(dev, CUDA.CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK))
end

christiangnrd avatar Jul 30 '25 03:07 christiangnrd

a device index

OpenCL devices are not identified by a number (but by a handle), and the order of querying them is unspecified, so we cannot use simple integer indices.
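(For illustration only, one way around that would be to take the backend-specific device handle rather than an index; a rough sketch with hypothetical names:)

# Sketch: the second argument is whatever handle type the backend uses
# (CuDevice, cl.Device, ...), so no global device ordering is assumed.
compute_unit_count(::CUDABackend, dev::CuDevice) =
    Int(attribute(dev, CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT))
compute_unit_count(::OpenCLBackend, dev::cl.Device) = Int(dev.max_compute_units)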

maleadt avatar Jul 30 '25 06:07 maleadt