Index type
Int32 can be quite a bit faster and we should make sure that we use it where we can for our index calculations.
@luraess also mentioned that it would make sense to configure the hardware dimension index into the Kernel struct.
Could you provide a function that would evaluate differently depending on the device? e.g.
IT = KernelAbstractions.IndexType()
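A minimal sketch of what such a device-dependent function could look like, using plain multiple dispatch. Every name here (`AbstractDevice`, `SmallMemoryDevice`, `LargeMemoryDevice`, `index_type`, `linear_index`) is hypothetical and only illustrates the idea; none of it is existing KernelAbstractions API.

```julia
# Hypothetical sketch: these types and functions are NOT part of KernelAbstractions.
abstract type AbstractDevice end
struct SmallMemoryDevice <: AbstractDevice end  # stand-in for a device where Int32 suffices
struct LargeMemoryDevice <: AbstractDevice end  # stand-in for a device that needs Int64

# Conservative default: fall back to Int64 unless a device opts in to Int32.
index_type(::AbstractDevice) = Int64
index_type(::SmallMemoryDevice) = Int32

# Column-major linear index of (i, j) in a matrix with `nrows` rows,
# computed entirely in the index type chosen for the device.
function linear_index(dev::AbstractDevice, i, j, nrows)
    IT = index_type(dev)
    return IT(i) + (IT(j) - one(IT)) * IT(nrows)
end
```

With this shape, `linear_index(SmallMemoryDevice(), 2, 3, 4)` does its arithmetic in `Int32`, while the same call on `LargeMemoryDevice()` stays in `Int64`.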
In which case or on which device would Int32 not be sufficient?
The maximum linear index with UInt32 is 4,294,967,295 (and 2,147,483,647 with Int32), so an array of about 4 GB at one byte per element. With data center GPUs having 40 GB or more of memory, it's not unlikely that a user wants to process something larger than that.
In particular for ML workloads.
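The limits above can be checked with a few lines of plain Julia. This is only a sketch of the arithmetic; the helper names `addressable_gib` and `fits` are made up for illustration, and the 40 GiB `Float32` buffer is an assumed example size.

```julia
# Largest linear index representable in each candidate index type.
max_i32 = typemax(Int32)   # 2147483647
max_u32 = typemax(UInt32)  # 4294967295

# Size in GiB of the largest array of element type T that index type IT can address.
addressable_gib(IT, T) = Int64(typemax(IT)) * sizeof(T) / 2^30

# Whether an array with n elements can be linearly indexed with IT.
fits(IT, n) = n <= typemax(IT)

# Assumed example: a 40 GiB buffer of Float32 values.
n_elems = Int64(40) * Int64(2)^30 ÷ sizeof(Float32)
```

For this example, `fits(Int32, n_elems)` and `fits(UInt32, n_elems)` are both false, while `fits(Int64, n_elems)` is true. Note that the limit is on the element count, so the byte size at which a 32-bit index overflows depends on the element type.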