RFC: Establish concept of a computing device
CUDA.jl, oneAPI.jl, etc. all provide a ...Device type (CuDevice, ZeDevice, and so on), but without a common supertype.
Likewise, our GPU packages all provide functionality to get the device that a given array lives on, but each defines its own function for it. The latter was partially addressed in JuliaGPU/KernelAbstractions.jl#269, but that is not an elegant solution (all Requires-based), and KernelAbstractions is a heavy dependency. This makes it tricky to address issues like JuliaArrays/ElasticArrays.jl#44. Some of the code mentioned in JuliaGPU/GPUArrays.jl#409 could also lighten its dependencies with common "get-device" functionality (LinearSolve.jl, for example, seems to need GPUArrays.jl only for a b isa GPUArrays.AbstractGPUArray check; similar for DiffEqSensitivity.jl).
This PR and supporting PRs for CUDA.jl, AMDGPU.jl, oneAPI.jl and KernelAbstractions.jl attempt to establish a common supertype for computing devices, and support for the following (a short usage sketch follows the list):
- get_computing_device(x)::AbstractComputingDevice: Get the device x lives on; not limited to arrays (x could e.g. be a whole ML model).
- Adapt.adapt_storage(dev, x): Move x to device dev.
- Sys.total_memory(dev): Get the total memory on dev.
- Sys.free_memory(dev): Get the free memory on dev.
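A short usage sketch of the proposed surface (illustrative only; it assumes the companion CUDA.jl PR listed under "Status" below):

julia> using Adapt, CUDA

julia> x = cu(rand(Float32, 1000));

julia> dev = get_computing_device(x);    # a CuDevice <: AbstractGPUDevice

julia> Sys.total_memory(dev), Sys.free_memory(dev)   # bytes on that GPU

julia> y = adapt(dev, rand(Float32, 1000));   # move a CPU array onto the same device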
I think this will make it much easier to write generic device-independent code:
- Being able to query whether data lives on a CPU or GPU without taking on heavy dependencies should come in useful in many packages.
- Writing adapt(CuDevice(n), x) as an alternative to adapt(CuArray, x) seems very natural (especially in multi-GPU scenarios): it corresponds to the user saying "let's run it on that GPU" instead of "with a different array type".
- Having the ability to query total and available memory can help with finding the right data chunk sizes before sending data to a device with adapt(dev, data_partition) (rough sketch after this list).
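For instance, a rough sketch of the chunking idea (full_data and process! are hypothetical placeholders, and the factor of 2 is an arbitrary headroom choice):

julia> dev = CuDevice(0);

julia> nchunks = max(1, cld(sizeof(full_data), Sys.free_memory(dev) ÷ 2));  # keep ~50% headroom

julia> for chunk in Iterators.partition(full_data, cld(length(full_data), nchunks))
           process!(adapt(dev, collect(chunk)))
       end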
This PR defines AbstractComputingDevice and AbstractGPUDevice and implements GPUDevices. It's very little code, so there should be no load-time impact.
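Roughly, the type hierarchy looks like this (a sketch only; CPUDevice as the name of the concrete CPU type is taken from the discussion below, and the concrete GPU device types stay in their respective packages):

abstract type AbstractComputingDevice end

struct CPUDevice <: AbstractComputingDevice end

abstract type AbstractGPUDevice <: AbstractComputingDevice end

# The companion PRs then make CUDA.CuDevice and oneAPI.ZeDevice subtypes of AbstractGPUDevice.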
CC @vchuravy, @maleadt, @jpsamaroo, @ChrisRackauckas, @findmyway
Status:
- no tests yet while waiting for design comments from package maintainers
- CPU: functional
- CUDA: functional with JuliaGPU/CUDA.jl#1520 (CuDevice <: Adapt.AbstractGPUDevice)
- AMDGPU: needs expert advice, draft is here: JuliaGPU/AMDGPU.jl#233
- oneAPI: functional (but not complete) with JuliaGPU/oneAPI.jl#185 (ZeDevice <: Adapt.AbstractGPUDevice)
- KernelAbstractions: functional on CUDA with JuliaGPU/KernelAbstractions.jl#297 (replaces KernelAbstractions.Device with Adapt.AbstractComputingDevice)
@tkf does this jive with what you need for Loops/Executor?
Maybe instead of ComputingDevice we call it ComputeUnit? We are moving towards heterogeneous systems in general and they might not be separate devices
@vchuravy: Maybe instead of ComputingDevice we call it ComputeUnit? We are moving towards heterogeneous systems in general and they might not be separate devices
Sure, absolutely!
Since this touches several PRs maybe I should wait for feedback on the general idea from @maleadt and @jpsamaroo before doing renames and so on?
@maleadt and @jpsamaroo do you have some initial feedback? And sorry for the state that the AMDGPU part of this is still in @jpsamaroo, I'll need a few pointers on that. :-)
This seems sensible to me, but I don't understand why it belongs in Adapt.jl. The only purpose of this package is to provide utilities to convert complex object structures, and is (in principle) unrelated to GPU/array programming. Now, if the proposed device identification were based on the existing Adapt.jl infrastructure (adapt_structure/adapt_storage, although it would probably need to be generalized) I could understand putting it here, but it currently isn't (i.e. echoing @tkf's comments, https://github.com/JuliaGPU/Adapt.jl/pull/52/files#r879031411).
Codecov Report
Merging #52 (76c686d) into master (d9f852a) will decrease coverage by 30.31%. The diff coverage is 0.00%.
@@ Coverage Diff @@
## master #52 +/- ##
===========================================
- Coverage 81.48% 51.16% -30.32%
===========================================
Files 5 6 +1
Lines 54 86 +32
===========================================
Hits 44 44
- Misses 10 42 +32
| Impacted Files | Coverage Δ |
|---|---|
| src/Adapt.jl | 100.00% <ø> (ø) |
| src/computedevs.jl | 0.00% <0.00%> (ø) |
@maleadt: This seems sensible to me, but I don't understand why it belongs in Adapt.jl.
It seemed to be a natural place, both semantically and dependency-wise.
Semantically, adapt deals with moving storage between devices (at least that's what we use it for, mostly), so it seems natural to have a concept of what a computing device is in here.
Dependency-wise, pretty much all code that will need to define/use get_computing_device depends on Adapt already. We could create a super-lightweight package ComputingDevices.jl - but Adapt.jl would need to depend on it, so we can define adapt_storage(::CPUDevice, x). I'd be fine with that, but I thought people might prefer not to add such a package since Adapt.jl is already super-lightweight, and code that would need ComputingDevices.jl would very likely need Adapt.jl as well anyway.
Now, if the proposed device identification were based on the existing Adapt.jl infrastructure (adapt_structure/adapt_storage
With adapt_...(dev, x) (and I think that would be very good to have, see above) it's already integrated, in a way. And even if we only have a simple default implementation of get_computing_device that uses parent/buffer for now, it would certainly be good to expand in the direction that @tkf suggested ("generalizing Adapt.adapt_structure"). But I think that can be done as a second step (and would benefit Adapt as a whole).
We can always spin off a separate package ComputingDevices.jl later on, if the need arises.
some part of this could possibly go in GPUArrays, but having it in a more lightweight package instead is definitely appealing
Yes, one important motivation here is to enable generic code to be aware of devices without taking on a heavy dependency like GPUArrays.
@vchuravy, @ChrisRackauckas, @tkf, @maleadt, @jpsamaroo thanks for all the initial feedback!
I've updated this PR and tried to include your suggestions - since a lot has changed I've marked the discussions above as resolved, but please do reopen them if issues haven't been addressed in the PR changes and the remarks below:
- Renamed AbstractComputingDevice to AbstractComputeUnit (suggested by @vchuravy) and renamed get_compute_device to get_compute_unit accordingly.
- Added AbstractComputeAccelerator (suggested by @ChrisRackauckas) in between AbstractComputeUnit and AbstractGPUDevice (is the type hierarchy too deep now?).
- Using bottom values if the compute device can't be determined or unified/merged (suggested by @tkf).
- Renamed select_computing_device (suggestions by @tkf and @jpsamaroo) - it's called merge_compute_units now. It can actually be more of a combination than a promotion, I think - for example, multiple CUDA devices in the same box could be merged to a MultiCUDASystem([0,1,2,3]) or so in the future. Currently, get_compute_unit and merge_compute_units will return MixedComputeSystem() if different CPU/GPU devices are involved (illustrated below).
- The generic implementation of get_compute_unit now defends against reference cycles (pointed out by @tkf).
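To illustrate the intended merge semantics (the outputs here are what I'd expect from the description above, not copied REPL output):

julia> merge_compute_units(CuDevice(0), CuDevice(0))   # same device: stays that device
CuDevice(0)

julia> merge_compute_units(CuDevice(0), CPUDevice())   # CPU and GPU mixed
MixedComputeSystem()

julia> get_compute_unit((cu(rand(5)), rand(5)))        # tuple with CUDA and CPU arrays
MixedComputeSystem()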
I think having an automatic recursive compute unit resolution will be important so that get_compute_unit_impl won't have to be specialized for most types - and we should support closures. This works:
julia> using Adapt, CUDA

julia> mllayer = let A = cu(rand(5,5)), b = cu(rand(5)), f = x -> x < 0 ? zero(x) : x
           x -> f.(A * x .+ b)
       end;

julia> x = cu(rand(5));

julia> get_compute_unit((mllayer, x))
CuDevice(0): NVIDIA GeForce
@tkf suggested a "fold over 'relevant data source objects' in a manner generalizing Adapt.adapt_structure". The generated default get_compute_unit_impl does such a fold, but since it doesn't need to reconstruct objects it's not limited to tuples and functions like the current Adapt.adapt_structure. It currently uses all fields - to my knowledge we don't have a widely supported way to get the "relevant data source objects" yet (that would be neat, though!). I suggest we pursue establishing such a standard separately (I suggested something along those lines in JuliaObjects/ConstructionBase.jl#54) and then use it in get_compute_unit_impl when available (even fewer types would need to specialize get_compute_unit_impl then).
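Conceptually, the default get_compute_unit_impl boils down to a field-wise fold like this (a simplified, non-generated sketch, not the actual PR code; the real implementation is @generated and also guards against reference cycles):

# simplified sketch of the generated default (actual signature and details differ)
function get_compute_unit_impl(x::T) where T
    u = ComputingDeviceIndependent()   # bottom value: no device constraint yet
    for i in 1:fieldcount(T)
        isdefined(x, i) || continue    # skip undefined fields
        u = merge_compute_units(u, get_compute_unit(getfield(x, i)))
    end
    return u
end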
@maleadt raised the question whether the compute unit concept belongs into Adapt. I strongly feel that it needs to be in a very lightweight package (so that generic code can be aware of compute devices without heavy deps, i.e. GPUArrays.jl is definitely too heavy for it). I would argue that Adapt is a good place, as adapt(::AbstractComputeUnit, x) is part of it. We could create a package ComputeUnits.jl, but Adapt.jl would have to depend on it. My suggestion would be to add the compute unit concept to Adapt.jl - we can still split off a ComputeUnits.jl in the future if necessary.
(Closed by accident.)
@ChrisRackauckas how do you think this should be integrated with ArrayInterfaceCore: AbstractDevice, AbstractCPU?
I'm not sure. @Tokazama might have opinions.
This seems a lot like what we do in ArrayInterfaceCore. We don't have anything that merges device types and we account for a difference in CPU types that are built on different memory objects like a tuple
This seems a lot like what we do in ArrayInterfaceCore.
Is this already in active use somewhere? I think we should definitely merge this in one lightweight place (ArrayInterfaceCore or Adapt) so that we establish an ecosystem-wide concept of a computing device/node.
I think the loop vectorization stuff uses it a lot.
I think the loop vectorization stuff uses it a lot.
Ok, so we can't just simply replace it - but I assume people will in general be in favor of merging this with the proposal in this PR (meaning adapt() and KernelAbstractions support, and having the GPU packages define the actual device types they are responsible for, to create clean and lean dependency trees)?
As you can see here, we could do a better job of handling the actual device types in terms of documentation and additional methods. But we have a pretty good system for deriving the device type here.
We have some nice tricks for avoiding generated methods using Static.jl too BTW. I haven't taken time to really dig in here and try it out, but it's worth thinking about.
But we have a pretty good system for deriving the device type
Sure! Still, explicit support from the GPU packages would be nicer, right? And it's the only way that allows the user to handle systems with different GPUs cleanly, and also the only really clean way to bring KernelAbstractions into the mix as well.
We have some nice tricks for avoiding generated methods using Static.jl too BTW.
Nice! Could that be used to make the structural fold in this PR type stable without generated code (and how)?
However we do this I think we should really aim to establish a single type tree for computing devices across the ecosystem, one that reaches down to different GPU (TPU, ...) types. And I think it would be very beneficial for users if they can write adapt(some_device, some_ml_model_or_other_big_thing).
So we'll need a dependency between Adapt and ArrayInterfaceCore - the question is, which way round? Package maintainers, over to you. :-)
Sure! Still, explicit support from the GPU packages would be nicer, right? And it's the only way that allows the user to handle systems with different GPUs cleanly, and also the only really clean way to bring KernelAbstractions into the mix as well.
Explicit support for specific GPU devices is something we definitely want. It just takes buy in from those packages, which up until now was hard to get because we had so much latency. We still need to work out all the bugs to update to Static.jl v0.7 (which fixes a bunch of invalidations), but we're already at a load time of ~0.06 seconds on Julia v1.7.
Nice! Could that be used to make the structural fold in this PR type stable without generated code (and how)?
From an initial look I'd say that StaticSymbol uses an internal generated function to merge Symbols. I'm pretty sure there's no other way to safely combine symbols that's also type stable.
However we do this I think we should really aim to establish a single type tree for computing devices across the ecosystem, one that reaches down to different GPU (TPU, ...) types. And I think it would be very beneficial for users if they can write
adapt(some_device, some_ml_model_or_other_big_thing). So we'll need a dependency between Adapt and ArrayInterfaceCore - the question is, which way round? Package maintainers, over to you. :-)
Those maintaining GPU packages will know better if some approach has quirks that will work with their internals, so I'm just trying to explain the advantages of what we've developed thus far. The main advantage is we work pretty hard to navigate type wrappers in the most robust way we can so that if you grab a pointer from something you also can figure out what indices/strides/etc are valid.
The main advantage is we work pretty hard to navigate type wrappers in the most robust way we can so that if you grab a pointer from something you also can figure out what indices/strides/etc are valid.
Nice! I guess we'll have to combine that with the Adapt-style "structural descend", since we don't just want to cover arrays, but also deeply nested objects in general (MC and statistics model, complex data and so on)?
I guess the question now is will Adapt depend on ArrayInterfaceCore or the other way round. Both have a load time under 1 ms now, so from a user perspective it won't matter much, and both are pretty much unavoidable dependencies in any real project anyway. Depending on preferences of the Adapt and ArrayInterface maintainers I can then try to merge this PR with what's in AI, either here or there.
@Tokazama: ... ArrayInterfaceCore. We don't have anything that merges device types and we account for a difference in CPU types that are built on different memory objects like a tuple
The equivalent to ArrayInterface.CPUTuple would be Adapt.ComputingDeviceIndependent in this PR (just by looking at, e.g., a StaticArray we can't know if we're currently on a CPU or GPU without additional context, and bitstypes don't really have to be "adapted" between computing devices).
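For example (illustrative; the expected result follows from the field-wise fold described earlier, and it assumes StaticArrays.jl is loaded):

julia> using StaticArrays

julia> get_compute_unit(SVector(1.0, 2.0, 3.0))   # bitstype, no device constraint
ComputingDeviceIndependent()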
I'd like to get this moving again - @Tokazama can ArrayInterfaceCore.AbstractDevice evolve/change a bit as part of this effort?
I think the GPU side of things might be able to change without any breaking changes for us. We could probably justify some straightforward breaking changes if absolutely necessary if it unified the LoopVectorization and GPU ecosystems.
Thanks @Tokazama . Ok I'll take a look - otherwise we can just provide a conversion between the device API in this PR and ArrayInterfaceCore.AbstractDevice.
Some of the proposed benefits are already available from the very lightweight GPUArraysCore.jl (which did not exist when the issue was opened). You can test x isa AbstractGPUArray, and call adapt of course.
I don't think you can "query total and available memory". But perhaps that could be something like GPUArraysCore.total_memory(::Type{<:Array}) = Sys.total_memory() with a method for the type CuArray?
One thing you can't do is CUDA.functional. Flux uses this so that calling model |> gpu does nothing if there's no GPU. Edit: I wonder how often that could be replaced with total_memory(CuArray) > 0?
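A sketch of what that might look like (total_memory/free_memory here are hypothetical GPUArraysCore additions, not an existing API; CUDA.total_memory() and CUDA.available_memory() are, if I remember correctly, the per-device queries CUDA.jl already provides):

# hypothetical methods following the suggestion above
total_memory(::Type{<:Array}) = Sys.total_memory()
free_memory(::Type{<:Array})  = Sys.free_memory()

# and in CUDA.jl (or an extension), for the current device:
total_memory(::Type{<:CUDA.CuArray}) = CUDA.total_memory()
free_memory(::Type{<:CUDA.CuArray})  = CUDA.available_memory()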
It would be nice if we could have the GPU ecosystem and CPU/SIMD/LoopVectorization ecosystems share a common interface though.
Also, this PR is not just about checking if an array is a GPU array, but about checking what kind of device(s) a whole data structure is located on. Plus the ability to use adapt to move data structures to a specific device. I'm very happy that we have GPUArraysCore.jl now, but I think we definitely need a common super-type for accelerator (and non-accelerator!) computing devices as well as for arrays. KernelAbstractions.jl, for example, would then no longer need to define an extra device type for each supported backend.
As for pushing this forward: Maybe for now we can just work on fixing this up and merging this, and then look at possible connections to ArrayInterfaceCore.jl?