[DNM] ENH: CuPy multi-device support

Open • crusaderky opened this pull request 8 months ago • 26 comments

  • Add support for multiple devices in CuPy

    UNTESTED: I don't know how to test this without a dual-GPU box, which I don't have access to. Suggestions welcome. At any rate, any tests for this should be in array-api-tests, replicating the same pattern as in https://github.com/scipy/scipy/pull/22756. Tracker: https://github.com/data-apis/array-api-tests/issues/302

  • Validate device keyword in array_api_compat.numpy.astype (see the sketch after this list)

  • Validate device keyword in all Dask functions
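For the NumPy astype case, the kind of validation meant is roughly the following; a minimal sketch, not the actual array-api-compat wrapper. NumPy arrays only live on the "cpu" device, so anything else should be rejected:

import numpy as np

def astype(x: np.ndarray, dtype, /, *, copy: bool = True, device=None):
    # NumPy arrays only exist on "cpu"; reject any other explicit
    # device before delegating to ndarray.astype.
    if device is not None and device != "cpu":
        raise ValueError(f"Unsupported device: {device!r}")
    return x.astype(dtype, copy=copy)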

crusaderky commented Mar 31 '25

@tylerjereddy, as you have a dual-GPU box, could you help test this?

crusaderky commented Mar 31 '25

CI failure is unrelated

crusaderky commented Mar 31 '25

cc @lucyleeow, who has recently been working on multi-device support for scikit-learn

ev-br commented Mar 31 '25

It's not clear to me what happens in CuPy non-creation functions.

e.g.

with cp.cuda.Device(1):
    x = cp.asarray(1)
with cp.cuda.Device(0):
    y = x + 1
assert y.device == x.device

Does the device propagate from the input, as the Array API dictates, or does the interpreter-level default / context prevail? In the latter case, this PR is insufficient.

crusaderky commented Mar 31 '25

In some multi-GPU setups, the following happens:

In [7]: with cp.cuda.Device(0):
   ...:     y = x + 1
   ...:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 2
      1 with cp.cuda.Device(0):
----> 2     y = x + 1

File cupy/_core/core.pyx:1322, in cupy._core.core._ndarray_base.__add__()

File cupy/_core/core.pyx:1746, in cupy._core.core._ndarray_base.__array_ufunc__()

File cupy/_core/_kernel.pyx:1285, in cupy._core._kernel.ufunc.__call__()

File cupy/_core/_kernel.pyx:159, in cupy._core._kernel._preprocess_args()

File cupy/_core/_kernel.pyx:130, in cupy._core._kernel._preprocess_arg()

File cupy/_core/_kernel.pyx:120, in cupy._core._kernel._check_peer_access()

ValueError: The device where the array resides (1) is different from the current device (0). Peer access is unavailable between these devices.

betatim commented Mar 31 '25

ValueError: The device where the array resides (1) is different from the current device (0). Peer access is unavailable between these devices.

Damn. That's a showstopper. We can patch functions, but we can't patch array methods. Are there config flags that can change the behaviour to propagate from input to output?

crusaderky commented Mar 31 '25

I'm not sure what should happen here. We are trying to combine arrays on two different devices.

From the docs it doesn't sound like you can use the context manager to move an existing array to a different device. In the example above, x is on one device, and when y = x + 1 is executed, the array implicitly created from 1 is on another device. In general, you can't combine the two to create y (see https://docs.cupy.dev/en/stable/user_guide/basic.html#current-device for "sometimes this might work but we recommend that you don't do this").
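(For reference, the documented route for moving an existing array is an explicit copy under the target device's context; a minimal sketch, assuming a two-GPU box, if I read the docs right:)

import cupy as cp

with cp.cuda.Device(0):
    x_on_0 = cp.asarray([1.0, 2.0])
with cp.cuda.Device(1):
    x_on_1 = cp.asarray(x_on_0)  # explicit device-to-device copy onto GPU 1
assert x_on_1.device.id == 1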

This doesn't seem that unreasonable, and I don't think the standard says anything about how arrays on different devices should/shouldn't be combined?

betatim commented Mar 31 '25

I'm not sure what should happen here. We are trying to combine arrays on two different devices. This doesn't seem that unreasonable, and I don't think the standard says anything about how arrays on different devices should/shouldn't be combined?

No, we're trying to propagate the input's device to the output, which the standard does say should happen:

Preserve device assignment as much as possible (e.g. output arrays from a function are expected to be on the same device as input arrays to the function).

In

with cp.cuda.Device(1):
    x = cp.asarray(1)
with cp.cuda.Device(0):
    y = x + 1
assert y.device == x.device

I expect a.__add__(b) to ignore the global and context device and just use a.device, which must match b.device if b is an Array. There is no expectation for binops with mismatched input devices to work.

crusaderky commented Mar 31 '25

(what is up with the automatic copilot review 👀)

lucascolley commented Mar 31 '25

Your expectation and mine don't agree, which is what makes me think that it isn't clear what should happen.

I read "Raise an exception if an operation involves arrays on different devices" combined with "If a library has multiple ways of controlling device placement, the most explicit method should have the highest priority." as the user asking for something that isn't possible because they asked for Device(0) with the context manager but want to operate on something that is on a different device.

betatim commented Mar 31 '25

Your expectation and mine don't agree, which is what makes me think that it isn't clear what should happen.

They're not my expectations; they're the standard's recommendations: https://data-apis.org/array-api/latest/design_topics/device_support.html#semantics

"If a library has multiple ways of controlling device placement, the most explicit method should have the highest priority."

You're quoting point 5, but you skipped over point 2:

"Preserve device assignment as much as possible (e.g. output arrays from a function are expected to be on the same device as input arrays to the function)."

Also, the context manager is just an example for the sake of making a reproducer. A more realistic example is:

with cp.cuda.Device(1):
    x = ...
# ... 1000 lines and 3 modules later...
y = scipy.special.logsumexp(x)  # x is on device 1, but the current device is 0

Here there is no "most explicit method". Which should be more obvious to the user: that they're currently on the default device 0, or that the array is on device 1? Are you saying that the desirable behaviour in CuPy is to crash, and that it should ignore the standard's recommendation to propagate from the input?

crusaderky commented Mar 31 '25

My expectation is that y is on the same device as x because of "avoid transfers where possible"

betatim commented Mar 31 '25

My expectation is that y is on the same device as x because of "avoid transfers where possible"

So you agree that there should be propagation from input to output. How is y = x + 1 a few comments above different from it?

crusaderky commented Mar 31 '25

The problem is that the CuPy docs say:

All CuPy operations (except for multi-GPU features and device-to-device copy) are performed on the currently active device.

I admit that in this case

with cp.cuda.Device(1):
    x = cp.asarray(1)
with cp.cuda.Device(0):
    y = x + 1
assert y.device == x.device

I think the intuitive behaviour would be for y to retain x's device; any problems regarding an 'implicit device' of the Python scalar 1 seem to be implementation-detail shortcomings.

The bigger problem is whether such behaviour, which is intuitive for me (and seems to align with the standard), would break the intended design of CuPy's multi-device support. For example, perhaps you are supposed to treat Python scalars not as scalars but as arrays already on the default device, before they are passed to any function. Again, that seems unintuitive to me, but if that is the model CuPy is committed to, it may be worth updating the standard.

lucascolley commented Mar 31 '25

It may be worth opening a CuPy issue to ask the devs about this specific example, to find out how much of it is deliberate (crucial) design and how much is accidental.

lucascolley commented Mar 31 '25

perhaps you are supposed to treat Python scalars not as scalars but as arrays already on the default device, before they are passed to any function.

That feels really weird, considering that there is already explicit ad-hoc code for NEP 50-style type promotion; it would feel logical to assign their device in the same way:

>>> cupy.asarray(1.0, dtype=cupy.float32) + cupy.asarray(1.0)
array(2.)  # float64
>>> cupy.asarray(1.0, dtype=cupy.float32) + 1.0
array(2., dtype=float32)

crusaderky commented Mar 31 '25

@tylerjereddy Thank you, there was indeed a bug in asarray. I fixed it now.

crusaderky commented Apr 01 '25

So you agree that there should be propagation from input to output.

Yes, except when you add some explicit request to not do that. For example by using xp.empty_like(x, device=foo).

How is y = x + 1 a few comments above different from it?

Because in the y = x + 1 example we explicitly requested to use Device(0) and x is not on that device. I read the use of the context manager as the user saying "I have thought about this and I want this to happen on Device(0)". The context manager is syntactic sugar for adding device=Device(0) to all function calls in a block.
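Spelled out, that reading would make the two forms below equivalent (hypothetical, since CuPy functions do not actually accept a device= keyword today):

with cp.cuda.Device(0):
    y = x + 1

# ... would be shorthand for something like:
y = cp.add(x, 1, device=cp.cuda.Device(0))  # hypothetical device= kwarg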

I agree it is a bit murky what "this" is. Is "this" the addition of x and 1 (the add instructions are executed on device 0, while accessing memory from device 1)? Is "this" the placement of the resulting array y (the result of the add instructions executing somewhere is moved to the memory of device 0 if needed)? Is "this" the creation of the implicit array for 1 (we are now trying to add an array on two different devices)? But, no matter what "this" is, it seems like the user is asking for something that isn't possible, because one of the inputs is on Device(1) and we should not implicitly move arrays. Ignoring the context manager also seems wrong.

betatim commented Apr 01 '25

@kmaehashi @leofang @asi1024 hello! We have a question about how to interpret CuPy's device context manager. The following seems clear:

with cp.cuda.Device(1):
    x = cp.asarray(1)  # x should be on 'current device' Device(1)
    y = cp.asarray(1, device=cp.cuda.Device(0))  # y should be on Device(0), as explicitly requested

But what about the following example:

with cp.cuda.Device(1):
    y = cp.asarray(1, device=cp.cuda.Device(0))  # y should be on Device(0)
    z = y + 1

How 'strong' is the context manager supposed to be here? It seems like there are at least a couple of options:

  1. Everything should be forced to be on Device(1) unless explicitly requested otherwise in the function call. Thus z should be on Device(1), and we should throw an exception if this can't happen.
  2. The context manager is expected to apply to array creation, but needn't stop z = y + 1 from propagating y's device to z. So z can be on Device(0).

https://github.com/data-apis/array-api-compat/pull/293#issuecomment-2766038194 suggests that in practice, the current CuPy implementation leans towards option (1), failing when peer access cannot be established. If so, is this a deliberate/crucial design aspect, or more accidental?

If (1) is by design, how far does it go? For example, should

with cp.cuda.Device(1):
    y = cp.asarray(1, device=cp.cuda.Device(0))  # y should be on Device(0)
    y = y

also raise an exception if y cannot be accessed on Device(1)?


I suppose there is also the more extreme option 3 where

with cp.cuda.Device(1):
    y = cp.asarray(1, device=cp.cuda.Device(0))

should itself throw an exception, with the context manager overriding the device argument, but that would directly contradict https://data-apis.org/array-api/draft/design_topics/device_support.html#semantics.

lucascolley commented Apr 01 '25

So you agree that there should be propagation from input to output.

Yes, except when you add some explicit request to not do that. For example by using xp.empty_like(x, device=foo).

This much is clear.

How is y = x + 1 a few comments above different from it?

Because in the y = x + 1 example we explicitly requested to use Device(0) and x is not on that device. I read the use of the context manager as the user saying "I have thought about this and I want this to happen on Device(0)". The context manager is syntactic sugar for adding device=Device(0) to all function calls in a block.

I agree it is a bit murky what "this" is. Is "this" the addition of x and 1 (the add instructions are executed on device 0, while accessing memory from device 1)? Is "this" the placement of the resulting array y (the result of the add instructions executing somewhere is moved to the memory of device 0 if needed)? Is "this" the creation of the implicit array for 1 (we are now trying to add an array on two different devices)? But, no matter what "this" is, it seems like the user is asking for something that isn't possible, because one of the inputs is on Device(1) and we should not implicitly move arrays. Ignoring the context manager also seems wrong.

So are you saying that the context should trump the device of the input array to a function (x + 1 is a function), but not the global default device? In other words, that the hierarchy in your opinion should be

  1. device= parameter of the function
  2. context manager
  3. device of input array(s) to the function
  4. global device

?
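Spelled out as code, that priority order would look something like this (purely illustrative; resolve_device is not a real API):

def resolve_device(explicit=None, context=None, input_devices=(), default=None):
    if explicit is not None:       # 1. device= parameter of the function
        return explicit
    if context is not None:        # 2. context manager
        return context
    devices = set(input_devices)   # 3. device of input array(s)
    if len(devices) > 1:
        raise ValueError("input arrays are on different devices")
    if devices:
        return devices.pop()
    return default                 # 4. global device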

crusaderky commented Apr 01 '25

Just to continue providing multi-device feedback: in the SciPy example we now get farther in the control flow before encountering the dreaded ValueError: The device where the array resides (1) is different from the current device (0). Peer access is unavailable between these devices.

That is perhaps slightly more surprising, since I don't see a context manager directly in the source. But confusion is a common theme here already, and the context manager may just be abstracted away in the shims (or is the global default device still overriding even without a context?), so I don't "see it."

Dissection at https://github.com/scipy/scipy/pull/22756#issuecomment-2770373028, but this is probably better discussed over here for now.

tylerjereddy commented Apr 01 '25

CuPy does not support the device= kwarg in the array constructors today, so some work has to happen first. But if we were to support it now, I'd be in favor of enforcing that the kwarg is honored (and that the global setting or local context manager is ignored, to the maximum extent possible).
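(For context, a compat shim can already emulate the kwarg for creation functions along these lines; a minimal sketch, not the actual array-api-compat code:)

import cupy as cp

def asarray(obj, /, *, device=None, **kwargs):
    # Honor an explicit device= by allocating under that device's
    # context; cupy.cuda.Device is a context manager.
    if device is None:
        return cp.asarray(obj, **kwargs)
    with device:
        return cp.asarray(obj, **kwargs)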

leofang commented Apr 01 '25

Dissection at scipy/scipy#22756 (comment), but this is probably better discussed over here for now.

xp.exp(x) disregards the device of x and instead uses the global default of 0. No context managers are involved here.

array-api-compat could wrap every single function so that CuPy respects the input argument's device, but it cannot do the same for array methods, e.g. __add__, so we're dead in the water unless CuPy itself changes things.
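A minimal sketch of what that per-function wrapping could look like (illustrative, not the actual compat code):

import functools

import cupy as cp

def _on_input_device(func):
    # Run func with the first CuPy input's device as the current
    # device, so the output is allocated on that device.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for a in args:
            if isinstance(a, cp.ndarray):
                with a.device:  # cupy.cuda.Device is a context manager
                    return func(*args, **kwargs)
        return func(*args, **kwargs)
    return wrapper

exp = _on_input_device(cp.exp)  # exp(x) would now follow x.device
# But y = x + 1 dispatches straight to cp.ndarray.__add__, which no
# wrapper can intercept.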

crusaderky commented Apr 02 '25

@leofang do you have an opinion on the desired behaviour here?

with cp.cuda.Device(1):
    y = cp.asarray(1, device=cp.cuda.Device(0))  # y should be on Device(0)
    z = y + 1  # device of z?

lucascolley commented Apr 03 '25

At yesterday's consortium meeting, everyone was in agreement that in

with cp.cuda.Device(1):
    y = cp.asarray(1, device=cp.cuda.Device(0))  # y should be on Device(0)
    z = y + 1  # device of z?

z should be on device 0. data-apis/array-api#919 makes that clear. However, CuPy maintainers were absent and did not get an opportunity to voice their opinion.

crusaderky commented Apr 18 '25

At yesterday's consortium meeting, everyone was in agreement that in

(...)

z should be on device 0. #919 makes that clear. However, CuPy maintainers were absent and did not get an opportunity to voice their opinion.

Sorry, I had a conflict and had to leave early yesterday. I read the meeting minutes and have nothing else to add other than reiterating what I've said earlier (https://github.com/data-apis/array-api-compat/pull/293#issuecomment-2770393347). I agree z should be on the same device as y, but technically it is not a CuPy "bug", just a lack of support for the Array API in the main namespace. (cupy.array_api is still a thing that should be removed in favor of making the main namespace compliant: https://github.com/cupy/cupy/issues/8470#issuecomment-2311516454.)

leofang commented Apr 18 '25