[SYCL][DOC] Initial commit of oneapi extension proposal for adding P2P
Thanks for posting this. I have a few questions:
- Just to be completely clear: Is the implication that if `ext_oneapi_can_access_peer(device_b)` returns true when called from device_a, then users can have kernels running on device_a that access a USM memory pointer, "ptr_b", allocated on device_b, and this will lead to a "peer access" (rather than a "peer copy") from device_a to the memory located on device_b?
- Do you have plans to support peer memory copies (as opposed to "peer accesses") from "ptr_b" to "ptr_a" on device_a within this proposal at a later date?
  With regard to this, I read through the proposed clarification of the peer access rules in SYCL 2020 Next here: https://github.com/gmlueck/SYCL-Docs/commit/76e1b4419f0d3ea0c3c4f596d8e2e40e93c74dfc ; it seems that references to "migratable USM" have been removed. I didn't read it thoroughly, but I didn't find an explicit mention of peer-to-peer USM copies (as opposed to peer-to-peer access). cc @gmlueck: Is this intentional?
- Do you have plans to support P2P for buffers?
  With regard to buffers, currently if one instantiates a buffer and then accesses it in a queue using device_a, followed by an `otherQueue` using device_b, the scheduler copies the memory from device_a to the host and then from the host to device_b, e.g. when using:
myQueue.submit([&](handler &cgh) {
  auto read = buffer_from_1D.get_access<access::mode::read>(cgh);
  auto write = buffer_to_1D.get_access<access::mode::write>(cgh);
  cgh.parallel_for<class copyH2D_1D>(
      buffer_from_1D.get_range(),
      [=](id<1> index) { write[index] = read[index] * -1; });
});
myQueue.wait();
otherQueue.submit([&](handler &cgh) {
  auto read = buffer_from_1D.get_access<access::mode::read>(cgh);
  auto write = buffer_to_1D.get_access<access::mode::write>(cgh);
  cgh.parallel_for<class copyH2D_1D_2nd>(
      buffer_from_1D.get_range(),
      [=](id<1> index) { write[index] = read[index] * 10; });
});
  The simplest way to leverage peer memory in this case is to allow a direct memory copy from device_a to device_b (allowing it only when the devices share a context). The implementation could be very similar to this scrapped implementation of a similar peer-to-peer copy for devices in different contexts (now deemed not allowed): https://github.com/intel/llvm/pull/4401
- Do you think that peer access should also be supported / is sensible for buffers? If so, do you have any thoughts on how this would be supported within the SYCL programming model?
Thanks
- Just to be completely clear: Is the implication that if `ext_oneapi_can_access_peer(device_b)` returns true when called from device_a, then users can have kernels running on device_a that access a USM memory pointer, "ptr_b", allocated on device_b, and this will lead to a "peer access" (rather than a "peer copy") from device_a to the memory located on device_b?
  Yes - a USM device pointer allocated on device_b will now be dereferenceable in kernels on device_a (see the sketch after this list).
- Do you have plans to support peer memory copies (as opposed to "peer accesses") from "ptr_b" to "ptr_a" on device_a within this proposal at a later date?
  That seems like an implementation detail. There's nothing that says the SYCL RT can't do this today.
- Do you have plans to support P2P for buffers? With regard to buffers, currently if one instantiates a buffer and then accesses it in a queue using device_a, followed by an `otherQueue` using device_b, the scheduler copies the memory from device_a to the host and then from the host to device_b.
  Not at this time.
- Do you think that peer access should also be supported / is sensible for buffers? If so, do you have any thoughts on how this would be supported within the SYCL programming model?
  No. P2P-type capabilities better align with USM device memory. As per #2, I think a RT could do optimized peer copies using the existing copy methods.
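For illustration, a minimal sketch of that peer access case (the device discovery, sizes and queue setup below are illustrative only and are not taken from the extension doc):

#include <sycl/sycl.hpp>

// Sketch of "peer access": a kernel submitted to device_a dereferences USM
// that was allocated on device_b; no explicit copy of the data is issued.
void peer_access_sketch() {
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  sycl::device device_a = gpus.at(0), device_b = gpus.at(1);
  sycl::queue queue_a{device_a}, queue_b{device_b};

  constexpr size_t N = 1024;
  int *ptr_b = sycl::malloc_device<int>(N, queue_b); // resident on device_b
  queue_b.fill(ptr_b, 0, N).wait();

  if (device_a.ext_oneapi_can_access_peer(
          device_b, sycl::ext::oneapi::peer_access::access_supported)) {
    device_a.ext_oneapi_enable_peer_access(device_b);

    // Loads/stores from device_a are serviced over the peer link.
    queue_a.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
      ptr_b[i] += 1;
    }).wait();

    device_a.ext_oneapi_disable_peer_access(device_b);
  }
  sycl::free(ptr_b, queue_b);
}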
I think this looks quite good from the point of view of the CUDA backend (apart from the one issue I describe below). I can try a simple implementation to make sure there are no other issues with implementing this for CUDA.
There is one point that I'd like to clarify: currently some backends (e.g. level_zero, where this is already implemented) can do direct P2P copies for buffers. I think we should consider whether or not users are required to call ext_oneapi_enable_peer_access before this buffer P2P optimization is enabled.
In the CUDA backend, in order to enable P2P copies (as well as P2P access) of memory from one device to another, it is necessary to call cuCtxEnablePeerAccess: see https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html#group__CUDA__PEER__ACCESS_1g0889ec6728e61c05ed359551d67b3f5a
This means that if we want to disentangle the buffer P2P optimization from the USM P2P access feature, then when the runtime performs a P2P buffer copy, the CUDA backend implementation of the buffer P2P copy optimization would need to:
- call ext_oneapi_enable_peer_access (which calls cuCtxEnablePeerAccess in the CUDA backend)
- perform the P2P copy
- call ext_oneapi_disable_peer_access (which calls cuCtxDisablePeerAccess in the CUDA backend)
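To make this concrete, here is a rough CUDA driver API sketch of what the enable -> copy -> disable sequence would amount to inside the backend. Error handling is omitted and the context/stream parameters are placeholders for whatever the plugin already holds; this is a sketch, not a proposed implementation:

#include <cuda.h>

// Rough sketch of the three steps above in the CUDA backend; ctx_a/ctx_b and
// stream are placeholders for state the plugin already tracks for each device.
void buffer_p2p_copy_sketch(CUcontext ctx_a, CUcontext ctx_b,
                            CUdeviceptr dst_on_a, CUdeviceptr src_on_b,
                            size_t bytes, CUstream stream) {
  cuCtxSetCurrent(ctx_a);

  // 1. ext_oneapi_enable_peer_access -> cuCtxEnablePeerAccess (flags must be 0).
  cuCtxEnablePeerAccess(ctx_b, 0);

  // 2. Perform the P2P copy between the two contexts.
  cuMemcpyPeerAsync(dst_on_a, ctx_a, src_on_b, ctx_b, bytes, stream);
  cuStreamSynchronize(stream);

  // 3. ext_oneapi_disable_peer_access -> cuCtxDisablePeerAccess.
  cuCtxDisablePeerAccess(ctx_b);
}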
I think this would mean that, in order to ensure that the peer access enabled for the buffer copy doesn't interfere with a user's use of this USM extension (via the expected result of `ext_oneapi_can_access_peer`), we would have to make the buffer copy block the host thread, to prevent it from calling `ext_oneapi_can_access_peer` while the buffer copy is in flight, since the query would temporarily return true during that window. Alternatively we could omit the `ext_oneapi_disable_peer_access` call at the end of the buffer copy, in which case `ext_oneapi_can_access_peer(peer, ext::oneapi::peer_access::access_enabled)` could return true without the user explicitly calling `ext_oneapi_enable_peer_access`.
Perhaps this is only an issue for the CUDA backend, but it already sounds pretty messy, and I think it could be a good idea to avoid these issues by connecting this USM extension with the expected behavior of any buffer P2P copy optimization, such that we could state:
The user is required to call `ext_oneapi_enable_peer_access` to switch on any available direct buffer P2P copy optimization between different devices.
What do you think @jbrodman @gmlueck? Would this connection between USM peer access and buffer peer copy be undesirable for the level_zero backend?
What do you think @jbrodman @gmlueck? Would this connection between USM peer access and buffer peer copy be undesirable for the level_zero backend?
Two comments:
- Would the buffer-copy optimization really call cuCtxEnablePeerAccess / cuCtxDisablePeerAccess for each copy operation? Somehow I thought these operations could be potentially slow, so enabling / disabling P2P on each copy could have disastrous performance. (I have no first-hand experience, so I might be wrong about the enable / disable APIs being slow.)
- Your comments about the interaction between the buffer-copy optimization and the P2P API assume that the APIs in this extension are implemented by directly calling CUDA APIs. Instead, the extension could maintain some internal state that virtualizes the enable / disable state. This would allow `ext_oneapi_can_access_peer(peer, ext::oneapi::peer_access::access_enabled)` to return `false` even if P2P access was currently enabled due to the buffer-copy optimization.
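For example (purely illustrative: the struct, the integer device keys and the member names below are made up and are not part of the extension), the runtime bookkeeping could look roughly like this:

#include <map>
#include <utility>

// Made-up sketch of runtime state that separates "user enabled P2P via
// ext_oneapi_enable_peer_access" from "runtime enabled P2P for a buffer copy".
struct PeerAccessState {
  std::map<std::pair<int, int>, bool> user_enabled;    // set by the extension API
  std::map<std::pair<int, int>, bool> runtime_enabled; // set for buffer-copy opts

  // What ext_oneapi_can_access_peer(..., access_enabled) reports: only the
  // user-requested state, even if the backend currently has P2P switched on
  // internally for a buffer copy.
  bool user_visible_enabled(int dev, int peer) const {
    auto it = user_enabled.find({dev, peer});
    return it != user_enabled.end() && it->second;
  }

  // Whether the backend actually needs cuCtxEnablePeerAccess active right now.
  bool backend_needs_enable(int dev, int peer) const {
    auto u = user_enabled.find({dev, peer});
    auto r = runtime_enabled.find({dev, peer});
    return (u != user_enabled.end() && u->second) ||
           (r != runtime_enabled.end() && r->second);
  }
};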
* Your comments about the interaction between the buffer-copy optimization and the P2P API assume that the APIs in this extension are implemented by directly calling CUDA APIs. Instead, the extension could maintain some internal state that virtualizes the enable / disable state. This would allow `ext_oneapi_can_access_peer(peer, ext::oneapi::peer_access::access_enabled)` to return `false` even if P2P access was currently enabled due to the buffer-copy optimization.
Sure, we could deal with that in the CUDA backend if we can infer the location of the pointers provided to `q.memcpy`. I don't immediately know the most straightforward way to do that, because I don't see a CUDA driver API query that takes a pointer and returns whether it is device or host allocated, but I can ask whether anyone knows if such a query exists. (Of course we could fall back to some more complex/slower/heavier implementation to retrieve the locations (host or device) of the pointers, but I don't think anyone will argue this is desirable if avoidable.)
We would need to know this so that we could move away from the current CUDA implementation of `q.memcpy`, which uses `cuMemcpyAsync` and lets the driver infer the type of copy (D2H, H2D, D2D) from the pointer addresses. If we managed to do this, then in the D2D case we could use the information provided by `ext_oneapi_can_access_peer` to call `cuMemcpyDtoHAsync` followed by `cuMemcpyHtoDAsync` instead of `cuMemcpyAsync` (or `cuMemcpyPeer`) when `ext_oneapi_can_access_peer` returns false for the access query (or `cuMemcpyDtoDAsync` if the D2D copy is intra-device). If `cuCtxEnablePeerAccess` has been called (returning no error, and the corresponding disable API has not been called), then any call to `cuMemcpyAsync` (or `cuMemcpyPeer`) will do a P2P copy when the source and destination are on different peer-access-enabled devices. If `cuCtxEnablePeerAccess` has not been called, then any call to `cuMemcpyAsync` (or `cuMemcpyPeer`) will do a D2H + H2D copy instead.
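Roughly the kind of dispatch I mean is sketched below. The `Loc` classification and the `host_staging` buffer are hypothetical placeholders, since how to obtain the pointer locations cheaply is exactly the open question above:

#include <cuda.h>

// Hypothetical classification of a USM pointer relative to the active device.
enum class Loc { Host, DeviceSame, DevicePeer };

// Sketch of choosing a copy path once the pointer locations are known.
// peer_access_enabled reflects ext_oneapi_can_access_peer(..., access_enabled).
void memcpy_dispatch_sketch(CUdeviceptr dst, CUdeviceptr src, size_t bytes,
                            Loc dst_loc, Loc src_loc, bool peer_access_enabled,
                            void *host_staging, CUstream stream) {
  if (src_loc == Loc::DevicePeer && dst_loc == Loc::DeviceSame) {
    if (peer_access_enabled) {
      // With peer access enabled the driver performs a true P2P copy.
      cuMemcpyAsync(dst, src, bytes, stream);
    } else {
      // Explicit D2H + H2D staging copy (ordered on the same stream).
      cuMemcpyDtoHAsync(host_staging, src, bytes, stream);
      cuMemcpyHtoDAsync(dst, host_staging, bytes, stream);
    }
  } else if (src_loc == Loc::DeviceSame && dst_loc == Loc::DeviceSame) {
    cuMemcpyDtoDAsync(dst, src, bytes, stream); // intra-device copy
  } else {
    cuMemcpyAsync(dst, src, bytes, stream); // let the driver infer H2D/D2H
  }
}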
Note that even if we did this, it would not completely remove the interactions between buffer P2P and USM P2P, although I think the remaining interactions would only matter for rather unusual use cases, for which we need to consider the limit on the number of active peer connections (active meaning `cuCtxEnablePeerAccess` has been called and returned without error):
For CUDA devices this maximum number of peers is determined as follows:
- On non-NVSwitch enabled systems, each device can support a system-wide maximum of eight peer connections.
- When NVSwitch is used, the maximum number of peers per device is given in a table at the following link: https://www.nvidia.com/en-gb/data-center/nvlink/ (6 for Volta, 12 for Ampere, 18 for Hopper).
So, in a hypothetical system of 9 non-NVSwitch peers where a single device already has 8 active connections for buffer P2P, any user call to `ext_oneapi_enable_peer_access` that tries to enable a connection from that device to the remaining peer would have to handle the `CUDA_ERROR_TOO_MANY_PEERS` returned by `cuCtxEnablePeerAccess` gracefully and return false from `ext_oneapi_enable_peer_access`, until one of the peer connections used by buffer P2P is disabled.
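In other words, something along these lines in the plugin (a sketch only; the helper name is made up, and how the failure is surfaced through the extension API is up to the spec):

#include <cuda.h>

// Sketch: report failure gracefully when the device has run out of peer
// connections (e.g. the hypothetical 9-device non-NVSwitch system above).
bool enable_peer_access_sketch(CUcontext peer_ctx) {
  CUresult res = cuCtxEnablePeerAccess(peer_ctx, 0);
  if (res == CUDA_SUCCESS || res == CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED)
    return true;
  if (res == CUDA_ERROR_TOO_MANY_PEERS)
    return false; // all system-wide peer connections are already in use
  // Any other error is unexpected here; surface it to the SYCL runtime.
  return false;
}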
I guess that there is a similar max peer constraint for level_zero?
* Would the buffer-copy optimization really call cuCtxEnablePeerAccess / cuCtxDisablePeerAccess for each copy operation? Somehow I thought these operations could be potentially slow, so enabling / disabling P2P on each copy could have disastrous performance. (I have no first-hand experience, so I might be wrong about the enable / disable APIs being slow.)
My testing experience has been that the execution times of `cuCtxEnablePeerAccess` and `cuCtxDisablePeerAccess` are quite a lot shorter than even very small copies (the smallest I checked last year was of the order of 10-100 bytes). Note that I was not advocating calling `cuCtxEnablePeerAccess` and `cuCtxDisablePeerAccess` for every buffer P2P copy! I just want to put the behaviour/constraints of the CUDA peer functionality out there for you, and it is easier to do this by way of examples. There is also the question of the HIP peer functionality. At the level of the runtime, the APIs used in HIP for the peer capabilities match CUDA; however, we haven't looked into the lower-level implementation. It's probably a good first guess that the constraints of the HIP peer implementation will match CUDA, although this isn't guaranteed until we check.
For completeness, the other thing to mention for the CUDA case is that, unlike the corresponding level_zero case, peer access is granted between cuContexts rather than cuDevices: see the declaration of `cuCtxEnablePeerAccess`. This is an implementation detail (I'm not saying it should motivate breaking your proposed extension), but it is probably worth mentioning how it interacts with the current state of the DPC++ runtime and PI_CUDA. Basically, the fact that we can have multiple sycl::contexts mapped to a single sycl::device, and then multiple cuContexts per cuDevice, together with the fact that sycl::device currently knows nothing about sycl::context, complicates things (of course there are ways we can work around this).
As an aside, this is another small issue motivating why we are interested in the context questions that still don't appear to have clear answers: does the added complexity of the sycl::context abstraction (and the constraints that it imposes) have a real use case for a) DPC++ target users (SYCL spec + DPC++ extensions scope), and b) the wider set of all SYCL target users (SYCL spec scope)?
For the purpose of easily testing the performance of queue.memcpy for different numbers of concurrent peer connections, I had to use a free function to switch on the access:
sycl::ext::oneapi::experimental::ext_oneapi_enable_peer_access(
    const queue& active, const queue& peer);
This works because sycl::queue knows about sycl::context. Again, I'm not suggesting the device member function API change at all; rather I'm providing this information to try to describe the CUDA peer functionality and point out how it interacts with the sycl::device, queue, and context concepts. Hopefully, once we know all the requirements of level_zero, CUDA and others, the most natural solution will become more obvious.
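For reference, the way I used it in those tests was roughly the following (illustrative only, assuming `Queues` holds queues on distinct peer-capable devices; this is not a proposal to change the member-function API in the extension):

#include <sycl/sycl.hpp>
#include <vector>

// Illustrative benchmark setup: enable peer access from Queues[0]'s device to
// each of the other queues' devices using the queue-based free function above,
// since the queues carry the context needed to find the right cuContext pair.
void enable_all_peers_sketch(std::vector<sycl::queue> &Queues) {
  for (size_t peer = 1; peer < Queues.size(); ++peer) {
    sycl::ext::oneapi::experimental::ext_oneapi_enable_peer_access(
        Queues[0], Queues[peer]);
  }
  // ... then time Queues[0].memcpy(...) against each peer's allocation ...
}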
@jbrodman @gmlueck It looks like https://github.com/intel/llvm/pull/8197 will be merged soon. This PR makes it much easier for the Nvidia backend to implement this extension. I've implemented this P2P proposal in the Nvidia backend on top of #8197 here: https://github.com/JackAKirk/llvm/tree/P2P-primary-ctxt. When #8197 is merged I can open a PR for this draft implementation.
Here are some small issues I came across:
- The docs for `cuDeviceGetP2PAttribute` here (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html#group__CUDA__PEER__ACCESS_1g496bdaae1f632ebfb695b99d2c40f19e) (and for `cudaDeviceGetP2PAttribute` in the Runtime API) are basically wrong where they say "CU_DEVICE_P2P_ATTRIBUTE_ACCESS_SUPPORTED P2P: 1 if P2P Access is enable." This is wrong because (as you would actually expect from its name) when the `CU_DEVICE_P2P_ATTRIBUTE_ACCESS_SUPPORTED` attribute is used in `cuDeviceGetP2PAttribute` it is functionally identical to `cudaDeviceCanAccessPeer`: i.e. it corresponds to `peer_access::access_supported` in this proposal, and `peer_access::access_enabled` is not natively supported by Nvidia. I have explicitly verified this using both the CUDA Runtime and Driver APIs.
- Another difference with respect to the level_zero functions (https://spec.oneapi.io/level-zero/latest/core/api.html#zedevicegetp2pproperties) is that if `PeerDevice == Device` then these Nvidia queries throw errors: Nvidia does not see a device as a peer to itself, whereas the L0 spec (and this extension doc) does. In the draft implementation I have made it so that a device is always seen as a peer to itself, and that the Native_Atomics query also always returns true for `PeerDevice == Device`.
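Concretely, the guard in the draft implementation looks roughly like this (a sketch of the draft, not the final code; the helper name is made up):

#include <cuda.h>

// Sketch of the plugin-side guard: CUDA refuses PeerDevice == Device queries,
// so answer them in the plugin instead. The same guard covers both the access
// query and the native-atomics query.
int can_access_peer_sketch(CUdevice dev, CUdevice peer,
                           CUdevice_P2PAttribute attr) {
  if (dev == peer) {
    // A device is always treated as a peer to itself, and native atomics to
    // itself are always reported as supported.
    return 1;
  }
  int value = 0;
  cuDeviceGetP2PAttribute(&value, attr, dev, peer);
  return value;
}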
I will add some corresponding tests that exercise all this functionality soon and link them for you. I thought I should just let you know what I found from my investigations. Apart from the fact that `peer_access::access_enabled` cannot be natively supported by Nvidia, the extension spec looks good for the CUDA backend.
It could be good to include or link to example usage in this doc: something based on / similar to:
// note: practically it could also be good to provide clear directions to
// documentation showing users how to make sure they are constructing queues
// using distinct devices.
auto Dev0 = Queues[0].get_device();
auto Dev1 = Queues[1].get_device();
int *arr0 = malloc<int>(N, Queues[0], usm::alloc::device);
int *arr1 = malloc<int>(N, Queues[1], usm::alloc::device);
// note: in real use would obviously load/set arr0/arr1 with meaningful data.
if (Dev0.ext_oneapi_can_access_peer(
Dev1, sycl::ext::oneapi::peer_access::access_supported)) {
Dev0.ext_oneapi_enable_peer_access(Dev1);
}
// Copy will be made by P2P if supported
Queues[0].copy(arr0, arr1, N).wait();
// Copy will not be made by P2P since it has not been enabled
Queues[1].copy(arr1, arr0, N).wait();
if (Dev0.ext_oneapi_can_access_peer(
Dev1, sycl::ext::oneapi::peer_access::access_supported)) {
Dev0.ext_oneapi_disable_peer_access(Dev1);
}
// Copy will not be made by P2P because it has now been disabled
Queues[0].copy(arr0, arr1, N).wait();
And then another example for peer access (the above is an example of peer copy); a possible sketch is below. Also, I have apparently discovered that P2P via NVLink on Nvidia hardware is bi-directional, such that `Dev0.ext_oneapi_enable_peer_access(Dev1);` also enables Dev1 to do a P2P copy to Dev0. I think that this is not the case via PCIe, but I need to check this.
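For example, a peer access example paralleling the copy example above could look like the following (continuing with the same hypothetical `Queues`, `Dev0`/`Dev1` and `arr0`/`arr1` as in that snippet; sketch only):

if (Dev0.ext_oneapi_can_access_peer(
        Dev1, sycl::ext::oneapi::peer_access::access_supported)) {
  Dev0.ext_oneapi_enable_peer_access(Dev1);

  // Kernel runs on Dev0 (Queues[0]) but dereferences arr1, which was
  // allocated on Dev1: a peer access rather than a peer copy.
  Queues[0].parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
    arr0[i] = arr1[i] * 2;
  }).wait();

  Dev0.ext_oneapi_disable_peer_access(Dev1);
}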
Hi @jbrodman @gmlueck
Just to confirm so I can update the implementation:
- Is the extension OK'd now by SYCLomatic?
- I take it that you still want `access_enabled`? That is fine; I just wanted to double-check that this is still the case now that we know it isn't natively supported by CUDA, unlike what the CUDA docs imply.