[SYCL][CUDA][PI][runtime][ABI-break] Add support for multi-device context
Introduces support for having multiple CUDA devices in one context.
To facilitate moving buffer and image memory between devices within the same context, some ABI-breaking changes had to be made to the runtime and PI interface.
This includes expanding the check for whether memory needs to be moved: instead of only checking whether the context is the same, we now also check whether the device is the same. This creates a performance regression for setups where multiple devices share memory within the same context, since they will now also get separate copies of memory allocations for each device. This will be resolved in a future pull request, when we introduce a memory migration PI API for direct transfers between devices without going through the host.
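As a rough illustration of the expanded check (a minimal sketch with hypothetical types and names, not the actual runtime code):

```cpp
// A minimal sketch, with hypothetical types and names, of the expanded
// check: previously an allocation was reused whenever the context matched;
// now the device must match too, otherwise a per-device copy is made.
struct Allocation {
  int CtxId; // context the memory was allocated for
  int DevId; // device the memory was allocated for
};

bool needsNewAllocation(const Allocation &A, int TargetCtx, int TargetDev) {
  // The device comparison is the new part of the check.
  return A.CtxId != TargetCtx || A.DevId != TargetDev;
}
```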
Tests in https://github.com/intel/llvm-test-suite/pull/1102.
/verify with https://github.com/intel/llvm-test-suite/pull/1102
Running the appropriate test-suite branch resolves some of the test failures. The other tests fail for me locally even on the sycl branch (and seem to test functionality unrelated to this PR anyway), so this is ready for review.
@kbobrovs @smaslov-intel @sergey-semenov friendly ping for a follow-up review after the most recent changes.
@t4c1, @AerialMantis we'd like to close the breaking-changes window according to the original plan (https://github.com/intel/llvm/blob/sycl/sycl/doc/developer/ABIPolicyGuide.md#abi-breaking-changes-window-june-15---september-15-2022).
At this point it seems this PR requires conflict resolution and a review from @sergey-semenov. Can you make it non-ABI-breaking and update the PR after the window closes?
@romanovvlad could you have a quick look at the `allocateMem` changes in this patch? The ABI tests seem to suggest they're ABI-breaking changes, but since your patch in https://github.com/intel/llvm/pull/6600 I believe they shouldn't be. Are they actually ABI breaking or is the test wrong?
Hi, the symbol is being exported, but cannot be referenced from the user app. In #6600 I forgot to remove `__SYCL_EXPORT` from the declaration of these classes, which makes them continue exporting symbols.
To be on the safe side I would suggest introducing a new symbol while keeping the old one with an `assert` (something like: "Should not be called") in the body.
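For illustration, a minimal sketch of that approach, using a hypothetical class name and placeholder PI handle types (the real code lives in the SYCL runtime headers):

```cpp
// A minimal sketch, with hypothetical names: keep the old exported symbol
// so existing binaries still link, but assert if it is ever reached, and
// move the real logic into a new device-aware overload.
#include <cassert>
#include <cstddef>

// Placeholder handles standing in for the real PI types.
using pi_context = void *;
using pi_device = void *;

class MemObjAllocatorSketch {
public:
  // Old symbol: kept only for ABI compatibility, should never be called.
  void *allocateMem(pi_context Ctx, std::size_t Size) {
    (void)Ctx;
    (void)Size;
    assert(false && "Should not be called");
    return nullptr;
  }
  // New symbol: the actual implementation, now taking the target device.
  void *allocateMem(pi_context Ctx, pi_device Dev, std::size_t Size);
};
```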
I'm picking this patch up, and I wanted to summarize again the changes happening here.
So the main goal is to support `sycl::context` with multiple devices for the CUDA plugin.
However, unlike OpenCL, CUDA doesn't have a direct way of managing allocations for multiple devices, so the plugin would need to manage all the context-level allocations and handle transfers between devices internally.
As I understand it this is what the LevelZero plugin does, and it's also how the OpenCL plugin does it, simply because that's how OpenCL works in the first place.
However the SYCL runtime already has a memory manager, so for the CUDA plugin we'd like to simply let the SYCL runtime manage memory allocations within a SYCL context for us.
To do this, there are two main changes we require:
- Do PI memory allocations on `pi_context` AND `pi_device`, rather than just `pi_context`.
  - This is possible because the SYCL runtime currently only does lazy allocations, so every time we want to do an allocation we know which device it should go on, not just which context. And since the allocation is already done lazily in the SYCL runtime, it doesn't make sense for the plugin to add a second level of lazy allocation by allocating memory for the context first and only moving it to the specific device later. So even for plugins that do manage allocations at the context level, having the device information may help optimize the initial allocation.
  - The only time we may not have the device when doing an allocation is if the `context_bound` property is used, but that property is not implemented in DPC++ and currently doesn't do anything. Even in that case we could still pass a device to PI; we would just need to pick one in the context, say the first device. This may not always be optimal, but it should work fairly well.
- Let the PI plugins inform the SYCL runtime on how allocations and transfers should be managed.
  - Introducing a `piextGetMemoryConnection` PI interface allowing the plugin to report how memory can be transferred or accessed between pairs of devices and contexts (see the sketch after this list):
    - `PI_MEMORY_CONNECTION_NONE`: memory in the first (context, device) pair cannot be used or migrated by the plugin into the second (context, device) pair; copies through host are necessary.
    - `PI_MEMORY_CONNECTION_MIGRATABLE`: memory in the first (context, device) pair cannot be used directly by the second (context, device) pair, but the plugin can handle migrating data between the two.
    - `PI_MEMORY_CONNECTION_UNIFIED`: memory in the first (context, device) pair is usable in the second pair.
  - With this model something like the OpenCL plugin can simply report `PI_MEMORY_CONNECTION_UNIFIED` when the contexts are the same and `PI_MEMORY_CONNECTION_NONE` otherwise, which maintains the current behavior. But a plugin like CUDA can always report `PI_MEMORY_CONNECTION_MIGRATABLE`, which informs the SYCL runtime that it needs to create separate allocations for devices within the same context for this plugin, and that copying the data should be handled by the plugin.
  - A side effect of this approach, on top of allowing the SYCL runtime to manage allocations within a context, is that it can also allow a plugin to optimize copies between contexts. Currently copies between contexts are always done through host, but if a plugin can handle them more efficiently it can simply report `PI_MEMORY_CONNECTION_MIGRATABLE` and the SYCL runtime will defer to it for the transfer.
  - Another thing to note is that if a plugin reports `PI_MEMORY_CONNECTION_UNIFIED` for different contexts, the SYCL runtime may re-use the same allocation for both contexts. I believe this should be fine, but it may break the idea that contexts are separated; currently, however, none of our plugins would make use of that.
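To make the shape of the new query concrete, here is a rough sketch of what the entry point could look like. The enum values are the ones listed above; the exact signature and parameter ordering are my assumption, following the pattern of other `piext*` functions, not a final interface:

```cpp
// Sketch only: the enum values match the ones described above, but the
// precise signature of piextGetMemoryConnection is an assumption, not the
// final PI interface.

// Placeholder handles standing in for the real PI types.
using pi_context = void *;
using pi_device = void *;
using pi_result = int;

typedef enum {
  PI_MEMORY_CONNECTION_NONE,       // copies must go through the host
  PI_MEMORY_CONNECTION_MIGRATABLE, // the plugin can migrate between pairs
  PI_MEMORY_CONNECTION_UNIFIED     // one allocation usable by both pairs
} pi_memory_connection;

// Reports how memory allocated for (ctx_a, dev_a) relates to (ctx_b, dev_b).
pi_result piextGetMemoryConnection(pi_context ctx_a, pi_device dev_a,
                                   pi_context ctx_b, pi_device dev_b,
                                   pi_memory_connection *connection);
```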
And a quick note about the ABI changes: adding the device to memory allocations changes the interface of methods that are still currently exported. However, they're not available to users, so as suggested by @romanovvlad I've kept the symbols but moved the functionality into a new overload, so I believe this shouldn't break the ABI between `libsycl.so` and user applications anymore. It does still break the ABI between `libsycl.so` and the PI plugins, but I believe that one is okay to break, as they should be shipped together at the moment.
#7808
What's the status of this, please?
So this is in a fairly good state (I do need to resolve conflicts), and it works well on CUDA platforms, but the CI does flag some issues on other platforms that still need to be investigated.
But even once that is resolved, I'm not sure we're ready to merge, because this does change the PI API in a way that works well for the SYCL runtime but may be problematic for other uses, so I've brought up the changes to the Unified Runtime:
- https://github.com/oneapi-src/unified-runtime/issues/53
So it might make sense to hold off on merging this until we discuss it further for the Unified Runtime.
/verify with https://github.com/intel/llvm-test-suite/pull/1102
Should these changes be mirrored in unified-runtime, @npmiller?