[SYCL][CUDA][PI][runtime][ABI-break] Add support for multi-device context
Introduces support for having multiple CUDA devices in one context.
To facilitate moving buffer and image memory between devices within the same context, some ABI-breaking changes had to be made to the runtime and PI interface.
This includes expanding the check for whether memory needs to be moved: instead of only checking whether the context is the same, we now also check whether the device is the same. This creates a performance regression for setups where multiple devices share memory within the same context, since they will now also get separate copies of memory allocations for each device. This will be resolved in a future pull request, when we introduce a memory migration PI API for direct transfers between devices without going through the host.
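As a rough illustration of the expanded check (a minimal sketch with hypothetical types and names, not the actual runtime code):

```cpp
// A minimal sketch, with hypothetical types and names, of the expanded
// check: previously an allocation was reused whenever the context matched;
// now the device must match too, otherwise a per-device copy is made.
struct Allocation {
  int CtxId; // context the memory was allocated for
  int DevId; // device the memory was allocated for
};

bool needsNewAllocation(const Allocation &A, int TargetCtx, int TargetDev) {
  // The device comparison is the new part of the check.
  return A.CtxId != TargetCtx || A.DevId != TargetDev;
}
```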
Tests in https://github.com/intel/llvm-test-suite/pull/1102.
/verify with https://github.com/intel/llvm-test-suite/pull/1102
Running the appropriate test-suite branch resolves some of the test failures. The other tests fail for me locally even on the sycl branch (and seem to test functionality unrelated to this PR anyway), so this is ready for review.
@kbobrovs @smaslov-intel @sergey-semenov friendly ping for a follow-up review after the most recent changes.
@t4c1, @AerialMantis we'd like to close the breaking-changes window according to the original plan (https://github.com/intel/llvm/blob/sycl/sycl/doc/developer/ABIPolicyGuide.md#abi-breaking-changes-window-june-15---september-15-2022).
At this point it seems this PR requires conflict resolution and a review from @sergey-semenov. Can you make it non-ABI-breaking and update the PR after the window closes?
@romanovvlad could you have a quick look at the `allocateMem` changes in this patch? The ABI tests seem to suggest they're ABI-breaking changes, but since your patch in https://github.com/intel/llvm/pull/6600 I believe they shouldn't be. Are they actually ABI breaking or is the test wrong?
Hi, the symbol is being exported, but cannot be referenced from the user app. In #6600 I forgot to remove `__SYCL_EXPORT` from the declaration of these classes, which makes them continue exporting symbols.
To be on the safe side I would suggest introducing a new symbol while keeping the old one with an `assert` (something like: "Should not be called") in the body.
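For illustration, a minimal sketch of that approach, using a hypothetical class name and placeholder PI handle types (the real code lives in the SYCL runtime headers):

```cpp
// A minimal sketch, with hypothetical names: keep the old exported symbol
// so existing binaries still link, but assert if it is ever reached, and
// move the real logic into a new device-aware overload.
#include <cassert>
#include <cstddef>

// Placeholder handles standing in for the real PI types.
using pi_context = void *;
using pi_device = void *;

class MemObjAllocatorSketch {
public:
  // Old symbol: kept only for ABI compatibility, should never be called.
  void *allocateMem(pi_context Ctx, std::size_t Size) {
    (void)Ctx;
    (void)Size;
    assert(false && "Should not be called");
    return nullptr;
  }
  // New symbol: the actual implementation, now taking the target device.
  void *allocateMem(pi_context Ctx, pi_device Dev, std::size_t Size);
};
```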
I'm picking this patch up, and I wanted to summarize again the changes happening here.
So the main goal is to support `sycl::context` with multiple devices for the CUDA plugin.
However, unlike OpenCL, CUDA doesn't have a direct way of managing allocations for multiple devices, so the plugin would need to manage all the context-level allocations and handle transfers between devices internally.
As I understand it this is what the LevelZero plugin does, and it's also how the OpenCL plugin does it, simply because that's how OpenCL works in the first place.
However the SYCL runtime already has a memory manager, so for the CUDA plugin we'd like to simply let the SYCL runtime manage memory allocations within a SYCL context for us.
To do this, there are two main changes we require:
- Do PI memory allocations on `pi_context` AND `pi_device`, rather than just `pi_context`.
  - This is possible because the SYCL runtime currently only does lazy allocations, so every time we want to do an allocation we know which device it should go on, not just which context. And since the allocation is already done lazily in the SYCL runtime, it doesn't make sense for the plugin to add a second level of lazy allocation by allocating memory for the context first and only moving it to the specific device later. So even for plugins that do manage allocations at the context level, having the device information may help optimize the initial allocation.
  - The only time we may not have the device when doing an allocation is if the `context_bound` property is used, but that property is not implemented in DPC++ and currently doesn't do anything. Even in that case we could still pass a device to PI; we would just need to pick one in the context, say the first device. This may not always be optimal, but it should work fairly well.
- Let the PI plugins inform the SYCL runtime on how allocations and transfers should be managed.
  - Introducing a `piextGetMemoryConnection` PI interface allowing the plugin to report how memory can be transferred or accessed between pairs of devices and contexts (see the sketch after this list):
    - `PI_MEMORY_CONNECTION_NONE`: memory in the first (context, device) pair cannot be used or migrated by the plugin into the second (context, device) pair; copies through host are necessary.
    - `PI_MEMORY_CONNECTION_MIGRATABLE`: memory in the first (context, device) pair cannot be used directly by the second (context, device) pair, but the plugin can handle migrating data between the two.
    - `PI_MEMORY_CONNECTION_UNIFIED`: memory in the first (context, device) pair is usable in the second pair.
  - With this model something like the OpenCL plugin can simply report `PI_MEMORY_CONNECTION_UNIFIED` when the contexts are the same and `PI_MEMORY_CONNECTION_NONE` otherwise, which maintains the current behavior. But a plugin like CUDA can always report `PI_MEMORY_CONNECTION_MIGRATABLE`, which informs the SYCL runtime that it needs to create separate allocations for devices within the same context for this plugin, and that copying the data should be handled by the plugin.
  - A side effect of this approach, on top of allowing the SYCL runtime to manage allocations within a context, is that it can also allow a plugin to optimize copies between contexts. Currently copies between contexts are always done through host, but if a plugin can handle them more efficiently it can simply report `PI_MEMORY_CONNECTION_MIGRATABLE` and the SYCL runtime will defer to it for the transfer.
  - Another thing to note is that if a plugin reports `PI_MEMORY_CONNECTION_UNIFIED` for different contexts, the SYCL runtime may re-use the same allocation for both contexts. I believe this should be fine, but it may break the idea that contexts are separated; currently, however, none of our plugins would make use of that.
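To make the shape of the new query concrete, here is a rough sketch of what the entry point could look like. The enum values are the ones listed above; the exact signature and parameter ordering are my assumption, following the pattern of other `piext*` functions, not a final interface:

```cpp
// Sketch only: the enum values match the ones described above, but the
// precise signature of piextGetMemoryConnection is an assumption, not the
// final PI interface.

// Placeholder handles standing in for the real PI types.
using pi_context = void *;
using pi_device = void *;
using pi_result = int;

typedef enum {
  PI_MEMORY_CONNECTION_NONE,       // copies must go through the host
  PI_MEMORY_CONNECTION_MIGRATABLE, // the plugin can migrate between pairs
  PI_MEMORY_CONNECTION_UNIFIED     // one allocation usable by both pairs
} pi_memory_connection;

// Reports how memory allocated for (ctx_a, dev_a) relates to (ctx_b, dev_b).
pi_result piextGetMemoryConnection(pi_context ctx_a, pi_device dev_a,
                                   pi_context ctx_b, pi_device dev_b,
                                   pi_memory_connection *connection);
```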
And a quick note about the ABI changes: adding the device to memory allocations changes the interface of methods that are still currently exported. However, they're not available to users, so as suggested by @romanovvlad I've kept the symbols but moved the functionality into a new overload, so I believe this shouldn't break the ABI between `libsycl.so` and user applications anymore. It does still break the ABI between `libsycl.so` and the PI plugins, but I believe that one is okay to break, as they should be shipped together at the moment.
#7808
What's the status of this, please?
So this is in a fairly good state (I do need to resolve conflicts), and it works well on CUDA platforms, but the CI does flag some issues on other platforms that still need to be investigated.
But even once that is resolved, I'm not sure we're ready to merge, because this does change the PI API in a way that works well for the SYCL runtime but may be problematic for other uses, so I've brought up the changes to the Unified Runtime:
- https://github.com/oneapi-src/unified-runtime/issues/53
So it might make sense to hold off on merging this until we discuss it further for the Unified Runtime.
/verify with https://github.com/intel/llvm-test-suite/pull/1102
Should these changes be mirrored in unified-runtime, @npmiller?