[SYCL] Partial implementation of sycl_ext_oneapi_cuda_cluster_group
This PR is a partial implementation of sycl_ext_oneapi_cuda_cluster_group, introducing the cluster_size property to launch a kernel with CUDA's thread block clusters.
Only a small part of the extension specification described in https://github.com/intel/llvm/pull/13594 is implemented here; specifically, everything after the section "Launching a kernel with a cluster_group" is not included in this PR. Importantly, this PR still covers a fully functional use case of Nvidia's CUDA driver cluster launch feature for its primary purpose, which is to accelerate cross-work-group collective operations (particularly for GEMM) by leveraging asynchronous multi-casting of distributed shared memory across work-groups.
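As a rough sketch of what such a launch could look like (the property and namespace spellings follow my reading of the proposal in https://github.com/intel/llvm/pull/13594 and the sycl_ext_oneapi_enqueue_functions extension, so treat names like `cluster_size` and `launch_config` as illustrative rather than authoritative):

```cpp
#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q;

  // Request clusters of 2 work-groups along the single launch dimension.
  // Property name is taken from the extension proposal; it may still change.
  syclex::cuda::cluster_size<1> cluster_dims{sycl::range<1>{2}};
  syclex::properties launch_props{cluster_dims};

  q.submit([&](sycl::handler &cgh) {
    // The launch piggybacks on the enqueue functions extension: the property
    // travels in a launch_config alongside the nd_range.
    syclex::nd_launch(
        cgh,
        syclex::launch_config(sycl::nd_range<1>{{128}, {64}}, launch_props),
        [=](sycl::nd_item<1>) {
          // Work-groups in the same cluster can cooperate, e.g. through
          // distributed shared memory, on sm_90+ devices.
        });
  });
  q.wait();
  return 0;
}
```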
This is a high priority feature that is targeted for the next release.
The other parts of the extension specification described in https://github.com/intel/llvm/pull/13594, primarily related to the "cluster_group" abstraction is a (user-facing) convenience abstraction that is not required to be exposed in libraries that optimize such library collective operations (GEMM). Please therefore focus reviews of this PR on the relevant aspects of the extension that are required for the implementation in this PR and the library based application of it as described in this message.
I'll add the mock PI test that the CI is complaining is missing.
Have you tested the interaction of cluster groups and kernel fusion, i.e., do you expect the combination of cluster groups and kernel fusion to work?
Not with the current set of changes, no. In theory, though, depending on the order of operations, it should be possible to fuse a set of kernels of which one uses clusters, but this is not supported at the moment. I will disable it as you suggest.
Hi @sergey-semenov
It would be great to get your review on this from a runtime perspective. There are still a couple of failures related to the ABI tests, but we would ideally like your feedback on the implementation now, in case you think there are any issues related to ABI etc. This is a priority feature aimed at the next release, so we'd like any feedback early on to make sure we can get it merged on time.
Many thanks
The tests for this PR depend on - https://github.com/oneapi-src/unified-runtime/pull/1765
You can point this PR to use that UR commit.
Here is an example doing this: https://github.com/intel/llvm/pull/13860#discussion_r1622665825
Could use some input regarding the ABI size test. The test currently fails because the ABI size has increased.
Is the correct fix to change the value in that test to the one being reported? @hdelan @sergey-semenov
@AD2605 That's right.
Do we have guarantees that this won't break when composed with other kernels?
Sorry, I do not understand what you are asking for here, could you please elaborate?
Have a look at all the different use cases that are tested for other enqueue operations, like host tasks, for inspiration.
This part of the implementation of the spec is, in essence, just the addition of a launch property; the launch mechanism piggybacks on the enqueue functions extension, so whatever enqueue functions support, this will support as well.
Sorry, I do not understand what you are asking for here, could you please elaborate?
This is calling a special kernel launch entry point in UR. Parallel fors do the same thing: they end up calling piEnqueueKernelLaunch. In sycl-e2e we test that parallel fors work in a number of ways:
- Do they do the work that we expect them to do? I.e. will the kernel write some memory that can then be read on the host? Your test is doing this already, so that's OK.
What this test is missing is:
- Do they honour the dependencies that we pass to them? I.e. if I do:
auto e = parallel_for(...);
parallel_for(..., e);
Will the second kernel only begin once the first has completed? We need to test this with buffers and accessors as well as with explicit event dependencies. Will this also work if one of the dependencies is a normal parallel_for/host_task, etc.?
I would like to see some testing that makes sure that the normal kernel launch operations still honour dependencies. At the moment I am not satisfied that any existing test could find a race condition if dependencies were being modelled incorrectly.
This part of the implementation of the spec is, in essence, just the addition of a launch property; the launch mechanism piggybacks on the enqueue functions extension, so whatever enqueue functions support, this will support as well.
This is calling a unique UR entry point, which is not called elsewhere in DPC++. As such it needs to be tested in isolation without assuming that it is getting functionality from free functions etc.
I would like to see some testing that makes sure that the normal kernel launch operations still honour dependencies. At the moment I am not satisfied that any existing test could find a race condition if dependencies were being modelled incorrectly.
@hdelan, I added a test for this, to check whether or not event dependencies are being correctly modelled/honoured.
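For reference, the kind of dependency check being discussed could be sketched roughly like this (a simplified USM example, not the literal test added in the PR; in the real test the second submission is the one carrying the cluster launch property):

```cpp
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cassert>
#include <vector>

int main() {
  constexpr size_t N = 1024;
  sycl::queue q;
  int *data = sycl::malloc_device<int>(N, q);

  // First kernel initialises the data.
  sycl::event init = q.parallel_for(sycl::range<1>{N},
                                    [=](sycl::id<1> i) { data[i] = 1; });

  // Second kernel declares an explicit dependency on the first; in the real
  // test this is the submission that carries the cluster launch property.
  sycl::event update = q.submit([&](sycl::handler &cgh) {
    cgh.depends_on(init);
    cgh.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) { data[i] += 1; });
  });

  // Verify: every element must be 2. Stale values (0 or 1) would indicate
  // that the dependency was not honoured and the kernels raced.
  std::vector<int> host(N);
  q.copy(data, host.data(), N, update).wait();
  assert(std::all_of(host.begin(), host.end(), [](int v) { return v == 2; }));

  sycl::free(data, q);
  return 0;
}
```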
That's good stuff, thanks.
@intel/sycl-graphs-reviewers @intel/unified-runtime-reviewers @intel/dpcpp-nativecpu-pi-reviewers
Would it be possible to get a review here?
Thanks
The Arc node failure is unrelated. Arc tests are only for joint_matrix anyway, so they shouldn't be affected by these changes.
I think that the AddressSanitizer failures are unrelated, @intel/dpcpp-tools-reviewers would you be able to confirm?
Thanks
LGTM, some minor comments about the E2E tests
Thanks, will fix those.
@intel/dpcpp-tools-reviewers
Would it be possible for you to review this now?
Thanks
@intel/unified-runtime-reviewers Could you please give this a review?
@intel/dpcpp-tools-reviewers this is passing all tests (the Windows CI failure is an unrelated issue seen in other PRs) and is just waiting for your review. This PR is blocking a feature marked highest priority. I keep having to resolve conflicts to keep it mergeable, and the absolute deadline is quickly approaching, so I would appreciate a review ASAP.
This is up to date with the sycl branch. Please do not cancel it if there is no longer an issue in the sycl branch, as per the latest communication.
Device config file changes LGTM
Many thanks!
@intel/llvm-gatekeepers
Please merge this asap. I'd be very grateful. Thanks.
This has passed all tests. With 50 changed files it is a race against time to avoid further conflicts occurring.
This is a highest priority PR that needs to be merged asap.
Most of my comments are small and could be addressed in a follow-up.
However, I am a little more concerned about the ABI break. Technically speaking, we are allowed to do it (although the PR should still make it clear that it is breaking), but changing the handler layout is one of the bigger breaks, and I think we should consider whether we should/need to do it.
In the meantime, it would be good to have some unittests for this change, if possible.
Regarding unittests, for example a check that the scheduler logic correctly calls the PI function: I intentionally did not add this because PI is being removed in a week or so, and, as far as I understand, the unittests will then have to be more or less remade to deal with UR directly.
If there is already a decided upon way to deal with unittests going forward that I can add at this point I would be willing to do so. Alternatively I can make sure to add such a test in a follow up.
There is a series of integration tests in test-e2e that @AD2605 added which test the various ways the cluster_dims property can be passed through the scheduler. Whilst not as specific as what I described above, they do at least cover any such failures of the scheduler part, even if they do not pinpoint them as precisely in isolation.
If there is already a decided upon way to deal with unittests going forward that I can add at this point I would be willing to do so. Alternatively I can make sure to add such a test in a follow up.
If it avoids double-work, I am alright with waiting a bit. I don't know of any current alternative.
There is a series of integration tests in test-e2e that @AD2605 added which test the various ways the cluster_dims property can be passed through the scheduler. Whilst not as specific as what I described above, they do at least cover any such failures of the scheduler part, even if they do not pinpoint them as precisely in isolation.
I still think there is value in unittests, especially for a feature like this where the scheduler needs to propagate the additional information, take special paths because of it and call new UR APIs. Even if the E2E tests cover it, a unittest would likely make it a lot easier to isolate the cause of failure.
Yeah I see it makes sense to add a unittest. I'll add one after PI is removed. If you are otherwise happy to approve, this can be merged now. Thanks