
[SYCL] Fix broken uniqueness in make_device

Open JackAKirk opened this issue 3 years ago • 7 comments

This is a general fix for https://github.com/intel/llvm/issues/6055. CUDA device interop is not available yet, but a corresponding fix will be added to the CUDA specialization of make_device in https://github.com/intel/llvm/pull/6202 shortly.

If it is uncontroversial, I will add an overload of device::get_devices that takes a backend argument and returns only the devices corresponding to that backend. This would allow removing the dev.get_backend() == Backend check introduced in this PR and make the implementation more efficient.
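For concreteness, the intended semantics of that overload can be written as a free helper on top of the existing API (a minimal sketch; get_devices_of_backend is a hypothetical name used only for illustration):

```cpp
#include <sycl/sycl.hpp>

#include <vector>

// Hypothetical helper illustrating the proposed overload's semantics:
// return only the devices whose platform belongs to the given backend.
std::vector<sycl::device> get_devices_of_backend(sycl::backend Backend) {
  std::vector<sycl::device> Result;
  for (const sycl::device &Dev : sycl::device::get_devices())
    if (Dev.get_backend() == Backend)
      Result.push_back(Dev);
  return Result;
}
```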

Note that I expect there could be similar uniqueness issues for other interop types such as context. We will be considering such issues in https://github.com/intel/llvm/issues/6055.

I think interop tests based on the code sample in https://github.com/intel/llvm/issues/6055 would be useful. I can add such a test for device to intel/llvm-test-suite.

Signed-off-by: JackAKirk [email protected]

JackAKirk avatar May 26 '22 17:05 JackAKirk

Why is this not relying on the existing device-cache? I would expect that the new "device_impl" would not be created because of this code: https://github.com/intel/llvm/blob/f6420c769d473c00be736b8df361e10dd98a6009/sycl/source/detail/platform_impl.cpp#L219

If a device is created via the interop API it will always create a new device_impl even when there already exists a device_impl with the same backend handle.

I do not say it is not a problem, but where in the SYCL 2020 OpenCL backend spec does it say what should be returned by "make_device"? Do we need to update it?

> Note that I expect there could be similar uniqueness issues for other interop types such as context. We will be considering such issues in https://github.com/intel/llvm/issues/6055.

Why context? I think only platform and device should behave this way, no? The others would create unique objects.

> I think interop tests based on the code sample in https://github.com/intel/llvm/issues/6055 would be useful. I can add such a test for device to intel/llvm-test-suite.

Yes, please add such test(s).

smaslov-intel avatar May 30 '22 16:05 smaslov-intel

> Why is this not relying on the existing device-cache? I would expect that the new "device_impl" would not be created because of this code:
>
> https://github.com/intel/llvm/blob/f6420c769d473c00be736b8df361e10dd98a6009/sycl/source/detail/platform_impl.cpp#L219
>
> If a device is created via the interop API it will always create a new device_impl even when there already exists a device_impl with the same backend handle.
>
> I do not say it is not a problem, but where in the SYCL 2020 OpenCL backend spec does it say what should be returned by "make_device"? Do we need to update it?
>
>> Note that I expect there could be similar uniqueness issues for other interop types such as context. We will be considering such issues in #6055.
>
> Why context? I think only platform and device should behave this way, no? The others would create unique objects.
>
>> I think interop tests based on the code sample in #6055 would be useful. I can add such a test for device to intel/llvm-test-suite.
>
> Yes, please add such test(s).

Since this patch, it does rely on the device-cache in the platform class. You can see here: https://github.com/intel/llvm/blob/3c1d342e3ddf75e040cd73675998be7d7b5c8baf/sycl/source/backend.cpp#L64 that createSyclObjFromImpl<device> does not call the Platform->getOrMakeDeviceImpl method that checks the device-cache. This is why we needed to call device::get_devices, which does check the device-cache and, as far as I can tell, is the only existing route that checks the device caches of all platforms within a backend.
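In effect, the check introduced in this PR does roughly the following (a simplified sketch of the logic, not the literal patch; make_device_unique is a hypothetical free-standing rendering, and the handle comparison glosses over the per-backend interop types):

```cpp
#include <sycl/sycl.hpp>

// Sketch: before wrapping a native handle in a fresh device_impl, look
// for an existing SYCL device of the same backend that already wraps it.
template <sycl::backend Backend, typename NativeHandle>
sycl::device make_device_unique(NativeHandle Handle) {
  for (const sycl::device &Dev : sycl::device::get_devices())
    if (Dev.get_backend() == Backend &&
        sycl::get_native<Backend>(Dev) == Handle)
      return Dev; // reuse the existing device (and its device_impl)
  // No cached match: fall back to creating a new device from the handle.
  return sycl::make_device<Backend>(Handle);
}
```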

I think you're right that this use-case and expected behaviour is not described in the OpenCL backend spec. However, the runtime currently allows the creation of a second shared_ptr that takes ownership of the native object in the sample code here: https://github.com/intel/llvm/issues/6055. It doesn't seem to be a good idea to let multiple independent shared_ptrs own the same object if it can be helped. What do you think?
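To illustrate the hazard in plain C++ (nothing SYCL-specific; this tiny program deliberately exhibits the double-owner bug):

```cpp
#include <memory>

int main() {
  int *Raw = new int(42);
  std::shared_ptr<int> A(Raw); // first owner, with its own reference count
  {
    std::shared_ptr<int> B(Raw); // second, independent owner: separate count
  } // B's count hits zero here and deletes Raw
  // A still believes it owns Raw; when A is destroyed the object is
  // freed a second time: a double free, i.e. undefined behavior.
}
```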

I don't see any difference between the above motivation for device and the context case; however, a sycl::context maps to multiple CUDA contexts, so I'm not sure we can easily do much about this issue for context, at least for the CUDA backend (the same is true of queue and event). We are still finalizing the correct behaviour of interop within the CUDA backend. For the CUDA backend, the most straightforward interop case is device.

JackAKirk avatar May 31 '22 10:05 JackAKirk

> that createSyclObjFromImpl<device> does not call the Platform->getOrMakeDeviceImpl method that checks the device-cache. This is why we needed to call device::get_devices, which does check the device-cache and, as far as I can tell, is the only existing route that checks the device caches of all platforms within a backend.

That's why I am asking: should we rather change "make_device" to be aware of the device-cache, instead of using get_devices, which needlessly reads all the devices?

> I think you're right that this use-case and expected behaviour is not described in the OpenCL backend spec.

Would you spawn a spec issue to clarify this, please?

smaslov-intel avatar Jun 01 '22 14:06 smaslov-intel

>> that createSyclObjFromImpl<device> does not call the Platform->getOrMakeDeviceImpl method that checks the device-cache. This is why we needed to call device::get_devices, which does check the device-cache and, as far as I can tell, is the only existing route that checks the device caches of all platforms within a backend.
>
> That's why I am asking: should we rather change "make_device" to be aware of the device-cache, instead of using get_devices, which needlessly reads all the devices?

MDeviceCache is a private member of platform_impl, so I thought it could be controversial to let make_device access it directly. It is true that if make_device were to access it directly the implementation could be made a bit more efficient, if that is worth it. Unless you meant iterating, directly within make_device, over only the devices that are formed from platforms within the specified backend, as I have done in the latest commit, rather than overloading device::get_devices to take a backend argument?

>> I think you're right that this use-case and expected behaviour is not described in the OpenCL backend spec.
>
> Would you spawn a spec issue to clarify this, please?

OK

JackAKirk avatar Jun 02 '22 15:06 JackAKirk

> MDeviceCache is a private member of platform_impl, so I thought it could be controversial to let make_device access it directly. It is true that if make_device were to access it directly the implementation could be made a bit more efficient, if that is worth it. Unless you meant iterating, directly within make_device, over only the devices that are formed from platforms within the specified backend, as I have done in the latest commit, rather than overloading device::get_devices to take a backend argument?

The platform.get_devices() uses platform_impl.getOrMakeDeviceImpl to do the caching. I think having make_device similarly obtain a device impl through platform_impl.getOrMakeDeviceImpl makes for a more consistent (and performant) implementation.
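For reference, the get-or-create pattern it implements looks roughly like this (a stripped-down sketch with simplified types; the real code uses the plugin-interface handle types, and the mutex name is part of the sketch):

```cpp
#include <memory>
#include <mutex>
#include <vector>

// Simplified stand-ins for the real types in platform_impl.cpp.
struct device_impl {
  explicit device_impl(void *Handle) : MHandle(Handle) {}
  void *getHandleRef() const { return MHandle; }
  void *MHandle;
};

class platform_impl {
  std::mutex MDeviceMapMutex;
  std::vector<std::shared_ptr<device_impl>> MDeviceCache;

public:
  std::shared_ptr<device_impl> getOrMakeDeviceImpl(void *PiDevice) {
    std::lock_guard<std::mutex> Guard(MDeviceMapMutex);
    // 1. Return the cached impl if this handle has been seen before.
    for (const auto &Impl : MDeviceCache)
      if (Impl->getHandleRef() == PiDevice)
        return Impl;
    // 2. Otherwise make the impl and cache it.
    auto Result = std::make_shared<device_impl>(PiDevice);
    MDeviceCache.emplace_back(Result);
    return Result;
  }
};
```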

smaslov-intel avatar Jun 05 '22 14:06 smaslov-intel

>> MDeviceCache is a private member of platform_impl, so I thought it could be controversial to let make_device access it directly. It is true that if make_device were to access it directly the implementation could be made a bit more efficient, if that is worth it. Unless you meant iterating, directly within make_device, over only the devices that are formed from platforms within the specified backend, as I have done in the latest commit, rather than overloading device::get_devices to take a backend argument?
>
> The platform.get_devices() uses platform_impl.getOrMakeDeviceImpl to do the caching. I think having make_device similarly obtain a device impl through platform_impl.getOrMakeDeviceImpl makes for a more consistent (and performant) implementation.

I don't think we can directly use platform_impl.getOrMakeDeviceImpl, because a backend can contain more than one platform. Therefore, if we do not find the device in MDeviceCache, we do not want to call:

```cpp
  // Otherwise make the impl
  std::shared_ptr<device_impl> Result =
      std::make_shared<device_impl>(PiDevice, PlatformImpl);
  MDeviceCache.emplace_back(Result);
```

until we have checked all platforms in the backend.
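Concretely, the search has to sweep every platform of the backend before anything is created, roughly along these lines (pseudo-code against the internal API; getDeviceImpl is the new lookup-only member described just below, and findInBackend is a hypothetical name):

```cpp
// Sketch only: sweep the device caches of all platforms in the backend.
std::shared_ptr<detail::device_impl> findInBackend(backend Backend,
                                                   RT::PiDevice PiDevice) {
  for (const platform &Plt : platform::get_platforms()) {
    if (Plt.get_backend() != Backend)
      continue; // only platforms of this backend can own the handle
    if (auto Found = detail::getSyclObjImpl(Plt)->getDeviceImpl(PiDevice))
      return Found;
  }
  return nullptr; // only now is it safe to create a new device_impl
}
```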

I've followed through with your suggestion in a rough draft implementation using a new member, platform_impl.getDeviceImpl, which performs only the first (lookup) part of platform_impl.getOrMakeDeviceImpl. My latest commit fixes the assertion failure in:

```cpp
#include <sycl/sycl.hpp>

#include <cassert>

int main() {
  sycl::device sdev(sycl::default_selector{});
  auto ocl_dev = sycl::get_native<sycl::backend::opencl>(sdev);
  // Wrap the native handle back into a SYCL device via the interop API:
  sycl::device dev(sycl::make_device<sycl::backend::opencl>(ocl_dev));
  assert(sdev == dev);
}
```
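For clarity, the lookup-only member referenced above might look roughly like this (a sketch with simplified types, mirroring just the first half of getOrMakeDeviceImpl):

```cpp
// Sketch: perform only the cache search from getOrMakeDeviceImpl and
// return nullptr instead of creating a new impl on a miss.
std::shared_ptr<device_impl> platform_impl::getDeviceImpl(void *PiDevice) {
  std::lock_guard<std::mutex> Guard(MDeviceMapMutex);
  for (const auto &Impl : MDeviceCache)
    if (Impl->getHandleRef() == PiDevice)
      return Impl;
  return nullptr; // the caller may still find it in another platform's cache
}
```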

However if we have something like this (perhaps unusual) case:

#include "cuda_helper.hpp"
#define SYCL_EXT_ONEAPI_BACKEND_CUDA_EXPERIMENTAL 1
#include <sycl/ext/oneapi/experimental/backend/cuda.hpp>
#include <sycl/sycl.hpp>

int main() {
  sycl::device sdev(sycl::default_selector{});
  CUdevice m_cu_device;
  cuDeviceGet(&m_cu_device, 0);
  CUDA_CHECK(cuDeviceGet(&m_cu_device, 0));
  sycl::device dev1(
      sycl::make_device<sycl::backend::ext_oneapi_cuda>(m_cu_device));

  auto m_cu_device_clone =
      sycl::get_native<sycl::backend::ext_oneapi_cuda>(dev1);
  sycl::device dev1_clone(
      sycl::make_device<sycl::backend::ext_oneapi_cuda>(m_cu_device_clone));

  assert(dev1 == dev1_clone);
}

we will hit the assert, because we have not updated MDeviceCache (first we would have to find the platform associated with the device_impl that was made by the plugin). So we will need to create another route to find the correct platform and update platform_impl.MDeviceCache. This change would also make platform_impl.getOrMakeDeviceImpl redundant: platform_impl::get_devices should then call platform_impl.getDeviceImpl instead, and platform_impl.getOrMakeDeviceImpl should be removed.

I'm not sure that make_device is performance-critical code, since I don't imagine it will be called many times per application; perhaps I am wrong? This solution could be improving the performance of non-performance-critical code at the expense of reduced encapsulation of platform_impl, additional class methods on platform_impl, and probably worse code readability. I want to check that I did not misunderstand what you meant: do you have a better solution? Is this still the correct direction? Or is my original solution that made use of platform::get_devices (which I have added to the CUDA implementation in the last commit for reference, beginning from https://github.com/intel/llvm/blob/6df9b9ac8118065b050c44b03717591456ccd2bb/sycl/include/sycl/ext/oneapi/experimental/backend/cuda.hpp#L78) better?

Thanks

JackAKirk avatar Jul 14 '22 15:07 JackAKirk

@smaslov-intel What do you think about my above comment/questions?

JackAKirk avatar Aug 01 '22 16:08 JackAKirk

@smaslov-intel friendly ping on this, what are your thoughts on @JackAKirk's comments above?

AerialMantis avatar Sep 06 '22 12:09 AerialMantis