
[SYCL][OpenCL] Host accessors wait for barriers when not needed

al42and opened this issue 4 years ago · 6 comments

Describe the bug

I want to submit many tasks to device queues asynchronously, minimizing CPU time:

  • Do some operations on bufferB in queueB and synchronize with it. Apparently, this binds the buffer to the queue?
  • Submit a long-running task to queueA that does not touch bufferB.
  • Submit a barrier to queueB that waits for this task.
  • Request a discard_write host accessor to bufferB and write some data to it.

The last step stalls the CPU while waiting for the preceding kernel to complete, even though that kernel does not touch the buffer in question and I did not explicitly request the use of queueB. Apparently, the constructor of the host accessor calls clEnqueueMapBuffer with the barrier in its event_wait_list.
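
For illustration, the suspected call on the OpenCL side would look roughly like this. This is a hedged sketch of what the runtime appears to issue, not actual DPC++ source; the names barrier_event, queue_b, buffer_b, and size_in_bytes are hypothetical:

// Sketch: the host-accessor constructor maps bufferB with the barrier's
// event in its wait list, so the blocking map cannot return before
// KernelA completes, even though KernelA never touches bufferB.
cl_event wait_list[] = {barrier_event}; // the barrier depends on KernelA
cl_int err;
void *host_ptr = clEnqueueMapBuffer(
    queue_b, buffer_b, CL_TRUE /* blocking_map */,
    CL_MAP_WRITE_INVALIDATE_REGION, 0, size_in_bytes,
    /*num_events_in_wait_list=*/1, wait_list, nullptr, &err);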

If instead I create a host std::vector, fill it, and then use cgh.copy(host_ptr, accessor), the copy is submitted asynchronously and control returns to the CPU thread quickly.

This only happens with the OpenCL backend; Level Zero runs fine.

To Reproduce

#include <CL/sycl.hpp>
#include <chrono>
#include <iostream>
#include <vector>

using cl::sycl::access::mode;

int main() {
  cl::sycl::device dev(cl::sycl::gpu_selector{});
  cl::sycl::queue qA{dev}, qB{dev};

  constexpr int size = 128 * 128 * 128;
  cl::sycl::buffer<int, 1> bufferA(size), bufferB(size);

  const cl::sycl::nd_range<1> range1D{{size}, {128}};
  std::vector<int> x(size);

  // Fill bufferB via queueB.
  qB.submit([&](cl::sycl::handler &cgh) {
    auto b = bufferB.get_access<mode::discard_write>(cgh);
    cgh.parallel_for<class KernelNameB>(
        range1D, [=](cl::sycl::nd_item<1> itemIdx) {
          int i = itemIdx.get_global_linear_id();
          b[i] = i;
        });
  });

  // Wait for queueB to finish all the required tasks.
  qB.wait_and_throw();

  // Submit kernelA, operating on bufferA, to queueA.
  // It does not matter what exactly it computes; it just needs to run for a while.
  cl::sycl::event e = qA.submit([&](cl::sycl::handler &cgh) {
    auto a = bufferA.get_access<mode::write>(cgh);
    cgh.parallel_for<class KernelA>(range1D, [=](cl::sycl::nd_item<1> itemIdx) {
      int i = itemIdx.get_global_linear_id();
      int buf = 1;
      for (int j = 0; j < 51200; j++) {
        buf += i + j;
      }
      a[i] = buf;
    });
  });

  // Insert a barrier waiting for this kernel into queueB.
  const std::vector<cl::sycl::event> waitlist{e};
  qB.submit_barrier(waitlist);

  auto t_bufferResetSubmissionStart = std::chrono::system_clock::now();
#ifdef ACCESSOR
  // Try to use host accessor to write to the bufferB
  {
    // This will wait for KernelA
    auto h_bufferB = bufferB.get_access<mode::discard_write>();
    for (int i = 0; i < size; i++) {
      h_bufferB[i] = i;
    }
  }
#else
  // Manually fill a host buffer and copy it to bufferB
  for (int i = 0; i < size; i++) {
    x[i] = i;
  }
  // The copy will not start before kernelA completes, but at least the CPU
  // thread will move on.
  qB.submit([&](cl::sycl::handler &cgh) {
    auto acc = bufferB.get_access<mode::discard_write>(cgh);
    cgh.copy(x.data(), acc);
  });
#endif
  auto t_bufferResetSubmissionEnd = std::chrono::system_clock::now();

  std::cout
      << "Buffer reset submission took "
      << (t_bufferResetSubmissionEnd - t_bufferResetSubmissionStart).count()
      << std::endl;
  qA.wait_and_throw();
  qB.wait_and_throw();

  return 0;
}

# Using the manual buffer copy: similar timings with OpenCL and Level Zero
clang++ -fsycl sycl_opencl_scheduling.cpp -O2 -g -o test && SYCL_DEVICE_FILTER=opencl:gpu ./test
# >>> Buffer reset submission took 992008
clang++ -fsycl sycl_opencl_scheduling.cpp -O2 -g -o test && SYCL_DEVICE_FILTER=level_zero:gpu ./test
# >>> Buffer reset submission took 1166178

# Using the host accessor: Level Zero is a bit faster, but OpenCL is much slower due to the wait
clang++ -fsycl -DACCESSOR=1 sycl_opencl_scheduling.cpp -O2 -g -o test && SYCL_DEVICE_FILTER=opencl:gpu ./test
# >>> Buffer reset submission took 283846156
clang++ -fsycl -DACCESSOR=1 sycl_opencl_scheduling.cpp -O2 -g -o test && SYCL_DEVICE_FILTER=level_zero:gpu ./test
# >>> Buffer reset submission took 648947

Environment (please complete the following information):

  • OS: Linux (Ubuntu 20.10)
  • Target device and vendor: Intel(R) Iris(R) Xe Graphics [0x9a49]
  • DPC++ version: clang version 13.0.0 (https://github.com/intel/llvm c4d08f5f7bfcdb2b9d0866bb7a430d6bbfd4c747)
  • Dependencies version: Intel compute-runtime 21.12.19358, level_zero 1.0.19358

al42and avatar Apr 13 '21 14:04 al42and

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

github-actions[bot] avatar Feb 17 '22 02:02 github-actions[bot]

Issue still present in IntelLLVM 962417da7c70e9c94ff767b05f266fa1add42f83, compute-runtime 22.04.22286.

al42and avatar Mar 04 '22 20:03 al42and

> Apparently, the constructor of the host accessor calls clEnqueueMapBuffer with the barrier in its event_wait_list.

Could you please explain why you think so? Looking at logs produced with the SYCL_PI_TRACE environment variable, I see that only the first kernel is in the dependency list for clEnqueueMapBuffer.
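
For reference, such a trace can be collected like this (SYCL_PI_TRACE=2 traces all PI calls in intel/llvm builds):

$ SYCL_PI_TRACE=2 SYCL_DEVICE_FILTER=opencl:gpu ./test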

> If instead I create a host std::vector, fill it, and then use cgh.copy(host_ptr, accessor), the copy is submitted asynchronously and control returns to the CPU thread quickly.

Right, the copy operation is asynchronous and implies a host-to-device copy only. Host accessor creation and destruction are synchronous operations and involve device-to-host + host-to-device copies in the worst case.
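
In OpenCL terms, the worst case looks roughly like this (a hedged sketch of the map/unmap semantics, not the actual runtime code; queue, buf, and size_in_bytes are placeholder names):

// Construction: the map may trigger a device-to-host copy and blocks the host.
cl_int err;
void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE /* blocking_map */,
                             CL_MAP_READ | CL_MAP_WRITE, 0, size_in_bytes,
                             0, nullptr, nullptr, &err);
// ... host code reads and writes through p ...
// Destruction: the unmap may trigger a host-to-device copy, which the
// accessor's destructor waits on.
clEnqueueUnmapMemObject(queue, buf, p, 0, nullptr, nullptr);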

Also, it can look like the host accessor construction waits for the second kernel simply because the GPU serializes kernel execution and memory copies.

Could you please clarify what the expectations are here?

romanovvlad avatar Mar 09 '22 12:03 romanovvlad

> Also, it can look like the host accessor construction waits for the second kernel simply because the GPU serializes kernel execution and memory copies.

That's correct, but I don't see why the access is being serialized here. With Level Zero, the copy and the kernel execution overlap just fine; the problem only occurs with OpenCL.

And if we create a third queue qC{dev}, submit the first kernel to it, and never use this queue again, the host access takes 66 953 884 ticks (vs. 421 272 145 when that kernel is submitted to qB as in the example above); see the sketch below. So it is not that the copy and the kernel execution cannot overlap: they fail to overlap only if the buffer was ever "touched" by a queue, even when that queue no longer operates on the buffer.

Of course, such behavior is not forbidden by the spec, but it looks suboptimal.
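
A minimal sketch of the third-queue variant, assuming the reproducer above (only the submission of the bufferB fill changes; everything else stays the same):

// A dedicated queue qC fills bufferB and is never used again, so bufferB
// is not "bound" to qB and the later host access avoids qB's barrier.
cl::sycl::queue qC{dev};
qC.submit([&](cl::sycl::handler &cgh) {
  auto b = bufferB.get_access<mode::discard_write>(cgh);
  cgh.parallel_for<class KernelNameB>(
      range1D, [=](cl::sycl::nd_item<1> itemIdx) {
        int i = itemIdx.get_global_linear_id();
        b[i] = i;
      });
});
qC.wait_and_throw();
// The rest is unchanged: KernelA on qA, the barrier on qB,
// and the host accessor on bufferB.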

al42and avatar Mar 09 '22 15:03 al42and

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

github-actions[bot] avatar Sep 06 '22 02:09 github-actions[bot]

The issue still persists:

Using cgh.copy, the submission latency is similar across all platforms:

$ clang++ -Wno-deprecated-declarations -fsycl-targets=nvptx64-nvidia-cuda,spir64 -fsycl sycl_opencl_scheduling.cpp -O2 -g -o test
$ SYCL_DEVICE_FILTER=level_zero:gpu ./test 
Buffer reset submission took 1086212
$ SYCL_DEVICE_FILTER=opencl:gpu ./test 
Buffer reset submission took 1209601
$ SYCL_DEVICE_FILTER=cuda:gpu ./test 
Buffer reset submission took 1641048

But when using the host accessor, OpenCL waits for KernelA before granting access to bufferB, despite the kernel not using this buffer:

$ clang++ -Wno-deprecated-declarations -fsycl-targets=nvptx64-nvidia-cuda,spir64 -fsycl -DACCESSOR=1 sycl_opencl_scheduling.cpp -O2 -g -o test 
$ SYCL_DEVICE_FILTER=level_zero:gpu ./test 
Buffer reset submission took 762134
$ SYCL_DEVICE_FILTER=opencl:gpu ./test 
Buffer reset submission took 1266143617
$ SYCL_DEVICE_FILTER=cuda:gpu ./test 
Buffer reset submission took 4257067

System info:

$ sycl-ls
[opencl:gpu:0] Intel(R) OpenCL HD Graphics, Intel(R) HD Graphics 530 [0x1912] 3.0 [22.35.24055]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) HD Graphics 530 [0x1912] 1.3 [1.3.24055]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA GeForce GTX 1070 0.0 [CUDA 11.7]
[host:host:0] SYCL host platform, SYCL host device 1.2 [1.2]

$ clang++ --version
clang version 16.0.0 (https://github.com/intel/llvm 461288014c96cc9a1411f46e355811d6c34f2bdb)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/aland/intel-sycl/llvm/build/install//bin

al42and avatar Sep 07 '22 19:09 al42and

Cannot reproduce anymore with 19cd6144aaed64719475c4f4a0ee626bdbc428dc / Compute Runtime 23.35.

al42and avatar Dec 07 '23 19:12 al42and