Etaler
Optimize data transfer between OpenCL devices.
Currently, copying data between two OpenCL backends is done by:
- Allocating a temporary host buffer
- Copying data from GPU1 to the buffer
- Copying data from the buffer to GPU2
- Releasing the buffer

This is slow. There are more optimized routes, but the mechanism to trigger them is yet to be determined.
Sol 1: Using clEnqueueMapBuffer
- Map the memory from GPU1 to CPU (a pre-pinned DMA transfer)
- Copy data from buffer to GPU2
- Unmap the buffer
Sol 2: Use OpenCL 2.0's Shared Virtual Memory. Host memory is never touched, so this should be very fast.
- Allocate Tensors as SVM buffers
- Ask GPU2 to copy data from GPU1
This should make multi-GPU setups faster.
Sol 3:
- Make a copy of tensor on GPU1 (clCreateBuffer && clEnqueueCopyBuffer)
- Migrate the buffer (clEnqueueMigrateMemObjects) from GPU1 to GPU2

But it is still suboptimal that we need an extra copy of the buffer on GPU1.
Apparently Nvidia does have some OpenCL 2.0 support: https://streamhpc.com/blog/2017-02-22/nvidia-enables-opencl-2-0-beta-support/
It seems I can build OpenCL 2.0 code before grabbing myself a Navi card.