v6d icon indicating copy to clipboard operation
v6d copied to clipboard

[WIP] Support GPU `bulkstore` in vineyardd

Open CSWater opened this issue 2 years ago • 1 comments

What do these changes do?

FEATURE-759, try to enable GPU memory object sharing. For now, give a naïve implementation for createGPUBuffer.

Design Brief

  1. Implement GPUBulkStore with the same functionality interface as BulkStore.
  2. Implement a GPU memory allocator, for now, naïve "allocate-on-request" is adopted.
  3. Client and Server add gpu-specific session support.
  4. Server sends the cudaIpcMemHandle to client in reply, client open the handle to get access to the shared GPU memory.

New class:

  • GPUBulkStore: with the same interface as BulkStore.
  • GPUBulkAllocator: with the same interface as BulkAllocator.
  • GPUAllocator: GPU memory allocator, it is left for future work.

Main Related class:

  • Client: add GPU-specific request method, like createGPUBuffer.
  • VineyardServer, SocketConnection: add GPU-specific session support.
  • Payload: add a representation bit is_gpu to distinguish GPU object from CPU object.

Detailed Design

  • GPU Memory
    • Implement GPUBulkStore and GPUBulkAllocator for managing objects on gpu.
    • GPUAlloctor similar to dlmalloc is not implemented.
  • Client
    • Add new method, like createGPUBuffer, send GPU-specific request to server.
    • Client will receive a cudaIpcMemHandle_t handle in reply and open the handle to access to the shared gpu memory .
  • Server
    • add gpu_bulk_store_ for managing gpu objects.
    • add gpu-specific session support and response to client gpu-specific request.
    • gpu object sharing is accomplished by getting a cudaIpcMemHandle handle and sending the handle to client in reply.
  • Client/Server interaction
    • add necessary protocols, like CreateGPUBufferRequest and associated functions.

Related issue number

Feature #759 cc @septicmk

CSWater avatar Jul 24 '22 13:07 CSWater

Hi @CSWater, I have left some comments for you, I suggest you consider the following at an early stage:

  • Make the GPU-feature optional.
  • Allocate the GPU memory on-demand, building a custom GPU allocator is not in the plan.
  • Disable the ColdObjectTracker, since this is another orthogonal under-construction feature.
  • We still need the unit test but don't trigger it in CI since we do not have an online GPU environment.

mengke-mk avatar Jul 25 '22 02:07 mengke-mk

Hi @CSWater, I have following suggestion:

  • We should pushdown the ENABLE_GPU marcos in /gpu to make the regular code path more clear. GPU-related methods are always callable but return Status::NotImplement() when GPU feature is not toggled on.
  • To support heterogeneous objects (blobs are allocated in both GPU and CPU) we should merge GPUBulkStore and BulkStore. You can add something like create_gpu and delete_gpu in the BulkStore.
  • Rebase the code and solve the conflict after you have finalized your code.

mengke-mk avatar Aug 19 '22 01:08 mengke-mk

Codecov Report

Merging #876 (dfcc2a7) into main (d77c3e2) will decrease coverage by 1.72%. The diff coverage is 4.76%.

:exclamation: Current head dfcc2a7 differs from pull request most recent head 420c28c. Consider uploading reports for the commit 420c28c to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #876      +/-   ##
==========================================
- Coverage   70.70%   68.98%   -1.73%     
==========================================
  Files          77       80       +3     
  Lines        8030     8234     +204     
==========================================
+ Hits         5678     5680       +2     
- Misses       2352     2554     +202     
Impacted Files Coverage Δ
src/client/client.cc 65.68% <0.00%> (-2.28%) :arrow_down:
src/client/client.h 100.00% <ø> (ø)
src/common/memory/gpu/unified_memory.cc 0.00% <0.00%> (ø)
src/common/memory/gpu/unified_memory.h 0.00% <0.00%> (ø)
src/common/util/protocols.cc 70.10% <0.00%> (-4.85%) :arrow_down:
src/server/async/socket_server.cc 77.71% <0.00%> (-1.54%) :arrow_down:
src/server/async/socket_server.h 100.00% <ø> (ø)
src/server/memory/gpu/gpuallocator.cc 0.00% <0.00%> (ø)
src/server/memory/memory.h 100.00% <ø> (ø)
src/server/server/vineyard_server.h 100.00% <ø> (ø)
... and 7 more

codecov[bot] avatar Aug 25 '22 08:08 codecov[bot]