
Harness CUDA's shared memory to speed up GPU calculations

Open aktech opened this issue 4 years ago • 0 comments

We'll soon have a GPU implementation of the pairwise distance functionality: https://github.com/pystatgen/sgkit/pull/498

Primer on GPU and the Problem

A GPU's architecture is divided into grids; each grid contains blocks, and each block contains threads, which is where all the computation happens.

During a calculation, each thread reads from DRAM unless it finds the item in a cache. Reading from DRAM is slow (read: very slow).

Example:

Imagine calculating the pairwise distance between a bunch of vectors in a chunk of a 2D array:

[
    v0,
    v1,
    v2,
    v3,
]

Each of these calculations is done by a separate thread:

  • Thread 1 calculates (v0, v1),
  • Thread 2 calculates (v0, v2),

Notice that Thread 1 and Thread 2 both load the v0 array from memory, and so on for the other pairs: the same rows are read from DRAM over and over.

Solution

Numba's CUDA API provides a way to share memory between the threads in a block. This lets a block load data into shared memory once, after which every thread in the block can reuse it instead of re-reading global memory.

Reference API: https://numba.pydata.org/numba-doc/latest/cuda/memory.html#shared-memory-and-thread-synchronization

This exercise could give us a significant speed-up in GPU calculations.

aktech · Apr 06 '21 17:04