robosat
Benchmark using pinned CUDA memory in data loaders
At the moment the data loaders load images from the dataset, do pre-processing (like normalization), and then convert the images into tensors. Then we copy the data from CPU memory to GPU memory. This can be made more efficient by putting the data into page-locked memory and using DMA to transfer it onto the GPU asynchronously.
Look into the functionality for pinning memory and for asynchronous, non-blocking data transfers:
- https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
- https://pytorch.org/docs/stable/notes/cuda.html#use-pinned-memory-buffers
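Taken together, the two flags look like this. A minimal sketch using a stand-in `TensorDataset` rather than robosat's actual loaders; the variable names are illustrative:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in dataset; robosat's own datasets would go here.
images = torch.randn(16, 3, 64, 64)
masks = torch.randint(0, 2, (16,))
dataset = TensorDataset(images, masks)

# pin_memory=True asks the loader to place each batch in page-locked
# (pinned) host memory, which is what enables DMA transfers to the GPU.
loader = DataLoader(
    dataset,
    batch_size=4,
    pin_memory=torch.cuda.is_available(),  # pinning only helps with CUDA
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for batch_images, batch_masks in loader:
    # non_blocking=True lets the host-to-device copy overlap with host
    # computation; it only has an effect when the source tensor is pinned.
    batch_images = batch_images.to(device, non_blocking=True)
    batch_masks = batch_masks.to(device, non_blocking=True)
    # ... forward / backward pass here ...
```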
Note: the last time we used this we ran into some PyTorch-internal deadlocks. We need to carefully evaluate this, benchmark it, and figure out if it makes sense to go this route.
Tasks:
- [x] Check out docs for cuda semantics
- [x] Change memory copying behavior
- [ ] Benchmark and test for both training as well as prediction
I tested pinned memory leading to DMA copies in non-blocking mode, but I did not see any improvement on my 6x GTX 1080 Ti rig, where the bus seems to be the limiting factor.
Leaving this open in case anyone has a sandbox environment to see if it improves things.
To reproduce
- use the pinned memory flag in all data loaders (train and predict tool)
- use the non-blocking flag when copying tensors in the train/predict tool
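For anyone with such a sandbox, a micro-benchmark along these lines would isolate the copy path. A sketch only; the tensor size and iteration count are arbitrary:

```python
import time

import torch

def time_copies(pinned: bool, n_iters: int = 20) -> float:
    """Time host-to-device copies of one batch-sized tensor."""
    src = torch.randn(8, 3, 512, 512)
    if pinned:
        src = src.pin_memory()  # move into page-locked memory first
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        src.to("cuda", non_blocking=True)
    # Wait for all outstanding async copies before stopping the clock.
    torch.cuda.synchronize()
    return time.perf_counter() - start

if torch.cuda.is_available():
    print(f"pageable: {time_copies(False):.4f}s  pinned: {time_copies(True):.4f}s")
```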
@daniel-j-h I confirm I didn't see any significant perf improvement through the PyTorch CUDA pinned-memory setting.
On the other hand, among the points identified as mattering for training:
- Several DataLoader processes (to make sure the GPU is at about ~100% utilization)
- A more efficient data augmentation step (switching to Albumentations and removing tile buffering keeps accuracy but is about 3x faster)
HTH,
Here is why we didn't see any improvement with `pin_memory=True`:
- all our datasets return tuples, e.g. the tile tensor but then also the tile z, x, y ids
- the default PyTorch mechanism only supports pinning tensors directly
Per https://pytorch.org/docs/stable/data.html#memory-pinning
> The default memory pinning logic only recognizes Tensors and maps and iterables containing Tensors. By default, if the pinning logic sees a batch that is a custom type (which will occur if you have a collate_fn that returns a custom batch type), or if each element of your batch is a custom type, the pinning logic will not recognize them, and it will return that batch (or those elements) without pinning the memory. To enable memory pinning for custom batch or data type(s), define a pin_memory() method on your custom type(s).
Therefore, even if we set `pin_memory=True`, it will just silently fail: the batch is returned without pinning. The solution is to write a custom batch type with its own `pin_memory` method that pins its tensors.
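Following the docs' recipe, a sketch of such a type for batches of (image, tile-id) tuples could look like this. The `TileBatch` name, the tuple layout, and the sample values are illustrative, not robosat's actual code:

```python
import torch
from torch.utils.data import DataLoader

class TileBatch:
    """Batch of (image tensor, (z, x, y)) samples with custom pinning."""

    def __init__(self, samples):
        images, ids = zip(*samples)
        self.images = torch.stack(images, 0)
        self.ids = list(ids)  # plain Python tile ids; nothing to pin here

    def pin_memory(self):
        # Called by the DataLoader's pinning thread when pin_memory=True;
        # only the tensor part of the batch gets page-locked.
        self.images = self.images.pin_memory()
        return self

# A list of tuples works as a map-style dataset for illustration.
samples = [(torch.randn(3, 64, 64), (18, 69105, 105841 + i)) for i in range(8)]
loader = DataLoader(
    samples,
    batch_size=4,
    collate_fn=TileBatch,  # the collate_fn returns our custom batch type
    pin_memory=torch.cuda.is_available(),
)
batch = next(iter(loader))
```

With this in place the pinning logic dispatches to `TileBatch.pin_memory()` instead of silently returning the batch unpinned.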
@ocourtin re:
> On the other hand, among the points identified as mattering for training:
> - Several DataLoader processes (to make sure the GPU is at about ~100% utilization)
> - A more efficient data augmentation step (switching to Albumentations and removing tile buffering keeps accuracy but is about 3x faster)
Adding here: switching to libjpeg-turbo and Pillow-SIMD gave me a huge boost during pre-processing.
See https://github.com/mapbox/robosat/pull/180