robosat
Benchmark using pinned CUDA memory in data loaders
At the moment the data loaders load images from the dataset, do pre-processing (like normalization), and then convert the images into tensors. Then we copy the data from CPU memory to GPU memory. This can be made more efficient by putting the data into page-locked memory and using DMA to transfer it onto the GPU asynchronously.
Look into the functionality for pinning memory and for asynchronous, non-blocking data transfers:
- https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
- https://pytorch.org/docs/stable/notes/cuda.html#use-pinned-memory-buffers
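Taken together, the two flags look like this. A minimal sketch using a stand-in `TensorDataset` rather than robosat's actual loaders; the variable names are illustrative:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in dataset; robosat's own datasets would go here.
images = torch.randn(16, 3, 64, 64)
masks = torch.randint(0, 2, (16,))
dataset = TensorDataset(images, masks)

# pin_memory=True asks the loader to place each batch in page-locked
# (pinned) host memory, which is what enables DMA transfers to the GPU.
loader = DataLoader(
    dataset,
    batch_size=4,
    pin_memory=torch.cuda.is_available(),  # pinning only helps with CUDA
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for batch_images, batch_masks in loader:
    # non_blocking=True lets the host-to-device copy overlap with host
    # computation; it only has an effect when the source tensor is pinned.
    batch_images = batch_images.to(device, non_blocking=True)
    batch_masks = batch_masks.to(device, non_blocking=True)
    # ... forward / backward pass here ...
```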
Note: the last time we used this we ran into some PyTorch-internal deadlocks. We need to carefully evaluate this, benchmark it, and figure out if it makes sense to go this route.
Tasks:
- [x] Check out docs for cuda semantics
- [x] Change memory copying behavior
- [ ] Benchmark and test for both training as well as prediction
I tested pinned memory leading to DMA copies in non-blocking mode, but I did not see any improvement on my 6x GTX 1080 Ti rig, where the bus seems to be the limiting factor.
Leaving this open in case anyone has a sandbox environment to see if it improves things.
To reproduce
- use the pinned memory flag in all data loaders (train and predict tool)
- use the non-blocking flag when copying tensors in the train/predict tool
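For anyone with such a sandbox, a micro-benchmark along these lines would isolate the copy path. A sketch only; the tensor size and iteration count are arbitrary:

```python
import time

import torch

def time_copies(pinned: bool, n_iters: int = 20) -> float:
    """Time host-to-device copies of one batch-sized tensor."""
    src = torch.randn(8, 3, 512, 512)
    if pinned:
        src = src.pin_memory()  # move into page-locked memory first
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        src.to("cuda", non_blocking=True)
    # Wait for all outstanding async copies before stopping the clock.
    torch.cuda.synchronize()
    return time.perf_counter() - start

if torch.cuda.is_available():
    print(f"pageable: {time_copies(False):.4f}s  pinned: {time_copies(True):.4f}s")
```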
@daniel-j-h I confirm I didn't see any significant perf improvement through the PyTorch CUDA pinned-memory setting.
On the other hand, among the points identified as mattering for training:
- Several DataLoader processes (to make sure the GPU is at about ~100% utilization)
- A more efficient data augmentation step (switching to Albumentations and removing tile buffering keeps accuracy but is about 3x faster)
HTH,
Here is why we didn't see any improvement with `pin_memory=True`:
- all our datasets return tuples, e.g. the tile tensor but then also the tile z, x, y ids
- the default PyTorch mechanism only supports pinning tensors directly
Per https://pytorch.org/docs/stable/data.html#memory-pinning
> The default memory pinning logic only recognizes Tensors and maps and iterables containing Tensors. By default, if the pinning logic sees a batch that is a custom type (which will occur if you have a collate_fn that returns a custom batch type), or if each element of your batch is a custom type, the pinning logic will not recognize them, and it will return that batch (or those elements) without pinning the memory. To enable memory pinning for custom batch or data type(s), define a pin_memory() method on your custom type(s).
Therefore, even if we set `pin_memory=True`, it will just silently fail: the batch is returned without pinning. The solution is to write a custom batch type with its own `pin_memory` method that pins its tensors.
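Following the docs' recipe, a sketch of such a type for batches of (image, tile-id) tuples could look like this. The `TileBatch` name, the tuple layout, and the sample values are illustrative, not robosat's actual code:

```python
import torch
from torch.utils.data import DataLoader

class TileBatch:
    """Batch of (image tensor, (z, x, y)) samples with custom pinning."""

    def __init__(self, samples):
        images, ids = zip(*samples)
        self.images = torch.stack(images, 0)
        self.ids = list(ids)  # plain Python tile ids; nothing to pin here

    def pin_memory(self):
        # Called by the DataLoader's pinning thread when pin_memory=True;
        # only the tensor part of the batch gets page-locked.
        self.images = self.images.pin_memory()
        return self

# A list of tuples works as a map-style dataset for illustration.
samples = [(torch.randn(3, 64, 64), (18, 69105, 105841 + i)) for i in range(8)]
loader = DataLoader(
    samples,
    batch_size=4,
    collate_fn=TileBatch,  # the collate_fn returns our custom batch type
    pin_memory=torch.cuda.is_available(),
)
batch = next(iter(loader))
```

With this in place the pinning logic dispatches to `TileBatch.pin_memory()` instead of silently returning the batch unpinned.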
@ocourtin re:
> On the other hand, among the points identified as mattering for training:
> - Several DataLoader processes (to make sure the GPU is at about ~100% utilization)
> - A more efficient data augmentation step (switching to Albumentations and removing tile buffering keeps accuracy but is about 3x faster)
Adding here: switching to libjpeg-turbo and Pillow-SIMD gave me a huge boost during pre-processing.
See https://github.com/mapbox/robosat/pull/180