
Improve concurrency for real-time augmentation

Open felixlaumon opened this issue 8 years ago • 0 comments

To prevent starving the GPU when using heavy real-time augmentation, BufferedBatchIteratorMixin uses a separate process to build up a queue of augmented training samples. Most of the time, this works well enough.
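The single-process buffering idea can be sketched as follows. This is a minimal illustration, not the actual BufferedBatchIteratorMixin implementation; it uses a background thread and a bounded queue, and `buffered` is a hypothetical helper name:

```python
import queue
import threading

def buffered(generator, buffer_size=8):
    """Run `generator` in a background worker, keeping up to
    `buffer_size` items ready for the consumer (here: the GPU loop)."""
    q = queue.Queue(maxsize=buffer_size)
    _end = object()  # sentinel marking exhaustion

    def producer():
        for item in generator:
            q.put(item)  # blocks when the buffer is full
        q.put(_end)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _end:
            return
        yield item

# toy "augmentation" that would otherwise run inline with training
batches = buffered((x * x for x in range(5)), buffer_size=2)
print(list(batches))  # → [0, 1, 4, 9, 16]
```

As long as a single producer keeps the queue non-empty, the consumer never waits; the problem described below starts when one producer is too slow.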

However, there are cases in which the augmentation takes longer than the GPU forward-backward pass per batch. In other words, the CPU cannot keep up with the GPU.

The solution is to utilize multiple CPU cores to perform real-time augmentation. However, this doesn't seem to be trivial to do efficiently, because the iterator (an object defining __iter__) would need to be pickled to be shared across processes.
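The pickling obstacle is easy to demonstrate: CPython refuses to pickle generator (and generator-backed iterator) objects, so the live iterator state cannot simply be shipped to worker processes:

```python
import pickle

def augment_stream(samples):
    """Stand-in for a real-time augmentation pipeline."""
    for s in samples:
        yield s * 2

gen = augment_stream([1, 2, 3])
try:
    pickle.dumps(gen)
except TypeError as e:
    # CPython raises TypeError: cannot pickle 'generator' object
    print("cannot pickle:", e)
```

This is why the workers must be given picklable work items (e.g. raw samples or indices) rather than the iterator itself.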

An implementation might be to do something similar to https://gist.github.com/ebenolson/072712792c46aa192797 and handle IPC ourselves with /run/shm/. Also https://pypi.python.org/pypi/SharedArray might help as well.
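As a sketch of the shared-memory idea (passing only a block name between processes, not the array payload itself): the stdlib `multiprocessing.shared_memory` module, available since Python 3.8, provides roughly what the linked gist and SharedArray achieve via /run/shm/. This is an alternative mechanism, not what the gist uses:

```python
from multiprocessing import shared_memory

# Writer side: place an augmented batch in a named shared block
buf = shared_memory.SharedMemory(create=True, size=4)
buf.buf[:4] = bytes([1, 2, 3, 4])

# Reader side: attach by name -- only buf.name would cross the
# process boundary, not the data itself
view = shared_memory.SharedMemory(name=buf.name)
print(bytes(view.buf[:4]))  # → b'\x01\x02\x03\x04'

view.close()
buf.close()
buf.unlink()  # free the block once both sides are done
```

In practice the workers would write augmented numpy arrays into such blocks and send only the block names through a queue to the master.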

Ultimately, the idea is to follow a producer-consumer pattern. Workers will generate augmented training samples and send them to a master process. The master process will assemble the samples into batches and feed them to the GPU.

There will be two batch size parameters: one for the GPU and one for the workers.
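A minimal sketch of this producer-consumer design, assuming `multiprocessing.Pool` as the worker mechanism (the names `augment`, `batches`, `gpu_batch_size`, and `worker_chunk_size` are hypothetical). The pool's `chunksize` plays the role of the worker-side batch size, while the master re-batches the augmented stream to the GPU batch size:

```python
import itertools
import multiprocessing as mp

def augment(sample):
    """Worker-side augmentation (stand-in: just squares the value)."""
    return sample * sample

def batches(iterable, size):
    """Master-side: assemble a sample stream into GPU-sized batches."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

if __name__ == "__main__":
    samples = range(10)
    gpu_batch_size = 4      # batch size fed to the GPU
    worker_chunk_size = 2   # samples handed to each worker at a time
    with mp.Pool(processes=2) as pool:
        # imap preserves input order, so batches arrive deterministically
        augmented = pool.imap(augment, samples, chunksize=worker_chunk_size)
        for batch in batches(augmented, gpu_batch_size):
            print(batch)  # → [0, 1, 4, 9], [16, 25, 36, 49], [64, 81]
```

Decoupling the two sizes lets the worker chunk be tuned for IPC overhead while the GPU batch is tuned for memory and throughput.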

@dnouri, I would appreciate it if you could offer some advice here.

felixlaumon avatar Mar 17 '16 09:03 felixlaumon