nolearn_utils
Improve concurrency for real-time augmentation
To prevent starving the GPU when using heavy real-time augmentation, BufferedBatchIteratorMixin is available; it uses another process to build up a queue of augmented training samples. Most of the time, this works well enough.
However, there are cases in which augmentation takes longer than the GPU forward-backward pass per batch. In other words, the CPU cannot keep up with the GPU.
The solution is to utilize multiple CPU cores to perform real-time augmentation. However, making this work efficiently does not seem trivial, because we would need to pickle the iterator, which is defined via __iter__, and generator iterators cannot be pickled.
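To illustrate the pickling constraint: Python refuses to pickle generator iterators, so the batch iterator object cannot simply be shipped to worker processes as-is. A minimal demonstration (the generator name here is just a stand-in for the real iterator):

```python
import pickle

def sample_iterator():
    # Stand-in for the batch iterator's __iter__ generator.
    yield from range(3)

try:
    pickle.dumps(sample_iterator())
except TypeError as exc:
    # Generator objects are not picklable.
    print("unpicklable:", exc)
```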
An implementation might do something similar to https://gist.github.com/ebenolson/072712792c46aa192797 and handle IPC ourselves via /run/shm/. https://pypi.python.org/pypi/SharedArray might help as well.
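The shared-memory idea can be sketched with the standard library's `multiprocessing.shared_memory` (a stdlib analogue of what SharedArray provides; all names below are illustrative, not from the library in question): a numpy array is backed by a named shared-memory segment, which another process can attach to by name without copying the data.

```python
import numpy as np
from multiprocessing import shared_memory

# Create a named shared-memory segment and view it as a numpy array.
shm = shared_memory.SharedMemory(create=True, size=4 * 8)
arr = np.ndarray((4,), dtype=np.float64, buffer=shm.buf)
arr[:] = [1.0, 2.0, 3.0, 4.0]

# Another process would attach to the same segment by name.
attached = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray((4,), dtype=np.float64, buffer=attached.buf)
result = view.tolist()
print(result)

# Drop the array views before closing, or close() raises BufferError.
del arr, view
attached.close()
shm.close()
shm.unlink()
```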
Ultimately, the idea is to follow a producer-consumer pattern: workers generate augmented training samples and send them to a master process, which assembles the samples into batches and feeds them to the GPU.
There would be two batch-size parameters: one for the GPU and one for the workers.
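A minimal queue-based sketch of this producer-consumer design, using Python's multiprocessing (this uses queues rather than shared memory, and all names, sizes, and the toy augmentation are hypothetical):

```python
import multiprocessing as mp
import numpy as np

WORKER_BATCH_SIZE = 4  # hypothetical: chunk size each worker augments
GPU_BATCH_SIZE = 8     # hypothetical: batch size fed to the GPU

def augment(sample):
    # Placeholder for a real augmentation (flip, rotate, noise, ...).
    return sample * 2

def worker(in_q, out_q):
    # Producer: augment raw chunks and send samples to the master.
    for chunk in iter(in_q.get, None):
        out_q.put([augment(s) for s in chunk])
    out_q.put(None)  # signal that this worker is done

def gpu_batches(data, n_workers=2):
    # Consumer/master: distribute work, then reassemble GPU-sized batches.
    in_q, out_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(in_q, out_q))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for i in range(0, len(data), WORKER_BATCH_SIZE):
        in_q.put(data[i:i + WORKER_BATCH_SIZE])
    for _ in procs:
        in_q.put(None)  # one stop sentinel per worker

    buf, done = [], 0
    while done < n_workers:
        chunk = out_q.get()
        if chunk is None:
            done += 1
            continue
        buf.extend(chunk)
        while len(buf) >= GPU_BATCH_SIZE:
            yield np.stack(buf[:GPU_BATCH_SIZE])  # feed to the GPU here
            buf = buf[GPU_BATCH_SIZE:]
    if buf:
        yield np.stack(buf)  # final partial batch
    for p in procs:
        p.join()

batches = list(gpu_batches([np.ones(3) for _ in range(16)]))
print(len(batches), batches[0].shape)
```

Decoupling the two batch sizes this way lets workers produce small chunks at their own pace while the master keeps the GPU fed with full-sized batches.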
@dnouri, I would appreciate it if you could offer some advice here.