double buffering
For double buffering to work properly, the new strategy of copying data to the device in `provide_external_data` instead of in the iterators requires some changes to how `InputLayer` works. This is a TODO, but it looks like there is another problem:
```
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/training/trainer.py", line 157, in run_it
    net.provide_external_data(next(it))
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/network.py", line 290, in provide_external_data
    self.handler.set_from_numpy(buf, data[name])
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 73, in set_from_numpy
    mem.set(arr.astype(self.dtype))
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/pycuda/gpuarray.py", line 243, in set
    _memcpy_discontig(self, ary, async=async, stream=stream)
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/pycuda/gpuarray.py", line 1190, in _memcpy_discontig
    drv.memcpy_htod(dst.gpudata, src)
LogicError: cuMemcpyHtoD failed: invalid device context
```
CUDA contexts are transferable between threads, so this is weird, unless the thread runs in a different process (which, afaik, isn't the case for Python threads).
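One possible explanation: even a shareable context has to be made *current* on each thread before that thread issues any CUDA calls, which would explain the `invalid device context` error. A minimal sketch of how the data-feeding thread could do that, assuming `ctx` is the context created by the main thread (the function and argument names here are illustrative, not brainstorm's actual trainer code):

```python
import pycuda.driver as drv

def run_it(ctx, it, net):
    ctx.push()  # make the shared context current on this worker thread
    try:
        net.provide_external_data(next(it))
    finally:
        drv.Context.pop()  # detach it again before the thread exits
```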
However, I'm a bit unsure about what double buffering is actually meant to do. Why not just copy the data in a different stream?
Streams might be helpful, but I'm not sure we can get around the threading with streams here, since we also need to run `next()` on the iterator while the forward pass is running (see the sketch below).
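To illustrate the point (a rough sketch with made-up names like `prefetch`, not trainer code): `next(it)` executes arbitrary Python, so only a thread, not a CUDA stream, can overlap it with the forward pass.

```python
import threading

def prefetch(it, out):
    # next(it) runs arbitrary Python (I/O, preprocessing), so it cannot be
    # enqueued on a CUDA stream; only a thread can overlap it with compute.
    out.append(next(it))

data_iter = iter(range(10))  # stand-in for the real data iterator
holder = []
worker = threading.Thread(target=prefetch, args=(data_iter, holder))
worker.start()
# ... forward pass on the current batch would run here ...
worker.join()
next_batch = holder[0]
```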
Also, we need to make sure that copying the data to the device doesn't overwrite the current inputs, because they might still be needed (for the backward pass, for example). So we need to manage a separate chunk of device memory in which to stage the incoming data in the background.
Double buffering is currently broken, because it overwrites the input data while the forward/backward pass is running. This is clearly a problem, because we might still need the old values.
To fix this we would need to extend `net.provide_external_data()` such that it can write to another input buffer (hence the name double buffering :-) ). And then we need to either swap the two buffers or copy from one to the other.
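A minimal sketch of what that could look like (the class and method names are made up for illustration, not existing brainstorm API):

```python
class DoubleBuffer(object):
    """Two device buffers: one read by the pass, one filled in the background."""

    def __init__(self, make_buffer):
        self.active = make_buffer()  # forward/backward pass reads from this one
        self.spare = make_buffer()   # the next batch is staged here meanwhile

    def swap(self):
        # A cheap pointer swap avoids a device-to-device copy.
        self.active, self.spare = self.spare, self.active
```

`provide_external_data()` would then write into `spare`, and the trainer would call `swap()` once the current pass is done with `active`.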
Is anyone willing to do that within the next few days? Otherwise I think we should remove double buffering and tackle it properly after the release (in 10 days!).
We will need a couple of extra ops it seems:

- one to allocate page-locked memory, around `pycuda.driver.pagelocked_empty`
- one to do an async transfer from host to device, around `pycuda.driver.memcpy_htod_async` (or modify our `set_from_numpy` op)

Oh, and the async transfer needs to be in a new stream; see the sketch after this list.
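Roughly like this (a sketch under the assumption of a dedicated copy stream and a second device buffer; `stage_batch` and the buffer names are placeholders, not brainstorm code):

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a default context
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

copy_stream = drv.Stream()  # dedicated stream so the copy overlaps compute

shape, dtype = (32, 128), np.float32
pinned = drv.pagelocked_empty(shape, dtype)  # page-locked host staging buffer
dev_spare = gpuarray.empty(shape, dtype)     # the "background" device buffer

def stage_batch(batch):
    # Copy into pinned memory first: async host-to-device transfers need a
    # page-locked source to actually run asynchronously.
    pinned[...] = batch
    drv.memcpy_htod_async(dev_spare.gpudata, pinned, stream=copy_stream)

# The trainer would call copy_stream.synchronize() and swap buffers
# once the current forward/backward pass is done.
```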
I'm not sure that I will get to this in the next few days though.