double buffering
For double buffering to work properly, the new strategy of copying data to the device in `provide_external_data` instead of in the iterators requires some changes to how `InputLayer` works. This is a TODO, but it looks like there is another problem:
```
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/training/trainer.py", line 157, in run_it
    net.provide_external_data(next(it))
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/network.py", line 290, in provide_external_data
    self.handler.set_from_numpy(buf, data[name])
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 73, in set_from_numpy
    mem.set(arr.astype(self.dtype))
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/pycuda/gpuarray.py", line 243, in set
    _memcpy_discontig(self, ary, async=async, stream=stream)
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/pycuda/gpuarray.py", line 1190, in _memcpy_discontig
    drv.memcpy_htod(dst.gpudata, src)
LogicError: cuMemcpyHtoD failed: invalid device context
```
CUDA contexts are transferable between threads, so this is weird, unless the thread runs in a different process (which, afaik, isn't the case for Python threads).
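One possible explanation: even a shareable context has to be made *current* on each thread before that thread issues any CUDA calls, which would explain the `invalid device context` error. A minimal sketch of how the data-feeding thread could do that, assuming `ctx` is the context created by the main thread (the function and argument names here are illustrative, not brainstorm's actual trainer code):

```python
import pycuda.driver as drv

def run_it(ctx, it, net):
    ctx.push()  # make the shared context current on this worker thread
    try:
        net.provide_external_data(next(it))
    finally:
        drv.Context.pop()  # detach it again before the thread exits
```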
However, I'm a bit unsure about what double buffering is actually meant to do. Why not just copy the data in a different stream?
Streams might be helpful, but I'm not sure we can get around the threading with streams here, since we also need to run `next()` on the iterator while the forward pass is running (see the sketch below).
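To illustrate the point (a rough sketch with made-up names like `prefetch`, not trainer code): `next(it)` executes arbitrary Python, so only a thread, not a CUDA stream, can overlap it with the forward pass.

```python
import threading

def prefetch(it, out):
    # next(it) runs arbitrary Python (I/O, preprocessing), so it cannot be
    # enqueued on a CUDA stream; only a thread can overlap it with compute.
    out.append(next(it))

data_iter = iter(range(10))  # stand-in for the real data iterator
holder = []
worker = threading.Thread(target=prefetch, args=(data_iter, holder))
worker.start()
# ... forward pass on the current batch would run here ...
worker.join()
next_batch = holder[0]
```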
Also, we need to make sure that copying the data to the device doesn't overwrite the current inputs, because they might still be needed (for the backward pass, for example). So we need to manage a separate chunk of device memory in which to stage the incoming data in the background.
Double buffering is currently broken, because it overwrites the input data while the forward/backward pass is running. This is clearly a problem, because we might still need the old values.
To fix this we would need to extend `net.provide_external_data()` such that it can write to another input buffer (hence the name double buffering :-) ). And then we need to either swap the two buffers or copy from one to the other.
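A minimal sketch of what that could look like (the class and method names are made up for illustration, not existing brainstorm API):

```python
class DoubleBuffer(object):
    """Two device buffers: one read by the pass, one filled in the background."""

    def __init__(self, make_buffer):
        self.active = make_buffer()  # forward/backward pass reads from this one
        self.spare = make_buffer()   # the next batch is staged here meanwhile

    def swap(self):
        # A cheap pointer swap avoids a device-to-device copy.
        self.active, self.spare = self.spare, self.active
```

`provide_external_data()` would then write into `spare`, and the trainer would call `swap()` once the current pass is done with `active`.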
Is anyone willing to do that within the next few days? Otherwise I think we should remove double buffering and tackle it properly after the release (in 10 days!).
We will need a couple of extra ops it seems:

- one to allocate page-locked memory, around `pycuda.driver.pagelocked_empty`
- one to do an async transfer from host to device, around `pycuda.driver.memcpy_htod_async` (or modify our `set_from_numpy` op)

Oh, and the async transfer needs to be in a new stream; see the sketch after this list.
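Roughly like this (a sketch under the assumption of a dedicated copy stream and a second device buffer; `stage_batch` and the buffer names are placeholders, not brainstorm code):

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a default context
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

copy_stream = drv.Stream()  # dedicated stream so the copy overlaps compute

shape, dtype = (32, 128), np.float32
pinned = drv.pagelocked_empty(shape, dtype)  # page-locked host staging buffer
dev_spare = gpuarray.empty(shape, dtype)     # the "background" device buffer

def stage_batch(batch):
    # Copy into pinned memory first: async host-to-device transfers need a
    # page-locked source to actually run asynchronously.
    pinned[...] = batch
    drv.memcpy_htod_async(dev_spare.gpudata, pinned, stream=copy_stream)

# The trainer would call copy_stream.synchronize() and swap buffers
# once the current forward/backward pass is done.
```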
I'm not sure that I will get to this in the next few days though.