
Execute alexnet using libgpuarray backend

Open · deepali-c opened this issue 7 years ago • 8 comments

I am trying to execute alexnet using the new libgpuarray backend with 1 GPU. The modifications that I have made to the 1-GPU sample are in 1gpu_libgpuarray_patch.txt.

However, with these changes I get the following error:

ValueError: ('The following error happened while compiling the node', DnnVersion(), '\n', 'context name None is already defined')

Complete error log: 1gpu_libgpuarray_error.txt

Further, updating train.py to use theano.gpuarray.use("cuda") instead of theano.gpuarray.use(config['gpu']) makes it start training. But I don't think this is correct. Please advise.

deepali-c · Apr 04 '17 13:04

@deepali-c

The changes for getting the single-GPU train.py working would involve changing any sandbox.cuda functions to their gpuarray alternatives, using device='cuda0' instead of device='gpu0', and moving any import theano after setting up the device context. The device context can be set up like this:

https://github.com/uoguelph-mlrg/Theano-MPI/blob/master/theanompi/models/test_model.py#L11

The parallel loading part is also different, but it is still based on a socket and an IPC handle. See

https://github.com/uoguelph-mlrg/Theano-MPI/blob/master/theanompi/models/data/proc_load_mpi.py#L126

The two-GPU version train_2gpu.py will require replacing the pycuda device_d2d() call and the summation function with the pygpu.collectives all_reduce() function.
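For illustration, a minimal sketch of what that replacement could look like, assuming each worker process already holds a pygpu GpuComm (called gpucomm here) and a pygpu device context ctx; the helper name sum_grads_across_gpus is just a placeholder, not code from this repo:

    import numpy as np
    import pygpu

    def sum_grads_across_gpus(gpucomm, ctx, grads):
        """Sum a list of numpy gradient arrays over all workers with NCCL.

        gpucomm -- a pygpu.collectives.GpuComm shared by the workers
        ctx     -- this worker's pygpu device context
        grads   -- list of float32 numpy.ndarray gradients from this worker
        """
        summed = []
        for g in grads:
            src = pygpu.gpuarray.asarray(g, context=ctx)  # host -> device
            # NCCL all-reduce: every worker receives the element-wise sum
            res = gpucomm.all_reduce(src, '+')
            summed.append(np.asarray(res))                # device -> host
        return summed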

As the method used in theano_alexnet depends strongly on the CudaNdarray backend of Theano versions < 0.9 and on the pycuda library, we'd better make a new branch for trying the new GpuArray backend. However, this will just redo some parts of Theano-MPI.

hma02 · Apr 04 '17 16:04

I have tried the two approaches below for using the new backend with 1 GPU:

  1. Modified train.py to set up the device context as in Theano-MPI (along with changes to use the gpuarray alternatives):
    import os
    os.environ['THEANO_FLAGS'] = 'device={0}'.format(config['gpu'])
    import theano.gpuarray
    # This is a bit of black magic that may stop working in future
    # theano releases
    ctx = theano.gpuarray.type.get_context(None)

This gives the following error:

THEANO_FLAGS=device=cuda0,mode=FAST_RUN,floatX=float32 python train.py
....
...#more output here
.....
`TypeError: Cannot convert Type TensorType(float64, 4D) (of Variable HostFromGpu(gpuarray).0) into Type TensorType(float32, 4D). You can try to manually convert HostFromGpu(gpuarray).0 into a TensorType(float32, 4D).`
  2. Updated train.py according to the patch I shared earlier in this thread; with that, it works fine.

The difference is that in approach 1 I am trying to set up the device context using the new method instead of the pycuda GPU setup. It looks like I have missed something while doing so; please advise.

deepali-c · Jul 10 '17 11:07

@deepali-c

The error looks like a floatX issue.
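One likely cause (my assumption, not verified against your patch): assigning os.environ['THEANO_FLAGS'] inside train.py replaces the flags given on the command line, so the floatX=float32 part is dropped and Theano falls back to float64. A minimal sketch that restates floatX when selecting the device:

    import os
    # Restate floatX here, because this assignment overwrites any
    # THEANO_FLAGS passed on the command line.
    os.environ['THEANO_FLAGS'] = 'device={0},floatX=float32'.format(config['gpu'])
    import theano.gpuarray
    ctx = theano.gpuarray.type.get_context(None)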

Anyway, I just created a pygpu branch, and the single-GPU train.py is working. You can compare your patch with this commit to see what the necessary changes are.

There are some dependency differences. To use this branch, I recommend upgrading to the bleeding-edge libgpuarray/pygpu and Theano. I just tried with those and it works.

 $ backend=gpuarray python train.py
Using cuDNN version 5110 on context None
Mapped name None to device cuda0: GeForce GTX TITAN Black (0000:83:00.0)
... building the model
conv (cudnn) layer with shape_in: (3, 227, 227, 256)
conv (cudnn) layer with shape_in: (96, 27, 27, 256)
conv (cudnn) layer with shape_in: (256, 13, 13, 256)
conv (cudnn) layer with shape_in: (384, 13, 13, 256)
conv (cudnn) layer with shape_in: (384, 13, 13, 256)
fc layer with num_in: 9216 num_out: 4096
dropout layer with P_drop: 0.5
fc layer with num_in: 4096 num_out: 4096
dropout layer with P_drop: 0.5
softmax layer with num_in: 4096 num_out: 1000
... training
shared_x information received
img_mean received
training @ iter =  0
training cost: 6.91343069077
training error rate: 1.0
time per 20 iter: 28.7199730873

hma02 · Jul 10 '17 21:07

@hma02, thank you so much for these changes. The 1-GPU example works with libgpuarray on my setup as well. I am working on the 2-GPU sample next.

deepali-c · Jul 11 '17 11:07

@deepali-c

I just got train_2gpu.py working based on pygpu collectives, which in turn is based on NCCL. So you need to install NCCL, libgpuarray, and its Python wrapper pygpu in order to run this.
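For anyone following along, roughly how two worker processes can build the NCCL communicator with pygpu collectives (a sketch with my own naming, not the code from the branch; the queue used to pass the clique id is just a placeholder):

    import os
    from multiprocessing import Process, Queue

    def worker(rank, device, cid_queue):
        # Bind this process to its GPU before importing theano
        os.environ['THEANO_FLAGS'] = 'device={0},floatX=float32'.format(device)
        import theano.gpuarray
        from pygpu import collectives

        ctx = theano.gpuarray.type.get_context(None)
        local_id = collectives.GpuCommCliqueId(context=ctx)
        if rank == 0:
            cid_queue.put(local_id.comm_id)   # rank 0 shares the unique NCCL id
        else:
            local_id.comm_id = cid_queue.get()
        gpucomm = collectives.GpuComm(local_id, 2, rank)  # 2 workers in total
        # ... build the model and call gpucomm.all_reduce(...) on the gradients

    if __name__ == '__main__':
        q = Queue()
        procs = [Process(target=worker, args=(r, d, q))
                 for r, d in enumerate(['cuda0', 'cuda1'])]
        for p in procs:
            p.start()
        for p in procs:
            p.join()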

hma02 · Aug 11 '17 23:08

Thanks @hma02 .

I observed the following error while executing the 2-GPU sample with the gpuarray backend:

Process Process-2:
Process Process-1:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "train_2gpu.py", line 356, in train_net
    self._target(*self._args, **self._kwargs)
  File "train_2gpu.py", line 356, in train_net
    gpu_send_queue.put(this_val_error)
    gpu_send_queue.put(this_val_error)
UnboundLocalError: local variable 'gpu_send_queue' referenced before assignment
UnboundLocalError: local variable 'gpu_send_queue' referenced before assignment

I made the following change in train_2gpu.py, and then it could proceed without the above error.

-        gpu_send_queue.put(this_val_error)
-        that_val_error = gpu_recv_queue.get()
-        this_val_error = (this_val_error + that_val_error) / 2.
-
-        gpu_send_queue.put(this_val_loss)
-        that_val_loss = gpu_recv_queue.get()
-        this_val_loss = (this_val_loss + that_val_loss) / 2.
+        if os.environ['backend'] == 'gpuarray':
+            exch.exchange()
+        else:
+            gpu_send_queue.put(this_val_error)
+            that_val_error = gpu_recv_queue.get()
+            this_val_error = (this_val_error + that_val_error) / 2.
+
+            gpu_send_queue.put(this_val_loss)
+            that_val_loss = gpu_recv_queue.get()
+            this_val_loss = (this_val_loss + that_val_loss) / 2.

deepali-c · Aug 14 '17 10:08

@deepali-c

Sorry, I forgot to debug the validation part. See the last commit regarding this issue.

The exch.exchange() call is for exchanging the total_params. What we need here is to average the validation error and cost over the two workers. That is similar, but exch is an instance already bound to those total_params, so it won't help with this.
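For reference, one way to average a host-side scalar over the workers with the same communicator (purely an illustrative sketch, not the fix from that commit; average_scalar is a made-up helper name):

    import numpy as np
    import pygpu

    def average_scalar(gpucomm, ctx, value, nworkers=2):
        """Average a Python float across workers via a one-element all-reduce."""
        buf = pygpu.gpuarray.asarray(np.array([value], dtype='float32'), context=ctx)
        total = gpucomm.all_reduce(buf, '+')     # every worker gets the sum
        return float(np.asarray(total)[0]) / nworkers

    # e.g. this_val_error = average_scalar(gpucomm, ctx, this_val_error)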

hma02 · Aug 14 '17 13:08

Thanks @hma02.

I got it now.

deepali-c · Aug 16 '17 06:08