
Training the rel detector with multiple GPUs

wtliao opened this issue 7 years ago • 2 comments

Hi, I have successfully trained the detector using multiple GPUs (8). But I hit the following issue when training the rel detector with more than one GPU (tried on a 1080 Ti, a P100, and a K40):

Traceback (most recent call last):
  File "/home/wtliao/work_space/neural-motifs-master-backup/models/train_rels.py", line 229, in <module>
    rez = train_epoch(epoch)
  File "/home/wtliao/work_space/neural-motifs-master-backup/models/train_rels.py", line 135, in train_epoch
    tr.append(train_batch(batch, verbose=b % (conf.print_interval*10) == 0)) #b == 0))
  File "/home/wtliao/work_space/neural-motifs-master-backup/models/train_rels.py", line 179, in train_batch
    loss.backward()
  File "/home/wtliao/anaconda2/envs/mofit/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/wtliao/anaconda2/envs/mofit/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: narrow is not implemented for type UndefinedType

The code works fine on a single GPU. I have no idea what causes this, and I can't find a solution via Google. Do you have any idea? Thanks.

wtliao avatar Nov 02 '18 09:11 wtliao
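[Editorial note: one commonly reported cause of this error on older PyTorch (0.3.x) is nn.DataParallel gathering multiple outputs when some of them never feed the loss; their gradients stay undefined, and Gather's backward then fails with "narrow is not implemented for type UndefinedType". The minimal sketch below (not code from this repo; module and variable names are made up, and it needs 2+ GPUs to exercise DataParallel) illustrates that pattern.]

import torch
import torch.nn as nn

class TwoHead(nn.Module):
    """Toy module with two outputs; only one will feed the loss."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

    def forward(self, x):
        h = self.fc(x)
        return h, h * 3  # the second output will go unused

model = nn.DataParallel(TwoHead().cuda())
x = torch.randn(8, 4, device='cuda')
a, b = model(x)
loss = a.sum()   # 'b' never reaches the loss, so its gradient stays undefined
loss.backward()  # on PyTorch 0.3.x with multiple GPUs, this could raise the
                 # "narrow is not implemented for type UndefinedType" error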

Sorry, I don't support training the relationship model with multiple GPUs right now (it's not what I used for these experiments). I found it actually doesn't help much in terms of speedup, as the LSTMs are kinda slow and hard to parallelize.

rowanz avatar Nov 02 '18 12:11 rowanz
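[Editorial note: for readers who still want some multi-GPU benefit, a minimal sketch (not part of neural-motifs; the detector/context modules here are hypothetical stand-ins) of one common pattern: wrap only the convolutional detector in nn.DataParallel and keep the LSTM context layers on a single device, since the sequential LSTM is the part that parallelizes poorly.]

import torch.nn as nn

class RelModel(nn.Module):
    """Hypothetical split: data-parallel CNN detector, single-GPU LSTM context."""
    def __init__(self, detector, context_lstm):
        super().__init__()
        # Convolutional features parallelize well, so replicate only this part.
        self.detector = nn.DataParallel(detector)
        # The LSTM runs step-by-step and gains little from data parallelism,
        # so keep it on one device.
        self.context_lstm = context_lstm

    def forward(self, images):
        feats = self.detector(images)    # scattered across GPUs, gathered back here
        return self.context_lstm(feats)  # runs entirely on the default CUDA device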

Thanks, got it. The issue happens in the backward pass of the LSTM.

wtliao avatar Nov 05 '18 03:11 wtliao