neural-motifs
Training the rel detector using multiple GPUs
Hi, I have successfully trained the detector using multiple GPUs (8). But I run into the following issue when training the rel detector with more than one GPU (tried on 1080 Ti, P100, and K40):
Traceback (most recent call last):
File "/home/wtliao/work_space/neural-motifs-master-backup/models/train_rels.py", line 229, in <module>
rez = train_epoch(epoch)
File "/home/wtliao/work_space/neural-motifs-master-backup/models/train_rels.py", line 135, in train_epoch
tr.append(train_batch(batch, verbose=b % (conf.print_interval*10) == 0)) #b == 0))
File "/home/wtliao/work_space/neural-motifs-master-backup/models/train_rels.py", line 179, in train_batch
loss.backward()
File "/home/wtliao/anaconda2/envs/mofit/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/home/wtliao/anaconda2/envs/mofit/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
variables, grad_variables, retain_graph)
RuntimeError: narrow is not implemented for type UndefinedType
The code works fine on a single GPU. I have no idea what causes this, and I can't find a solution via Google. Do you have any idea? Thanks.
Sorry, I don't support training the relationship model with multiple GPUs right now (it's not what I used for these experiments). I found it doesn't actually help much in terms of speedup, as the LSTMs are rather slow and hard to parallelize.
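For anyone hitting this, here is a minimal sketch of the single-GPU setup described above. The environment-variable approach is a common PyTorch workaround, not something from this repo, and the launch command in the comment is only illustrative.

import os

# Make only one GPU visible before any CUDA call; roughly equivalent to
# launching with: CUDA_VISIBLE_DEVICES=0 python models/train_rels.py ...
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
# With one device visible, no cross-GPU gradient gathering happens,
# so the multi-GPU backward error above is avoided.
print(torch.cuda.device_count())  # expected: 1 (assuming a CUDA machine)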
Thanks, got it. The issue happens in the backward pass of the LSTM.