While trying to train with 4 GPUs and args.multiGPU = True, the following error occurs: torch.nn.modules.module.ModuleAttributeError: 'ModelU' object has no attribute 'lxrt_encoder'
Traceback (most recent call last):
File "hm.py", line 392, in
main()
File "hm.py", line 346, in main
hm = HM()
File "hm.py", line 91, in init
self.model.lxrt_encoder.multi_gpu()
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in getattr
type(self).name, name))
torch.nn.modules.module.ModuleAttributeError: 'ModelU' object has no attribute 'lxrt_encoder'
Even after commenting out the above line in hm.py, the model seems to train on only a single GPU according to nvidia-smi.
If you want to run the models via data parallelism, you will need to wrap them in torch.nn.DataParallel. In its current state, args.multiGPU does not work.
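Something along these lines should do it (untested sketch; it assumes self.model is built in HM.__init__ as in your snippet above and that args.multiGPU is the existing flag):

import torch
import torch.nn as nn

# Inside HM.__init__: replicate the model on every visible GPU.
# DataParallel splits tensor inputs along dim 0 and gathers the
# outputs back on the default device.
self.model = self.model.cuda()
if args.multiGPU and torch.cuda.device_count() > 1:
    self.model = nn.DataParallel(self.model)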
Yeah, I tried doing that by changing
self.model = self.model.cuda()
to
self.model = nn.DataParallel(self.model.cuda())
but encountered this error:
Traceback (most recent call last):
File "hm.py", line 390, in
main()
File "hm.py", line 357, in main
hm.train(hm.train_tuple, hm.valid_tuple)
File "hm.py", line 184, in train
logit = self.model(sent, (feats, boxes))
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
AssertionError: Caught AssertionError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/scratch/samyakxd/vilio/entryU.py", line 201, in forward
input_ids, img_feats, img_pos_feats, attn_masks, gather_index = self.preprocess_bert(sents, visual_feats, self.num_features, self.tokenizer)
File "/scratch/samyakxd/vilio/entryU.py", line 192, in preprocess_bert
gather_index = self.get_gather_index(txt_lens, num_bbs, bs, max_tl, out_size)
File "/scratch/samyakxd/vilio/entryU.py", line 211, in get_gather_index
assert len(txt_lens) == len(num_bbs) == batch_size
AssertionError
Yeah, there is still some preprocessing happening in entryU. The assertion most likely fires because nn.DataParallel only splits tensor inputs along the batch dimension, so each replica gets the full list of sentences but only a slice of the visual features, and the lengths no longer match in get_gather_index. Maybe try instead wrapping the self.model inside entryU, i.e. the one created by
self.model, loading_info = BertU.from_pretrained(self.tr_name, img_dim=2048, output_loading_info=True)
with torch.nn.DataParallel?
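Roughly like this (untested sketch; the surrounding attribute names in entryU are assumptions based on your traceback): keep preprocess_bert running on the full batch and wrap only the transformer, so that just the already-built tensors get scattered across GPUs.

import torch
import torch.nn as nn

# In ModelU.__init__ inside entryU:
self.model, loading_info = BertU.from_pretrained(
    self.tr_name, img_dim=2048, output_loading_info=True
)
# Wrap only the transformer itself; preprocess_bert() then still sees the
# whole batch, and DataParallel splits the resulting tensors along dim 0.
if torch.cuda.device_count() > 1:
    self.model = nn.DataParallel(self.model)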