
Error when running code on multiple GPUs

Open samyakag opened this issue 3 years ago • 3 comments

While trying to train with 4 GPUs and args.multiGPU = True, the following error occurs:

    Traceback (most recent call last):
      File "hm.py", line 392, in <module>
        main()
      File "hm.py", line 346, in main
        hm = HM()
      File "hm.py", line 91, in __init__
        self.model.lxrt_encoder.multi_gpu()
      File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in __getattr__
        type(self).__name__, name))
    torch.nn.modules.module.ModuleAttributeError: 'ModelU' object has no attribute 'lxrt_encoder'

Even after commenting out that line in hm.py, the model still appears to train on only a single GPU according to nvidia-smi.

samyakag, Jan 17 '22 16:01

If you want to run the models with data parallelism, you will need to wrap them in torch.nn.DataParallel. In its current state, args.multiGPU does not work.
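For reference, a minimal sketch of that kind of wrapping. `TinyModel` is a hypothetical stand-in, not vilio's `ModelU`, and the shapes are made up. On a machine without CUDA, `nn.DataParallel` just calls the wrapped module directly, so the sketch also runs on CPU.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for the real model (tensors in, logits out)."""

    def __init__(self, in_dim=16, num_classes=2):
        super().__init__()
        self.classifier = nn.Linear(in_dim, num_classes)

    def forward(self, feats):
        return self.classifier(feats)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Parameters must live on the first visible GPU before the forward pass.
model = nn.DataParallel(TinyModel().to(device))

# The batch dimension (dim 0) is what gets split across the visible GPUs.
feats = torch.randn(8, 16, device=device)
logits = model(feats)
print(logits.shape)
```

After wrapping, the original module is reachable as `model.module`, so attribute lookups that used to go through `self.model.<name>` need that extra hop.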

Muennighoff, Jan 17 '22 17:01

Yeah, I tried doing that by changing

    self.model = self.model.cuda()

to

    self.model = nn.DataParallel(self.model.cuda())

but encountered this error:

    Traceback (most recent call last):
      File "hm.py", line 390, in <module>
        main()
      File "hm.py", line 357, in main
        hm.train(hm.train_tuple, hm.valid_tuple)
      File "hm.py", line 184, in train
        logit = self.model(sent, (feats, boxes))
      File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
        output.reraise()
      File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
        raise self.exc_type(msg)
    AssertionError: Caught AssertionError in replica 0 on device 0.
    Original Traceback (most recent call last):
      File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
        output = module(*input, **kwargs)
      File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/scratch/samyakxd/vilio/entryU.py", line 201, in forward
        input_ids, img_feats, img_pos_feats, attn_masks, gather_index = self.preprocess_bert(sents, visual_feats, self.num_features, self.tokenizer)
      File "/scratch/samyakxd/vilio/entryU.py", line 192, in preprocess_bert
        gather_index = self.get_gather_index(txt_lens, num_bbs, bs, max_tl, out_size)
      File "/scratch/samyakxd/vilio/entryU.py", line 211, in get_gather_index
        assert len(txt_lens) == len(num_bbs) == batch_size
    AssertionError
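A likely cause of this assertion: `nn.DataParallel` splits tensor arguments along the batch dimension, but non-tensor leaves such as Python strings are copied whole to every replica. If `sent` is a list of strings (an assumption, since the thread does not show the data loader), each replica receives every sentence but only its slice of `feats` and `boxes`, so the lengths no longer agree. The toy model below mimics that behaviour in plain Python. `FakeTensor` and `scatter_like` are hypothetical illustrations, not PyTorch APIs.

```python
class FakeTensor:
    """Stand-in for a tensor: a list of rows along the batch dimension."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def chunk(self, n):
        # Split the rows into up to n contiguous, nearly equal pieces.
        size = -(-len(self.rows) // n)  # ceiling division
        return [FakeTensor(self.rows[i:i + size])
                for i in range(0, len(self.rows), size)]


def scatter_like(obj, n):
    """Simplified model of argument scattering: one entry per replica."""
    if isinstance(obj, FakeTensor):
        return obj.chunk(n)  # tensors are split along the batch
    if isinstance(obj, (list, tuple)) and obj:
        per_item = [scatter_like(o, n) for o in obj]
        # Containers are rebuilt per replica from their scattered items.
        return [type(obj)(parts) for parts in zip(*per_item)]
    return [obj] * n  # anything else (e.g. a str) is copied whole


sents = ["meme text %d" % i for i in range(8)]  # batch of 8 strings
feats = FakeTensor(range(8))
boxes = FakeTensor(range(8))

# Mirrors the call self.model(sent, (feats, boxes)) on 4 replicas.
replica_inputs = scatter_like((sents, (feats, boxes)), 4)
r_sents, (r_feats, r_boxes) = replica_inputs[0]
print(len(r_sents), len(r_feats))  # 8 2
```

Replica 0 sees 8 sentences but only 2 rows of features, which is the kind of mismatch the assertion in `get_gather_index` rejects.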

samyakag, Jan 17 '22 18:01

Yeah, there is some preprocessing still happening in entryU. Maybe try instead wrapping the self.model inside entryU, which is created by

    self.model, loading_info = BertU.from_pretrained(self.tr_name, img_dim=2048, output_loading_info=True)

with torch.nn.DataParallel?
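One way to read this suggestion, sketched under assumptions: keep the string preprocessing in the outer module so it sees the full batch, and wrap only the inner tensor-to-tensor module in `nn.DataParallel`. `InnerEncoder`, `Entry`, and the toy tokenizer are hypothetical stand-ins, not the actual `BertU` or entryU code.

```python
import torch
import torch.nn as nn


class InnerEncoder(nn.Module):
    """Hypothetical stand-in for the pretrained encoder: tensors in, logits out."""

    def __init__(self, vocab_size=100, img_dim=2048, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.head = nn.Linear(hidden, 2)

    def forward(self, input_ids, img_feats):
        text = self.embed(input_ids).mean(dim=1)
        image = self.img_proj(img_feats).mean(dim=1)
        return self.head(text + image)


class Entry(nn.Module):
    """Hypothetical outer module: preprocess the whole batch, then go parallel."""

    def __init__(self):
        super().__init__()
        # Only the tensor-only part is wrapped, so scattering sees tensors only.
        self.model = nn.DataParallel(InnerEncoder())

    def preprocess(self, sents, max_len=6):
        # Toy tokenizer: map each word to an id in 1..99, pad with 0.
        ids = [[(sum(map(ord, w)) % 99) + 1 for w in s.split()][:max_len]
               for s in sents]
        ids = [row + [0] * (max_len - len(row)) for row in ids]
        return torch.tensor(ids, dtype=torch.long)

    def forward(self, sents, feats):
        input_ids = self.preprocess(sents).to(feats.device)
        return self.model(input_ids, feats)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
entry = Entry().to(device)

sents = ["a short caption", "another one", "third", "and a fourth caption"]
feats = torch.randn(4, 10, 2048, device=device)
logits = entry(sents, feats)
print(logits.shape)
```

With this layout every tensor reaching the wrapped module shares the same batch dimension, so each replica gets matching slices. Note that parameter names under the wrapper gain a `module.` prefix in `state_dict()`, which matters when saving or loading checkpoints.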

Muennighoff, Jan 18 '22 05:01