Error when using multiple GPUs
Thank you for the awesome and interesting research and project. I was wondering if anyone has encountered the following error when using multiple GPUs. I have 4 Titan V GPUs, and to use them I've set the local rank to -1. But it seems a problem occurs during the forward computation.
export DATA_DIR=datasets/paranmt_filtered
source style-venv/bin/activate
BASE_DIR=style_paraphrase
python -m torch.distributed.launch --nproc_per_node=1 $BASE_DIR/run_lm_finetuning.py \
--output_dir=$BASE_DIR/saved_models/test_paraphrase \
--model_type=gpt2 \
--model_name_or_path=gpt2-large \
--data_dir=$DATA_DIR \
--do_train \
--save_steps 500 \
--logging_steps 20 \
--save_total_limit -1 \
--evaluate_during_training \
--num_train_epochs 3 \
--gradient_accumulation_steps 2 \
--per_gpu_train_batch_size 5 \
--per_gpu_eval_batch_size 5 \
--job_id paraphraser_test \
--learning_rate 5e-5 \
--prefix_input_type original \
--global_dense_feature_list none \
--specific_style_train -1 \
--optimizer adam \
--fp16 \
--fp16_opt_level "O3" \
--overwrite_output_dir \
--local_rank -1
Traceback (most recent call last):
File "style_paraphrase/run_lm_finetuning.py", line 505, in <module>
main()
File "style_paraphrase/run_lm_finetuning.py", line 422, in main
global_step, tr_loss = train(args, gpt2_model, train_dataset, tokenizer)
File "style_paraphrase/run_lm_finetuning.py", line 228, in train
loss = gpt2_model(batch)
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/workspace/style-transformer/style-transfer-paraphrase/style_paraphrase/utils.py", line 87, in forward
labels=labels
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/apex/amp/_initialize.py", line 197, in new_fwd
**applier(kwargs, input_caster))
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1059, in forward
return_dict=return_dict,
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 832, in forward
inputs_embeds = self.wte(input_ids)
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 126, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/functional.py", line 1814, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:403
Hi @eomiso,
Did you try --nproc_per_node=4? Also, I don't think you have to set local_rank explicitly; I think PyTorch does it for you.
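For reference, a sketch of the launch command with that suggestion applied (assuming all other flags stay as in the original report): --nproc_per_node=4 spawns one process per Titan V, and --local_rank is dropped so the launcher can pass it to the script itself.

export DATA_DIR=datasets/paranmt_filtered
source style-venv/bin/activate
BASE_DIR=style_paraphrase

# One process per GPU; torch.distributed.launch supplies --local_rank automatically.
python -m torch.distributed.launch --nproc_per_node=4 $BASE_DIR/run_lm_finetuning.py \
--output_dir=$BASE_DIR/saved_models/test_paraphrase \
--model_type=gpt2 \
--model_name_or_path=gpt2-large \
--data_dir=$DATA_DIR \
--do_train \
--save_steps 500 \
--logging_steps 20 \
--save_total_limit -1 \
--evaluate_during_training \
--num_train_epochs 3 \
--gradient_accumulation_steps 2 \
--per_gpu_train_batch_size 5 \
--per_gpu_eval_batch_size 5 \
--job_id paraphraser_test \
--learning_rate 5e-5 \
--prefix_input_type original \
--global_dense_feature_list none \
--specific_style_train -1 \
--optimizer adam \
--fp16 \
--fp16_opt_level "O3" \
--overwrite_output_dir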