CodeT5
Scalar issue: Data Parallel with 2 core GPU
Dear Team,
I tried to train the model with 2 GPUs (devices 0,1) and ran into the following problem, which I did not face with a single GPU. Could you please help me resolve the issue?
Environment: Kaggle (Accelerator: GPU T4 x 2)
/opt/conda/lib/python3.7/site-packages/transformers/optimization.py:395: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
FutureWarning,
Training: 0%| | 0/3125 [00:00<?, ?it/s]/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
[0] Train loss 0.258: 100%|██████████| 3125/3125 [29:17<00:00, 1.78it/s]
100%|██████████| 2000/2000 [00:07<00:00, 273.69it/s]
Eval ppl: 0%| | 0/63 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/kaggle/working/CodeT5/run_gen.py", line 387, in
I faced a similar issue. I added a condition like the one below in run_gen.py (around line 75):

outputs = model(input_ids=source_ids, attention_mask=source_mask,
                labels=target_ids, decoder_attention_mask=target_mask)
loss = outputs.loss

if args.n_gpu > 1:
    loss = loss.mean()  # DataParallel returns one loss per GPU; reduce to a scalar
It now works for me.
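For anyone wondering why the "gather along dimension 0" UserWarning shows up: with nn.DataParallel each replica returns a 0-dim loss, and the gather step unsqueezes and stacks them into a vector of length n_gpu, which is why backward() needs the .mean() reduction. A standalone illustration (the loss values are made up; no GPU needed):

import torch

# Roughly what the gather step hands back with 2 GPUs: one 0-dim loss per replica,
# stacked into a 1-D tensor -- exactly what the UserWarning describes
per_gpu_losses = torch.stack([torch.tensor(0.26, requires_grad=True),
                              torch.tensor(0.25, requires_grad=True)])
print(per_gpu_losses.shape)   # torch.Size([2]) -- not a scalar, so backward() would fail

loss = per_gpu_losses.mean()  # reduce back to a 0-dim scalar
loss.backward()               # now works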
Hi, I'm unable to finetune with multiple GPUs. Can @eswarthammana or @alibrahimzada tell me about any modifications required to the scripts for this?
Tx
Make sure you execute your script with torchrun rather than python3/python. I don't think there are any other requirements for multi-GPU execution.
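For reference, a torchrun launch would look roughly like the line below; this is only a sketch and assumes the script reads the rank/world-size environment variables that torchrun sets (keep whatever arguments you already pass after the script name):

torchrun --nproc_per_node=2 run_gen.py  # followed by the same arguments you currently pass with python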
Hi @Sleepyhead01,
What I tried is in exp_with_args.sh: at the end of the file there is CUDA_VISIBLE_DEVICES=${GPU}; modify the ${GPU} value to 0,1. If you try to pass it through the code it accepts only a single integer, so we cannot pass more than one device ID that way.
As @alibrahimzada mentioned, modify the loss to loss.mean().
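A rough standalone sketch of how these two changes connect (illustrative only, not the repo's exact code): once both devices are visible, torch.cuda.device_count() becomes 2, which is what triggers the DataParallel wrapping and, in turn, the loss.mean() reduction above.

import os
import torch
import torch.nn as nn

# Must be set before the first CUDA call in a fresh process
# (exp_with_args.sh normally exports this via CUDA_VISIBLE_DEVICES=${GPU})
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

model = nn.Linear(8, 2)            # stand-in for the CodeT5 model
n_gpu = torch.cuda.device_count()  # 2 on the Kaggle T4 x 2 runtime
if n_gpu > 1:
    model = nn.DataParallel(model.cuda())  # replicate across the visible devices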
With this modification, training on multiple GPUs starts fine. However, eval_bleu_epoch gives the following error:
Traceback (most recent call last):
File "CodeT5/run_gen.py", line 392, in <module>
main()
File "CodeT5/run_gen.py", line 319, in main
result = eval_bleu_epoch(args, eval_data, eval_examples, model, tokenizer, 'dev', 'e%d' % cur_epoch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "CodeT5/run_gen.py", line 109, in eval_bleu_epoch
preds = model.generate(source_ids,
^^^^^^^^^^^^^^
File "anaconda3/envs/Old_R/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'generate'
Any fix for this? Tx
@Sleepyhead01 you need to do model.module.generate(), because for n_gpu > 1 the model is wrapped in DataParallel and the underlying model is stored as its .module attribute. To get the model back, call .module on the wrapper.
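A self-contained sketch of the fix (the checkpoint name and example input below are placeholders; in run_gen.py only the object that generate() is called on needs to change):

import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)  # same wrapping the training loop uses

source_ids = tokenizer("def add(a, b): return a + b",
                       return_tensors="pt").input_ids.to(device)

# DataParallel only forwards __call__/forward, so unwrap before calling generate()
gen_model = model.module if hasattr(model, "module") else model
preds = gen_model.generate(source_ids, max_length=32)
print(tokenizer.decode(preds[0], skip_special_tokens=True))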
Unfortunately, the authors have not kept these scripts up to date with newer versions of torch.