CodeT5

Scalar issue: DataParallel with 2 GPUs

eswarthammana opened this issue 2 years ago • 6 comments

Dear Team,

I tried to train the model with 2 GPUs (devices 0,1) and I faced the following problem, which I have not faced with a single GPU. Could you please help me solve the issue?

Environment: Kaggle Accelerator: GPU T4 x 2

/opt/conda/lib/python3.7/site-packages/transformers/optimization.py:395: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  FutureWarning,
Training:   0%|          | 0/3125 [00:00<?, ?it/s]
/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
[0] Train loss 0.258: 100%|██████████| 3125/3125 [29:17<00:00, 1.78it/s]
100%|██████████| 2000/2000 [00:07<00:00, 273.69it/s]
Eval ppl:   0%|          | 0/63 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/kaggle/working/CodeT5/run_gen.py", line 387, in <module>
    main()
  File "/kaggle/working/CodeT5/run_gen.py", line 265, in main
    eval_ppl = eval_ppl_epoch(args, eval_data, eval_examples, model, tokenizer)
  File "/kaggle/working/CodeT5/run_gen.py", line 75, in eval_ppl_epoch
    eval_loss += loss.item()
ValueError: only one element tensors can be converted to Python scalars

eswarthammana avatar Apr 17 '23 04:04 eswarthammana

I faced a similar issue. I added a condition like below in run_gen.py (line 75):

outputs = model(input_ids=source_ids, attention_mask=source_mask,
                labels=target_ids, decoder_attention_mask=target_mask)
loss = outputs.loss
if args.n_gpu > 1:
    # DataParallel returns one loss per GPU; average them back to a scalar
    loss = loss.mean()

It now works for me.
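
Background on why the mean() is needed (an illustrative snippet, not CodeT5 code): DataParallel computes one loss per replica and gathers them along dim 0 (that is what the "all input tensors were scalars" warning is about), so with two GPUs loss is a length-2 vector and loss.item() raises the ValueError from the traceback.

import torch

loss = torch.tensor([0.25, 0.27])   # e.g. the two per-GPU losses gathered by DataParallel
# loss.item()                       # ValueError: only one element tensors can be converted to Python scalars
print(loss.mean().item())           # averaging first works for 1 GPU and N GPUs alike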

alibrahimzada avatar May 04 '23 21:05 alibrahimzada

Hi, I'm unable to finetune with multiple GPUs. Can @eswarthammana or @alibrahimzada tell me about any modifications required to the scripts for this?

Tx

Sleepyhead01 avatar May 30 '23 19:05 Sleepyhead01

Make sure you execute your script with torchrun rather than python3/python. I don't think there are any other requirements for multi-GPU execution.

alibrahimzada avatar May 30 '23 23:05 alibrahimzada

Hi @Sleepyhead01,

What I tried: at the end of exp_with_args.sh there is CUDA_VISIBLE_DEVICES=${GPU}; modify that ${GPU} value to 0,1 directly in the script. Passing it through the code does not work because the argument accepts only a single integer, so you cannot pass more than one device that way.

And as @alibrahimzada mentioned, change the loss to loss.mean().
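
Roughly what those two edits amount to on the Python side (a sketch with a stand-in model, not the actual run_gen.py code):

import os
import torch

# What CUDA_VISIBLE_DEVICES=0,1 in exp_with_args.sh achieves: expose both GPUs
# to the process before torch touches CUDA.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

model = torch.nn.Linear(8, 8).to(device)   # stand-in for the CodeT5 model
if n_gpu > 1:
    # DataParallel replicates the model across the visible GPUs; the loss it
    # returns is then a per-GPU vector, hence the loss.mean() change above.
    model = torch.nn.DataParallel(model)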

eswarthammana avatar May 31 '23 04:05 eswarthammana

With this modification, training on multiple GPUs starts fine. However, eval_bleu_epoch gives the following error:

Traceback (most recent call last):
  File "CodeT5/run_gen.py", line 392, in <module>
    main()
  File "CodeT5/run_gen.py", line 319, in main
    result = eval_bleu_epoch(args, eval_data, eval_examples, model, tokenizer, 'dev', 'e%d' % cur_epoch)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "CodeT5/run_gen.py", line 109, in eval_bleu_epoch
    preds = model.generate(source_ids,
            ^^^^^^^^^^^^^^
  File "anaconda3/envs/Old_R/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'generate'

Any fix for this? Tx

Sleepyhead01 avatar May 31 '23 18:05 Sleepyhead01

@Sleepyhead01 you need to call model.module.generate(): when n_gpu > 1 the model is wrapped in DataParallel, and the underlying model is stored in its .module attribute, so you have to go through .module to reach generate().
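
For reference, a tiny self-contained repro of the error and the unwrap, using a stand-in module rather than CodeT5 itself (keep whatever arguments run_gen.py already passes to generate()):

import torch

class Toy(torch.nn.Module):
    # Stand-in for the CodeT5 model: only the method name matters here.
    def generate(self, x):
        return x

model = torch.nn.DataParallel(Toy())
# model.generate(...)              # AttributeError: 'DataParallel' object has no attribute 'generate'
gen_model = model.module if hasattr(model, "module") else model   # unwrap if wrapped
print(gen_model.generate(torch.ones(1)))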

Unfortunately the authors have not kept these scripts up to date with newer versions of torch.

alibrahimzada avatar May 31 '23 18:05 alibrahimzada