
a problem in multi-GPU training at the last batch

Open paanguin opened this issue 5 years ago • 3 comments

This problem appeared after https://github.com/gentaiscool/end2end-asr-pytorch/issues/24, so I am keeping the link, but I think it is a separate issue. Thank you for solving that one.

I merged your code into my master branch, but I got another error when running with multiple GPUs. The issue only appears at the last batch, which is smaller than the other batches, so I can actually bypass it by passing drop_last=True to the train_loader and valid_loader in train.py.

I tried to reproduce the error with a toy example. It seems to be related to the batch size and the number of GPUs. I made manifest files containing 10 examples each and ran the example on 4 GPUs with a batch size of 8.

python train.py --train-manifest-list ~/asr/data/librispeech/libri_asis_10 --valid-manifest-list ~/asr/data/librispeech/libri_dev_10 --test-manifest-list ~/asr/data/librispeech/libri_test_clean --labels-path data/labels/labels.json --cuda --save-every 1 --save-folder trained_models/ --name librispeech_drop0.1_cnn_batch12_4_vgg_layer4_lr0.1 --epochs 10 --cuda --batch-size 8 --lr 0.1 --save-folder save/ --save-every 1 --feat_extractor vgg_cnn --dropout 0.1 --num-layers 4 --num-heads 8 --dim-model 512 --dim-key 64 --dim-value 64 --dim-input 161 --dim-inner 2048 --dim-emb 512 --shuffle --min-lr 1e-6 --k-lr 1 --parallel --device-ids 0 1 2 3

I then encountered the following error. The same error appears whenever the batch size is not a multiple of the number of GPUs; for example, it also occurs with a batch size of 6 on 4 GPUs.

In the run below, device 2 does not seem to receive any examples in the last batch: only 2 examples remain while there are 4 GPUs, so devices 0 and 1 get examples and devices 2 and 3 do not.

I think this is a natural situation, yet the program raises an error. I also tested the official example at https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html and it did not give me any errors, so I don't know why the error appears in this code.
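As a sanity check, here is a minimal toy sketch in the spirit of that tutorial (my own code, not from the tutorial or from this repository): with plain tensor inputs and no keyword arguments, DataParallel simply builds fewer replicas when the batch is smaller than the number of GPUs and runs without errors, which matches what I saw with the tutorial.

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def forward(self, x):
        # each replica reports how many samples it received
        print('replica got batch of size', x.size(0))
        return x.sum(dim=1, keepdim=True)

# assumes at least 2 CUDA devices; with a batch of 2 samples and 4 GPUs,
# DataParallel only creates 2 replicas, so no device "starves" into an error
if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(ToyModel().cuda())
    out = model(torch.randn(2, 10).cuda())
    print(out.shape)  # torch.Size([2, 1])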

I would like to know whether this error appears in your version of the code as well. I suspect I may have made a mistake while merging your code into my own version.

==================================================
THE EXPERIMENT LOG IS SAVED IN: log/librispeech_drop0.1_cnn_batch12_4_vgg_layer4_lr0.1
TRAINING MANIFEST: ['/home/hh1208-kang/asr/data/librispeech/libri_asis_10']
VALID MANIFEST: ['/home/hh1208-kang/asr/data/librispeech/libri_dev_10']
TEST MANIFEST: ['/home/hh1208-kang/asr/data/librispeech/libri_test_clean']
==================================================
load with device_ids [0, 1, 2, 3]
(Epoch 1) TRAIN LOSS:4.3495 CER:96.68% LR:0.0000010: 50%|███████████████████████████████████████████████ | 1/2 [00:09<00:09, 9.41s/it]
Traceback (most recent call last):
  File "train.py", line 116, in <module>
    trainer.train(model, train_loader, train_sampler, valid_loader_list, opt, loss_type, start_epoch, num_epochs, label2id, id2label, metrics)
  File "/home/hh1208-kang/end2end-asr-pytorch-user/trainer/asr/trainer.py", line 58, in train
    pred, gold, hyp_seq, gold_seq = model(src, src_lengths, tgt, verbose=False)
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 2 on device 2.
Original Traceback (most recent call last):
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 3 required positional arguments: 'padded_input', 'input_lengths', and 'padded_target'

P.S. This is my debugging run, with prints modeled on the PyTorch tutorial example; it might help explain the problem.

python train.py --train-manifest-list ~/asr/data/librispeech/libri_asis_10 --valid-manifest-list ~/asr/data/librispeech/libri_dev_10 --test-manifest-list ~/asr/data/librispeech/libri_test_clean --labels-path data/labels/labels.json --cuda --save-every 1 --save-folder trained_models/ --name librispeech_drop0.1_cnn_batch12_4_vgg_layer4_lr0.1 --epochs 10 --cuda --batch-size 4 --lr 0.1 --save-folder save/ --save-every 1 --feat_extractor vgg_cnn --dropout 0.1 --num-layers 4 --num-heads 8 --dim-model 512 --dim-key 64 --dim-value 64 --dim-input 161 --dim-inner 2048 --dim-emb 512 --shuffle --min-lr 1e-6 --k-lr 1 --parallel --device-ids 0 1 2 3

==================================================
THE EXPERIMENT LOG IS SAVED IN: log/librispeech_drop0.1_cnn_batch12_4_vgg_layer4_lr0.1
TRAINING MANIFEST: ['/home/hh1208-kang/asr/data/librispeech/libri_asis_10']
VALID MANIFEST: ['/home/hh1208-kang/asr/data/librispeech/libri_dev_10']
TEST MANIFEST: ['/home/hh1208-kang/asr/data/librispeech/libri_test_clean']
==================================================
load with device_ids [0, 1, 2, 3]
0%| | 0/2 [00:00<?, ?it/s]
In Model: input size torch.Size([1, 128, 40, 395]) output size torch.Size([1, 395, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 395]) output size torch.Size([1, 395, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 395]) output size torch.Size([1, 395, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 395]) output size torch.Size([1, 395, 512]) torch.Size([1, 1000, 32])
(Epoch 1) TRAIN LOSS:4.3895 CER:97.83% LR:0.0000010: 50%|████████████████████████████████████████████████ | 1/2 [00:11<00:11, 11.38s/it]
In Model: input size torch.Size([1, 128, 40, 403]) output size torch.Size([1, 403, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 403]) output size torch.Size([1, 403, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 403]) output size torch.Size([1, 403, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 403]) output size torch.Size([1, 403, 512]) torch.Size([1, 1000, 32])
(Epoch 1) TRAIN LOSS:4.3098 CER:95.95% LR:0.0000010: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00, 8.09s/it]

0%| | 0/3 [00:00<?, ?it/s]
In Model: input size torch.Size([1, 128, 40, 312]) output size torch.Size([1, 312, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 312]) output size torch.Size([1, 312, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 312]) output size torch.Size([1, 312, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 312]) output size torch.Size([1, 312, 512]) torch.Size([1, 1000, 32])
VALID SET 0 LOSS:4.2888 CER:106.35%: 33%|█████████████████████████████████████▎ | 1/3 [00:00<00:01, 1.74it/s]
In Model: input size torch.Size([1, 128, 40, 735]) output size torch.Size([1, 735, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 735]) output size torch.Size([1, 735, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 735]) output size torch.Size([1, 735, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 735]) output size torch.Size([1, 735, 512]) torch.Size([1, 1000, 32])
VALID SET 0 LOSS:4.2631 CER:107.82%: 67%|██████████████████████████████████████████████████████████████████████████▋ | 2/3 [00:00<00:00, 2.08it/s]
In Model: input size torch.Size([1, 128, 40, 457]) output size torch.Size([1, 457, 512]) torch.Size([1, 1000, 32])
In Model: input size torch.Size([1, 128, 40, 457]) output size torch.Size([1, 457, 512]) torch.Size([1, 1000, 32])
Traceback (most recent call last):
  File "train.py", line 122, in <module>
    label2id, id2label, metrics)
  File "/home/hh1208-kang/end2end-asr-pytorch-user/trainer/asr/trainer.py", line 145, in train
    pred, gold, hyp_seq, gold_seq = model(src, src_lengths, tgt, verbose=False)
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 2 on device 2.
Original Traceback (most recent call last):
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/hh1208-kang/venv/lib/python3.5/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 3 required positional arguments: 'padded_input', 'input_lengths', and 'padded_target'

paanguin avatar Feb 05 '20 06:02 paanguin

I also ran into this problem. I worked around it by modifying the BucketingSampler code in dataloader.py: in its __init__ function I drop the last batch if it is smaller than the specified batch size.
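For reference, a rough sketch of that change, assuming the sampler keeps its batches as lists of indices in self.bins like the deepspeech.pytorch-style BucketingSampler this repo appears to use (please check against the actual dataloader.py):

from torch.utils.data.sampler import Sampler

class BucketingSampler(Sampler):
    def __init__(self, data_source, batch_size=1, drop_last=False):
        super(BucketingSampler, self).__init__(data_source)
        self.data_source = data_source
        ids = list(range(0, len(data_source)))
        # group consecutive indices into batches ("bins")
        self.bins = [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
        # drop the final, smaller bin so no GPU is left without samples
        if drop_last and self.bins and len(self.bins[-1]) < batch_size:
            self.bins = self.bins[:-1]

    def __iter__(self):
        for ids in self.bins:
            yield ids

    def __len__(self):
        return len(self.bins)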

ArtemisZGL avatar Feb 18 '20 02:02 ArtemisZGL

Yes, I am using a similar solution. I modified the code not to use the BucketingSampler and to initialize AudioDataLoader directly, as follows:

train_loader = AudioDataLoader(train_data, num_workers=args.num_workers, batch_size=args.batch_size, shuffle=args.shuffle, drop_last=True)

valid_loader = AudioDataLoader(valid_data, num_workers=args.num_workers, batch_size=args.batch_size, drop_last=True)

Yet I'm still curious how I can make use of the last batch, since the tutorial seems to handle it: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html.

paanguin avatar Feb 18 '20 02:02 paanguin

While making other updates to my own version of the code, I think I found a solution. I had modified the program to treat batch_size as a maximum number of frames, and during that work I hit the same multi-GPU error whenever any GPU device 'starves', i.e. has no sample to process.
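To illustrate what I mean by treating batch_size as a maximum number of frames, here is a rough sketch of the idea (hypothetical class and parameter names; not the code I actually use): a batch sampler that packs utterance indices together until a frame budget would be exceeded.

from torch.utils.data.sampler import Sampler

class FrameBudgetSampler(Sampler):
    # Hypothetical sketch: build variable-size batches whose total number of
    # frames stays under max_frames, instead of a fixed number of utterances.
    def __init__(self, frame_lengths, max_frames):
        # frame_lengths[i] = number of feature frames of utterance i
        self.bins = []
        current, current_frames = [], 0
        for idx, n_frames in enumerate(frame_lengths):
            if current and current_frames + n_frames > max_frames:
                self.bins.append(current)
                current, current_frames = [], 0
            current.append(idx)
            current_frames += n_frames
        if current:
            self.bins.append(current)

    def __iter__(self):
        return iter(self.bins)

    def __len__(self):
        return len(self.bins)

Such a sampler can then be passed to the DataLoader through the batch_sampler argument.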

The error was raised at these calls: https://github.com/gentaiscool/end2end-asr-pytorch/blob/a22efdda28f1689206eefb3fedbb56fbcc98a753/trainer/asr/trainer.py#L58 and https://github.com/gentaiscool/end2end-asr-pytorch/blob/a22efdda28f1689206eefb3fedbb56fbcc98a753/trainer/asr/trainer.py#L139

It seems the error can be avoided by changing those calls to: pred, gold, hyp_seq, gold_seq = model(src, src_lengths, tgt). I don't know exactly why this fixes it; my guess is that the way keyword arguments are scattered in torch/nn/parallel does not play well with replicas that receive no samples. Anyway, the modification works for me, although I cannot guarantee it works for the original version of the code.
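For what it's worth, my reading of torch/nn/parallel/scatter_gather.py in PyTorch 1.x (an assumption about internals, not something I have verified with the maintainers) is that a non-tensor keyword argument such as verbose=False gets replicated once per device, while the tensors are only split into as many chunks as there are samples; scatter_kwargs then pads the shorter input list with empty tuples, so the extra replicas are called with only verbose=False and no positional inputs, which is exactly the 'missing 3 required positional arguments' error. A small sketch that shows the padding (needs 4 CUDA devices, and uses an internal, undocumented helper):

import torch
from torch.nn.parallel.scatter_gather import scatter_kwargs  # internal API

if torch.cuda.device_count() >= 4:
    src = torch.randn(2, 5).cuda()            # last batch: only 2 samples
    src_lengths = torch.tensor([5, 5]).cuda()
    tgt = torch.zeros(2, 3).cuda()

    # with the non-tensor kwarg, kwargs are replicated to all 4 devices while
    # the tensors split into only 2 chunks, so inputs are padded with ()
    inputs, kwargs = scatter_kwargs((src, src_lengths, tgt), {'verbose': False}, [0, 1, 2, 3])
    print([len(chunk) for chunk in inputs])   # [3, 3, 0, 0]
    print([len(k) for k in kwargs])           # [1, 1, 1, 1]

    # without the kwarg, only 2 chunks are produced and only 2 replicas run
    inputs, kwargs = scatter_kwargs((src, src_lengths, tgt), {}, [0, 1, 2, 3])
    print(len(inputs))                        # 2

Dropping verbose=False keeps kwargs empty, so DataParallel only builds replicas for the devices that actually received samples, which is why the change above lets the last batch go through.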

paanguin avatar Mar 16 '20 00:03 paanguin