Real-Time-Voice-Cloning

Can't train on two GPUs

Open zxmanxz opened this issue 4 years ago • 21 comments

Hi, when I tried to train the synthesizer model on my laptop with one Nvidia 1650 GPU everything worked, but when I tried to run the training process on my server with two Nvidia GeForce 1080 Ti GPUs I got an error:

```
╰─ python synthesizer_train.py pretrained_new datasets/SV2TTS/synthesizer -s 50 -b 50 ─╯

Arguments:
    run_id:          pretrained_new
    syn_dir:         datasets/SV2TTS/synthesizer
    models_dir:      synthesizer/saved_models/
    save_every:      50
    backup_every:    50
    force_restart:   False
    hparams:

Checkpoint path: synthesizer/saved_models/pretrained_new/pretrained_new.pt
Loading training data from: datasets/SV2TTS/synthesizer/train.txt
Using model: Tacotron
Using device: cuda

Initialising Tacotron Model...

Trainable Parameters: 30.870M

Loading weights at synthesizer/saved_models/pretrained_new/pretrained_new.pt
Tacotron weights loaded from step 0
Using inputs from:
    datasets/SV2TTS/synthesizer/train.txt
    datasets/SV2TTS/synthesizer/mels
    datasets/SV2TTS/synthesizer/embeds
Found 259 samples

+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   20k Steps    |     12     |     0.001     |        2         |
+----------------+------------+---------------+------------------+

Traceback (most recent call last):
  File "synthesizer_train.py", line 35, in <module>
    train(**vars(args))
  File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/train.py", line 175, in train
    mels, embeds)
  File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/utils/__init__.py", line 17, in data_parallel_workaround
    outputs = torch.nn.parallel.parallel_apply(replicas, inputs)
  File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/models/tacotron.py", line 362, in forward
    device = next(self.parameters()).device  # use same device as parameters
StopIteration
```

zxmanxz avatar Feb 16 '21 16:02 zxmanxz

My ability to help with this is limited, since I don't have a server with multiple GPUs to test.

Let's see whether the data_parallel_workaround is actually required. In synthesizer/train.py, try copying the code from line 177 over to line 174 (the relevant block is sketched below). https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/10ca8f7c785707f21c78cfe858a97841c1d875ba/synthesizer/train.py#L172-L177
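For reference, the linked block looks roughly like this (paraphrased from the traceback above; exact whitespace and names may differ). The StopIteration apparently comes from `next(self.parameters())` finding an empty parameter list inside a DataParallel replica, so the edit amounts to always taking the single-device branch:

```python
# Paraphrase of synthesizer/train.py lines 172-177. Copying line 177 over
# lines 174-175 makes training skip data_parallel_workaround() and always
# run the forward pass on a single device:
if device.type == "cuda" and torch.cuda.device_count() > 1:
    m1_hat, m2_hat, attention, stop_pred = data_parallel_workaround(
        model, texts, mels, embeds)
else:
    m1_hat, m2_hat, attention, stop_pred = model(texts, mels, embeds)
```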

ghost avatar Feb 16 '21 16:02 ghost

And paste it where? Also, I tried setting CUDA_VISIBLE_DEVICES=0 (to use only one GPU), but the problem was the same...

zxmanxz avatar Feb 16 '21 17:02 zxmanxz

If you get an identical message on a single GPU, then something is wrong because it shouldn't be executing the multi-GPU code.

Why don't you try setting CUDA_VISIBLE_DEVICES inside synthesizer_train.py? (This file is in the root of the repo, unlike train.py.) See https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/489#issuecomment-673358005. Then run the original, unmodified code. Paste the error message if you get one.
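For example (following the linked comment; note the variable must be set before the first CUDA call):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU to torch
```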

ghost avatar Feb 16 '21 18:02 ghost

It works fine with a single GPU. Maybe you can give me advice on how to get full GPU usage (e.g. right now it is using only 4 GB, and the other 7 are available)?

zxmanxz avatar Feb 16 '21 19:02 zxmanxz

To increase VRAM usage, adjust the batch size parameter (the far-right number in each tuple) in hparams. https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/10ca8f7c785707f21c78cfe858a97841c1d875ba/synthesizer/hparams.py#L52-L57
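For reference, a sketch of the tts_schedule format in synthesizer/hparams.py. Only the first row is confirmed by the log above (r=2, lr 0.001, 20k steps, batch size 12); the later rows are illustrative and may differ in your checkout:

```python
# Each tuple is (r, learning_rate, step_cutoff, batch_size). The far-right
# number is the batch size; raising it (e.g. 12 -> 24) increases VRAM usage.
tts_schedule = [(2, 1e-3,  20_000, 12),
                (2, 5e-4,  40_000, 12),
                (2, 2e-4,  80_000, 12),
                (2, 1e-4, 160_000, 12)]
```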

ghost avatar Feb 16 '21 20:02 ghost

Thank you. If there were a way to parallelize computation across multiple GPUs, that would be great.

zxmanxz avatar Feb 16 '21 20:02 zxmanxz

@zxmanxz Try this branch for multi-GPU training. If it works I will submit a pull request. https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/664_multi_gpu_training
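For anyone curious, the essence of such a change is wrapping the model in torch.nn.DataParallel so each batch is split across the visible GPUs. A minimal, self-contained sketch with a toy stand-in for the repo's Tacotron (the actual branch may differ):

```python
import torch
import torch.nn as nn

class Net(nn.Module):  # toy stand-in for the synthesizer's Tacotron model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Net().to(device)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the model, scatters each batch

out = model(torch.randn(16, 8, device=device))  # rows split across the GPUs
```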

ghost avatar Feb 17 '21 01:02 ghost

@zxmanxz Can you let me know if the multi-GPU branch above works for you?

ghost avatar Feb 18 '21 19:02 ghost

Yes, I'll try the multi-GPU branch later.

zxmanxz avatar Feb 19 '21 09:02 zxmanxz

@zxmanxz When will you be able to test the multi-GPU training code? https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/664_multi_gpu_training

ghost avatar Feb 28 '21 06:02 ghost

@blue-fish In the above-mentioned code, we get another error, at line 110 in synthesizer/train.py: `torch.nn.modules.module.ModuleAttributeError: 'DataParallel' object has no attribute 'load'`

chayan-agrawal avatar Mar 02 '21 11:03 chayan-agrawal

@chayan-agrawal Which version of torch are you using? I'm using torch==1.7.1 and don't get that error.

ghost avatar Mar 02 '21 15:03 ghost

@blue-fish I am also using torch==1.7.1. If `model.module.load` is used instead of `model.load`, it works on a single GPU. The other GPUs are not in use.

chayan-agrawal avatar Mar 02 '21 16:03 chayan-agrawal

@chayan-agrawal Thanks for suggesting that change. I have updated the code with your suggestion: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/a90d2340c0d0c416bcec4da089a2c9ce3e4ed7d4

DataParallel also works in CPU and single-GPU environments, so it is not necessary to check for multiple GPUs. It would be nice to get feedback on whether it works with multiple GPUs.
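To illustrate the `model.module.load` fix: torch.nn.DataParallel only forwards `forward()`, not custom methods, so those must be reached through `.module`. A minimal sketch with a toy module (not the repo's actual Tacotron):

```python
import torch
import torch.nn as nn

class Toy(nn.Module):  # toy stand-in for the synthesizer's Tacotron
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

    def load(self, path):  # custom method, like Tacotron.load in the repo
        self.load_state_dict(torch.load(path))

model = Toy()
torch.save(model.state_dict(), "toy.pt")

wrapped = nn.DataParallel(model)
# wrapped.load("toy.pt")       # raises: 'DataParallel' object has no attribute 'load'
wrapped.module.load("toy.pt")  # OK: custom methods live on the inner module
```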

ghost avatar Mar 02 '21 16:03 ghost

Before we even think of merging this code, we'll need to consider these issues:

  • Parallel training on multiple GPUs introduces a lot of overhead, slowing training: https://github.com/as-ideas/ForwardTacotron/issues/9
  • Using DataParallel may result in incorrect gradient computation on multiple GPUs: https://github.com/pytorch/pytorch/issues/15716 (might be fixed in torch >= 1.4.0, but the issue is still open)

ghost avatar Mar 02 '21 18:03 ghost

@blue-fish I have multiple GPUs on my system, but it is only using a single GPU. Any help on how I can use multiple GPUs?

chayan-agrawal avatar Mar 02 '21 18:03 chayan-agrawal

@chayan-agrawal I don't have a multi-GPU environment to troubleshoot with. All I can suggest is to ensure that Python sees both of your GPUs. For example, add this to the beginning of synthesizer_train.py to have it use the first and second GPUs.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # must be set before the first CUDA call
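To confirm that both GPUs are actually visible to torch after setting the variable, a quick check (my own suggestion, not from the repo) is:

```python
import torch
print(torch.__version__)
print(torch.cuda.device_count())  # should print 2 when both GPUs are visible
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```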

ghost avatar Mar 04 '21 18:03 ghost

You might have to downgrade to torch==1.4.0 to get DataParallel to work.

  • https://github.com/fatchord/WaveRNN/issues/189
  • https://github.com/huggingface/transformers/issues/3936

ghost avatar Mar 07 '21 06:03 ghost

You might have to downgrade to torch==1.4.0 to get DataParallel to work.

Hey, I'm having issues with this as well. I feel it's just something stupid-simple I'm overlooking, and an easy fix if you were able to get it working. :)

The original repo worked fine with one GPU on torch 1.6.0, before I installed a second GPU to speed up training. I am now using torch 1.7.1, which you said was working for you. Torch 1.4.0 does not actually run.

I have tried both the CorentinJ repo and your blue-fish fork (the branch you suggested for multi-GPU support). The main repo does not run with torch 1.4.0, 1.6.0, or 1.7.1 unless I remove the second GPU from the system. Your branch does work with the environment variable override added and torch 1.7.1, but it does not actually utilize the second GPU.

Is there a requirements.txt you can provide for testing? Perhaps I have some other library installed that breaks this functionality? I'm grasping at straws at this point; I have been working at it for days to no avail, and didn't want to post here until I felt I needed assistance.

Kind regards.

Synergyst avatar Jul 02 '21 05:07 Synergyst

I am having the exact same problem. Has anyone solved it somehow?

fede-astolfi avatar Sep 03 '21 16:09 fede-astolfi

You might have to downgrade to torch==1.4.0 to get DataParallel to work.

As Synergyst mentioned, torch 1.4 doesn't work. The error I got is: `AttributeError: 'PosixPath' object has no attribute 'tell'`. I googled it and found that to solve it I have to use torch 1.6 or above. Awkward face...

linan06kuaishou avatar Nov 30 '21 09:11 linan06kuaishou