
Not utilizing full GPU during training

TrycsPublic opened this issue 2 years ago · 2 comments

I have an RTX 6000. During training, the GPU is not at 100% utilization; it bounces between 0% and 25%, and VRAM usage is only about 5 GB of the available 50.1 GB.

Any suggestions?

Server is provided by: https://lambdalabs.com/cloud/dashboard/instances

TrycsPublic · May 25 '22 23:05

Something else is bottlenecking performance. Since it's a cloud server, the first thing I would suspect is the filesystem. Make sure the dataset is hosted on fast, local storage such as an SSD; if it's on network-based storage, try copying it to /tmp or /dev/shm if it fits. To increase VRAM utilization, increase the batch size, which usually also helps the model converge faster.
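For reference, here is a minimal sketch of the `DataLoader` settings that usually help when the GPU sits idle waiting on data. It assumes a standard PyTorch training setup; the dataset and paths below are placeholders, not this repo's actual classes:

```python
import shutil  # only needed if you copy the dataset

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the project's real Dataset class.
dataset = TensorDataset(torch.randn(1024, 80), torch.randn(1024, 80))

# If the dataset lives on network storage, copy it node-local first, e.g.:
# shutil.copytree("/mnt/network/dataset", "/dev/shm/dataset")

loader = DataLoader(
    dataset,
    batch_size=64,            # raise until VRAM is well utilized
    num_workers=8,            # parallel workers hide disk/CPU preprocessing
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive across epochs
)
```

The `num_workers` and `pin_memory` settings address the loader-side stall, while `batch_size` is what actually raises VRAM usage.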

raccoonML · May 26 '22 07:05

  1. Is a higher batch size always better in this case? I've seen a paper where a higher batch size led to an increased error rate, so bigger isn't always better?
  2. It's SSD-based storage, and there's still not much improvement.

TrycsPublic · May 29 '22 20:05

I also trained the model on an RTX A6000 and found that the GPU is utilized for only a small part of the training time; most of the time, utilization is at 0%. It seems that other operations take a lot of time and may be the bottleneck of the training process.
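A quick way to confirm this is to time the wait on the `DataLoader` separately from the GPU step. This is only a sketch with placeholder model and data, assuming an ordinary PyTorch training loop:

```python
import time

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholders standing in for the repo's real model, loss, and data.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(80, 80).to(device)
loss_fn = nn.MSELoss()
opt = torch.optim.Adam(model.parameters())
loader = DataLoader(
    TensorDataset(torch.randn(1024, 80), torch.randn(1024, 80)),
    batch_size=64,
)

data_time = gpu_time = 0.0
end = time.perf_counter()
for x, y in loader:
    data_time += time.perf_counter() - end  # time blocked on the loader

    t0 = time.perf_counter()
    x, y = x.to(device), y.to(device)
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if device == "cuda":
        torch.cuda.synchronize()            # flush async GPU work for accurate timing
    gpu_time += time.perf_counter() - t0    # time actually computing

    end = time.perf_counter()

print(f"loader wait: {data_time:.1f}s   compute: {gpu_time:.1f}s")
```

If the loader wait dominates, the fix is on the data side (storage, workers, preprocessing), not the GPU.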

huskyachao · Dec 01 '22 12:12