melgan
Batch size = 16?
Hi, thank you for your nice implementation. I have a question about the batch size selection. It looks like the network is small enough for a bigger batch size, for example 32 or 64, on a GTX 1080Ti. Is the batch size of 16 a kind of regularization? Another question is related to the G/D updates: for your generated samples, are you using 1:1? Thanks.
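For context, by "bigger batch size" I just mean raising the value handed to the data loader. A purely hypothetical sketch with a dummy dataset (the actual repo reads this from its config, so the names below are illustrative only):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the repo's mel/audio dataset, just to make the sketch runnable.
train_dataset = TensorDataset(torch.randn(256, 80, 64), torch.randn(256, 1, 16384))

train_loader = DataLoader(
    train_dataset,
    batch_size=16,      # e.g. 32 or 64 might fit on a GTX 1080Ti for a network this small
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)
```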
Hi, @chapter544
- The original authors noted that selecting the batch size is important, so I didn’t increase it even though I could use a bigger one. I’m currently testing whether a larger batch size is harmful or not.
- I used 1:1.
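To make the 1:1 schedule concrete, here is a minimal, hedged sketch of one training step. It is not this repo's exact trainer (the real MelGAN discriminator is multi-scale and also uses a feature-matching loss); it just illustrates one discriminator update followed by one generator update per batch, with a simple hinge loss and a discriminator assumed to return a score tensor:

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, opt_g, opt_d, mel, audio):
    """One discriminator update followed by one generator update (1:1)."""
    # --- discriminator step (hinge loss) ---
    fake = generator(mel).detach()          # stop gradients into G
    d_loss = (F.relu(1.0 - discriminator(audio)).mean()
              + F.relu(1.0 + discriminator(fake)).mean())
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # --- generator step ---
    fake = generator(mel)
    g_loss = -discriminator(fake).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```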
@seungwonpark Thank you for the information. I'll leave this issue open a little longer so that we can confirm the batch size selection.
Training loss curves on an internal multi-speaker dataset, with batch size 16 (orange) and 64 (blue). I can't tell whether 64 is okay for now...
@seungwonpark Sorry, but I couldn't find a note in the original paper saying that the batch size was carefully chosen. Also, I've been thinking that if we use a multi-speaker training scheme with a larger batch, each training batch can include more modes, which might help training (also discussed in https://arxiv.org/abs/1809.11096, though I'm not sure since it may depend on the domain).
@wade3han Actually it wasn't noted in the paper, but there was a TeX comment like:
%Batch size was an important hyper-parameter that required tuning to find optimal audio fidelity and faster training time. We used batch size 16 for all experiments.
You can see the LaTeX source of the original paper at: https://arxiv.org/format/1910.06711
I tried batch sizes (32, 128, 256) with a configuration similar to this repo's; batch 32 was better than the others at 220k training steps (32 > 128 > 256). I haven't tried batch size = 16 yet.
Is it obvious that MelGAN works best at batch size 16? I was reminded of the authors' comment, and it now sounds like they recognized a trade-off between audio fidelity and training speed; so if we take more time to train the model with a larger batch size, maybe we can get higher audio quality.
I just experimentally found that batch 16 was best with learning rates of 4e-4 for the discriminator and 1e-4 for the generator (the GAN training technique called TTUR was beneficial), with the other hyperparameters fixed.
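For anyone who wants to reproduce that TTUR setting, a minimal self-contained sketch using plain Adam optimizers (the dummy modules and the betas are just placeholders, not necessarily what this repo uses):

```python
import torch
import torch.nn as nn

# Dummy G/D modules just to make the snippet runnable.
generator = nn.Conv1d(80, 1, kernel_size=7, padding=3)
discriminator = nn.Conv1d(1, 1, kernel_size=15, padding=7)

# TTUR: the discriminator gets a larger learning rate (4e-4) than the generator (1e-4).
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.9))
```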