Francesco
Good question, I run into the same problem. Haven't solved it yet, because we mostly use 1 GPU per process (democratically sharing them :D). This is probably better sought for...
I have not experimented with this yet, but in general it should be hard but doable. The results will probably vary. You can also experiment with adding some speaker embeddings (concatenating along...
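To make the concatenation idea concrete, here is a minimal sketch assuming a TF 2.x setup; the sizes and names (`n_speakers`, `spk_dim`, `encoder_out`) are invented for illustration and are not code from the repo:

```python
import tensorflow as tf

# Illustrative sizes, not taken from any config
n_speakers, spk_dim, d_model = 10, 64, 256
speaker_table = tf.keras.layers.Embedding(n_speakers, spk_dim)

encoder_out = tf.random.normal((2, 50, d_model))      # (batch, time, channels)
speaker_ids = tf.constant([3, 7])                     # one speaker id per sample

spk = speaker_table(speaker_ids)                      # (batch, spk_dim)
spk = tf.repeat(spk[:, tf.newaxis, :], tf.shape(encoder_out)[1], axis=1)
conditioned = tf.concat([encoder_out, spk], axis=-1)  # concatenate along the channel axis
print(conditioned.shape)                              # (2, 50, d_model + spk_dim)
```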
Sorry, I don't understand the difference from the previous question. You can do the following:
- train a model from scratch and see what the results look like (very likely to...
This is something I have always experienced when training forward models. Durations probably overfit easily. In practice it has not been an issue, although it makes it harder to gauge the status of training...
Hi, not currently. It is something I'm working on.
Hi,
1. you can find conv layers replacing dense layers after attention in FastSpeech, for example (see the sketch below);
2. we found that this helps with building attention, although with more recent improvements it might...
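For point 1, here is a minimal sketch of a FastSpeech-style block where the position-wise dense layers after attention are replaced by 1D convolutions. TF 2.x is assumed and the kernel size and dimensions are illustrative, not the repo's actual values:

```python
import tensorflow as tf

class ConvFFN(tf.keras.layers.Layer):
    """Position-wise feed-forward block using Conv1D instead of Dense layers."""

    def __init__(self, d_model=256, d_hidden=1024, kernel_size=9, dropout=0.1):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv1D(d_hidden, kernel_size, padding="same", activation="relu")
        self.conv2 = tf.keras.layers.Conv1D(d_model, kernel_size, padding="same")
        self.dropout = tf.keras.layers.Dropout(dropout)
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, x, training=False):
        # residual connection around the conv stack, as in the Transformer FFN
        y = self.conv2(self.conv1(x))
        y = self.dropout(y, training=training)
        return self.norm(x + y)

# Example: a batch of 2 sequences, 100 timesteps, model dim 256.
out = ConvFFN()(tf.random.normal((2, 100, 256)))
print(out.shape)  # (2, 100, 256)
```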
Hi, I trained the autoregressive models for about 600K steps (some less) and around the same for the forward models. This should take, if I remember correctly, about 2-3 days...
Hi, batch sizes are dynamic. Samples are bucketed by duration, so the batch size depends on how many samples there are in each bin. Max sizes are specified in the...
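As a rough illustration of the bucketing idea (not the repo's actual pipeline), here is a `tf.data` sketch where samples are binned by mel length and each bin gets its own max batch size; the boundaries and batch sizes below are made up:

```python
import numpy as np
import tensorflow as tf

# Toy generator of (mel, text) pairs with variable-length mel spectrograms.
def gen():
    rng = np.random.default_rng(0)
    for _ in range(100):
        n_frames = int(rng.integers(50, 900))
        mel = rng.standard_normal((n_frames, 80)).astype(np.float32)
        text = rng.integers(1, 50, size=int(rng.integers(5, 60))).astype(np.int32)
        yield mel, text

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(None, 80), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.int32),
    ),
)

# Buckets split by mel length; each bucket has its own max batch size so long
# samples form smaller batches and short samples form larger ones.
boundaries = [200, 400, 600, 800]
batch_sizes = [64, 42, 32, 25, 16]   # len(boundaries) + 1 entries

bucketed = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=lambda mel, text: tf.shape(mel)[0],
        bucket_boundaries=boundaries,
        bucket_batch_sizes=batch_sizes,
        padded_shapes=([None, 80], [None]),
    )
)

for mel_batch, text_batch in bucketed.take(3):
    print(mel_batch.shape, text_batch.shape)
```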
Maybe try 'espeak-ng' instead of 'espeakng'. Or visit http://espeak.sourceforge.net/
Hi, one quick thing you can try is switching from GPU to CPU by simply removing those lines. Unless you're predicting on a batch, batch size won't make any difference....
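If it helps, here are two generic ways to force TensorFlow onto the CPU (a sketch assuming TF 2.x; which lines to remove depends on your script):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # hide all GPUs before TensorFlow initializes

import tensorflow as tf
tf.config.set_visible_devices([], "GPU")    # alternative: mask GPUs via tf.config
print(tf.config.get_visible_devices())      # should list only CPU devices
```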