
Training vs tensorboard metrics

smlkdev opened this issue 1 year ago · 27 comments

Will my training yield better results over time? Currently, the training took about 9 hours. I have 1500 wav samples, with a total audio length of approximately 2 hours.

Screenshot 2024-11-08 at 11 53 27

What other metrics should I pay attention to in TensorBoard?

smlkdev avatar Nov 08 '24 10:11 smlkdev

Update after ~34h: a little improvement is visible, but I'm not sure whether I should keep training longer, given the flattening.

Screenshot 2024-11-09 at 10 41 25 · Screenshot 2024-11-09 at 10 41 41

smlkdev avatar Nov 09 '24 09:11 smlkdev

We usually look at g/total, and from your graph, it seems to be decreasing pretty well. But I’m not sure if 2 hours of training data is enough; I initially used around 8 to 10 hours for training.

jeremy110 avatar Nov 09 '24 13:11 jeremy110

> We usually look at g/total, and from your graph, it seems to be decreasing pretty well. But I’m not sure if 2 hours of training data is enough; I initially used around 8 to 10 hours for training.

@jeremy110 Thank you for your response! I’m honestly a bit hooked on watching the progress as it keeps going down, so I can’t seem to stop checking in :-)

Currently at 68 hours.

Screenshot 2024-11-10 at 22 05 29

I’m planning to create an 8-10 hour audio dataset for the next training session. Could you suggest what kind of text data I should gather for it? So far, I’ve used random articles and some ChatGPT-generated data, but I’ve heard that people sometimes read books, for example. Is there perhaps a dataset available with quality English sentences that covers a variety of language phenomena? I tried to find one but came up with no results.

smlkdev avatar Nov 10 '24 21:11 smlkdev

@smlkdev Basically, this training can be kept short since it’s just a fine-tuning session; no need to make it too long. Here’s my previous TensorBoard log for your reference (https://github.com/myshell-ai/MeloTTS/issues/120#issuecomment-2105728981).

I haven’t specifically researched text types. My own dataset was professionally recorded, with sentences that resemble reading books. I’m not very familiar with English datasets—are you planning to train in English?

jeremy110 avatar Nov 11 '24 01:11 jeremy110

This is my first attempt at ML/training/voice cloning, and I decided to use English. I briefly read the Thai thread, and it was way too complex for me to start with.

Your training was 32 hours long, and to me (I'm not an expert) the inferred voice matched the original :) That's really nice. Is it a voice that had 8-10 hours of audio, as you mentioned earlier?

smlkdev avatar Nov 11 '24 10:11 smlkdev

Yes, that's correct. I tried both single-speaker and multi-speaker models, and the total duration is around 8-10 hours.

If this is your first time getting into it, I recommend you try F5-TTS. There are a lot of people in the forums who have trained their own models, and some even wrote a Gradio interface, which is very convenient.

jeremy110 avatar Nov 11 '24 12:11 jeremy110

@jeremy110 thank you for your responses.

Is F5-TTS better than MeloTTS in terms of quality?

I just realized that my cloned MeloTTS voice doesn’t add breaks between sentences. I have to add them manually: splitting the text into sentences, generating each part, and then merging everything back together with pauses added. This can be automated, of course, but it's still a bit of work. (I was focusing on single sentences before, and I liked the quality.)
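In case it helps anyone else, this is roughly the sketch I mean. It assumes the melo.api TTS interface; the sample text, pause length, speaker choice, and file names are just placeholders (loading your own fine-tuned checkpoint is up to you):

```python
# Sketch of the manual workaround: split text into sentences, synthesize each
# one separately, then join the clips with a short silence in between.
import re
import numpy as np
import soundfile as sf
from melo.api import TTS

# Load a model; language/device are placeholders (point this at your own checkpoint).
model = TTS(language="EN", device="cpu")
speaker_id = list(model.hps.data.spk2id.values())[0]

text = "First sentence. Second sentence! And a third one?"
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Synthesize each sentence to its own temporary wav file.
paths = []
for i, sentence in enumerate(sentences):
    path = f"tmp_{i}.wav"
    model.tts_to_file(sentence, speaker_id, path)
    paths.append(path)

# Merge the clips, inserting 400 ms of silence between sentences.
chunks, sr = [], None
for path in paths:
    audio, sr = sf.read(path, dtype="float32")
    chunks.extend([audio, np.zeros(int(0.4 * sr), dtype=np.float32)])

sf.write("merged.wav", np.concatenate(chunks[:-1]), sr)  # drop the trailing pause
```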

smlkdev avatar Nov 12 '24 19:11 smlkdev

In terms of quality, I think F5-TTS is quite good. You can try it out on the Huggingface demo.

The pauses within sentences mainly depend on your commas (","). The program adds a space after punctuation to create a pause. However, if the audio files you trained on have very little silence before and after the speech, the generated audio will also have little silence. Of course, you can add the pauses manually, but you could also address it by adjusting the training data.
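If you go the training-data route, a rough sketch like the one below could pad each clip with a bit of silence before preprocessing. The 150 ms value and the wavs/ folder layout are just assumptions:

```python
# Rough sketch: pad every training wav with leading/trailing silence so the
# model learns to leave pauses at sentence boundaries.
import glob
import os
import numpy as np
import soundfile as sf

PAD_SECONDS = 0.15                      # assumed padding length
os.makedirs("wavs_padded", exist_ok=True)

for path in glob.glob("wavs/*.wav"):
    audio, sr = sf.read(path, dtype="float32")
    pad = np.zeros(int(PAD_SECONDS * sr), dtype=np.float32)
    sf.write(os.path.join("wavs_padded", os.path.basename(path)),
             np.concatenate([pad, audio, pad]), sr)
```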

jeremy110 avatar Nov 13 '24 01:11 jeremy110

@smlkdev I am training the melotts model with sentiment data. But I couldn't get the tensorboard graphs to work. Can you share your sample code?

kadirnar avatar Nov 19 '24 09:11 kadirnar

These errors are written in the train.log file. Training is still ongoing. Are they important?

2024-11-19 09:22:10,339	example	ERROR	enc_p.language_emb.weight is not in the checkpoint
2024-11-19 09:22:10,340	example	ERROR	emb_g.weight is not in the checkpoint

kadirnar avatar Nov 19 '24 09:11 kadirnar

> @smlkdev I am training the melotts model with sentiment data. But I couldn't get the tensorboard graphs to work. Can you share your sample code?

I used the simplest cmd possible:

tensorboard --logdir PATH, where PATH is the logs folder inside ...MeloTTS/melo/logs/checkpoint_name (i.e., the folder containing the checkpoints).

smlkdev avatar Nov 19 '24 11:11 smlkdev

@jeremy110 Hello, I would like to inquire about the data preparation process when training on multiple speakers. Is it necessary for each speaker to have a comparable amount of data? For instance, if Speaker A has 10 hours of audio and Speaker B only has 1 hour, is it possible to create a good model, or does Speaker B also require approximately 10 hours of audio? Thank you

manhcuong17072002 avatar Nov 21 '24 04:11 manhcuong17072002

@manhcuong17072002 Hello~ In my training, some speakers had 1 or 2 hours of audio, while others had 30 minutes, and in the end, there were about 10 hours of total data. I was able to train a decent model, but for speakers with less data, their pronunciation wasn't as accurate.

jeremy110 avatar Nov 21 '24 05:11 jeremy110

@jeremy110 Oh, if that's the case, that's wonderful. Collecting data and training the model will become much easier with your idea. So, when training, you must have used many speaker IDs, right? And do you find their quality sufficient for deployment in a real-world environment? I'm really glad to hear your helpful feedback. Thank you very much!

manhcuong17072002 avatar Nov 21 '24 13:11 manhcuong17072002

@manhcuong17072002

Yes, there are about 15 speakers. Of course, if you have enough people, you can continue to increase the number. After 10 hours, the voice quality is quite close, but if you want better prosody, you might need more speakers and hours.

Compared with the TTS systems I've heard, the voice quality is above average, but when it comes to deployment, you also need to consider inference time. For that, MeloTTS is quite fast.
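For anyone setting up a multi-speaker run: the training list just needs a distinct speaker name per line. As far as I remember, the preprocessing expects something roughly like the pipe-separated format below (path, speaker, language, text), but double-check this against your version of the preprocess script:

```
# metadata.list (rough example -- verify the exact column order against your
# MeloTTS preprocess script before training)
data/speaker_a/0001.wav|speaker_a|EN|The quick brown fox jumps over the lazy dog.
data/speaker_a/0002.wav|speaker_a|EN|She sells seashells by the seashore.
data/speaker_b/0001.wav|speaker_b|EN|How much wood would a woodchuck chuck?
```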

jeremy110 avatar Nov 21 '24 14:11 jeremy110

@jeremy110 Thank you for the incredibly helpful information. Let me summarize a few points:

  • Training data: 8 to 10 hours of audio is sufficient to train the MeloTTS model, and more data is always welcome.
  • Number of speakers: 15; 30 minutes to 2 hours of data per speaker yields good results, and more data per speaker leads to better results.
  • Deployment speed: MeloTTS is relatively fast to deploy.

However, I've experimented with various TTS models and noticed that if the text isn't broken down into smaller chunks, the generated speech quality degrades towards the end of longer passages. Have you tested this with MeloTTS? If so, could you share your experimental process? Thank you so much.

manhcuong17072002 avatar Nov 21 '24 14:11 manhcuong17072002

@manhcuong17072002 You're welcome, your conclusion is correct.

Normally, during training, long audio files are avoided to prevent GPU OOM (Out of Memory) issues. Therefore, during inference, punctuation marks are typically used to segment the text, ensuring that each sentence is closer to the length used during training for better performance. MeloTTS performs this segmentation based on punctuation during inference, and then concatenates the individual audio files after synthesis.
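Just to illustrate the idea (this is not MeloTTS's actual splitter), you could split on punctuation and then pack the pieces into chunks of roughly training-clip length before synthesis:

```python
# Illustration only: break long text on punctuation, then pack the pieces into
# chunks of roughly training-clip length so quality doesn't degrade on long
# passages. The 120-character limit is an arbitrary placeholder.
import re

def chunk_text(text, max_chars=120):
    pieces = [p.strip() for p in re.split(r"(?<=[,.!?;])\s+", text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("A long paragraph. With several sentences, some of them quite long, separated by punctuation."))
```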

jeremy110 avatar Nov 22 '24 01:11 jeremy110

@jeremy110 I'm so sorry, but I suddenly have a question about training on a multi-speaker dataset. Is it possible for Speaker A to pronounce words that appear in the other speakers' data but not in A's? If not, dividing the dataset among multiple speakers would be pointless, and the model would not be able to cover the entire vocabulary of a language. Have you tried this before, and what are your thoughts on it? Thank you.

manhcuong17072002 avatar Nov 22 '24 09:11 manhcuong17072002

@manhcuong17072002 If we consider 30 minutes of audio, assuming each word takes about 0.3 seconds, there would be around 5000–6000 words. These words would then be converted into phoneme format, meaning they would be broken down into their phonetic components for training. With 6000 words, the model would learn most of the phonemes. However, when a new word is encountered, it will be broken down into the phonemes it has already learned. I haven't done rigorous testing, but in my case, the model is able to produce similar sounds.
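If you want to sanity-check that on your own transcripts, you can run them through a grapheme-to-phoneme converter and count the distinct phonemes. A quick sketch using the g2p_en package (an assumption on my part; any G2P tool would do, and the transcripts.txt file is a placeholder):

```python
# Sketch: estimate phoneme coverage of a transcript file, one sentence per line.
from collections import Counter
from g2p_en import G2p

g2p = G2p()
counts = Counter()

with open("transcripts.txt", encoding="utf-8") as f:
    for line in f:
        # g2p_en returns a list of ARPAbet phonemes plus spaces/punctuation,
        # so keep only the alphabetic tokens.
        phonemes = [p for p in g2p(line.strip()) if p.strip() and p[0].isalpha()]
        counts.update(phonemes)

print(f"{len(counts)} distinct phonemes seen")
print(counts.most_common(10))
```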

jeremy110 avatar Nov 22 '24 11:11 jeremy110

@jeremy110 Thanks for your useful information

manhcuong17072002 avatar Nov 22 '24 14:11 manhcuong17072002

@jeremy110 Even though the latest messages weren't addressed directly to me, I want to thank you as well; you're giving me a fresh perspective on how to look at the dataset.

  1. Following your suggestion, I'm currently testing how F5-TTS works and what results it produces. Here are my charts (same dataset as in this thread). Should I stop fine-tuning now that the LR has dropped so low, or can it still improve noticeably?

Screenshot 2024-11-22 at 18 36 31

  2. And how should I choose the right/best checkpoint? Is it a matter of listening to the output, or should I rely on the loss from the chart, i.e., lower = better? Of course, it's hard to pick the exact checkpoint when I'm saving every 2.5k steps.

  3. If I wanted to create a short dataset from scratch, how many times should each word appear in the audio recordings for it to be meaningful? Would just once be enough? I imagine that if I were creating a dataset to perform exceptionally well in a specific field, like "cooking," I could build one using words more frequently used in that domain, such as "flour," "knife," or "tomato." I'm also guessing that the sentences in the dataset should include the most commonly used English words along with niche-specific ones, so my model performs as well as possible in that area (though it might struggle in a niche like "aviation"). A rough sketch of the kind of coverage check I have in mind is below.
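Here is the kind of quick check I mean for the domain-words idea (the word list and file name are just placeholders):

```python
# Quick sketch: count how often the domain words you care about actually
# appear in a candidate set of sentences, one sentence per line.
import re
from collections import Counter

domain_words = {"flour", "knife", "tomato", "oven", "simmer"}

counts = Counter()
with open("candidate_sentences.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(w for w in re.findall(r"[a-z']+", line.lower()) if w in domain_words)

for word in sorted(domain_words):
    print(f"{word}: {counts[word]}")
```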

smlkdev avatar Nov 22 '24 17:11 smlkdev

@smlkdev

  1. Basically, fine-tuning usually requires fewer than 10 epochs; the settings others have shared can be used as a reference.
  2. Typically, a model trained for 10 epochs is already good for use, and the last checkpoint is usually the best.
  3. If you're training a new language, a small dataset is not ideal; it still needs to be of a certain length to achieve good learning.

I'm not sure whether you're training a new language. If you are, I tested small datasets and found that around 20–40 hours of audio is necessary, with each clip lasting 2–10 seconds. It can be either multiple speakers or a single speaker, but zero-shot performance is poor with a single speaker. I also tested with 350 hours of audio and about 300 speakers, and the zero-shot performance was excellent: using my own voice as a reference, the synthesized voice was very close to my own. Finally, for F5-TTS to generate good audio, the reference clip is crucial; in my tests, 6–10 second clips worked best. So, for your third question, you can include those specific words in your reference, and the inference quality for that domain will improve.

jeremy110 avatar Nov 23 '24 02:11 jeremy110

@jeremy110

https://huggingface.co/datasets/reach-vb/jenny_tts_dataset I am training on this dataset. I set 10 epochs in the config settings, but the train.log file shows epoch 30 and it is still training. What should I fix?

2024-11-24 06:22:09,477	example	INFO	Train Epoch: 31 [53%]
2024-11-24 06:22:09,477	example	INFO	[2.2495381832122803, 3.04087495803833, 9.244926452636719, 18.031190872192383, 1.9427911043167114, 2.0941860675811768, 100200, 0.0002988770366855993]
2024-11-24 06:23:10,591	example	INFO	Train Epoch: 31 [59%]
2024-11-24 06:23:10,592	example	INFO	[2.145620107650757, 3.066821336746216, 9.333406448364258, 19.36675453186035, 2.052659511566162, 2.4836974143981934, 100400, 0.0002988770366855993]
2024-11-24 06:24:11,103	example	INFO	Train Epoch: 31 [65%]
2024-11-24 06:24:11,104	example	INFO	[2.5389487743377686, 2.33595871925354, 7.182312488555908, 19.055206298828125, 1.9395025968551636, 1.7028437852859497, 100600, 0.0002988770366855993]

What do you think of the graphs? Can you interpret them?

[image: training graphs]

kadirnar avatar Nov 24 '24 06:11 kadirnar

@kadirnar hello~ Typically, the parameters would be read from the YAML file, but it's okay. You can stop it at the appropriate time. I’ve kind of forgotten how many steps I originally set for training, but you can refer to my TensorBoard.

I’m not very experienced in the TTS field, but for MeloTTS, it mainly uses loss functions common in GANs, involving the Discriminator and Generator. Personally, I check the loss/g/total and loss/g/mel to assess whether the training results are as expected.

From your graphs, since there is no loss/g/total, I cannot judge the result. From my own training, the values typically range from 45 to 60, depending on your dataset.
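If the TensorBoard web UI keeps giving you trouble, you can also read the scalars straight from the event files. A minimal sketch; the tag names are the ones from my logs, and the log directory is a placeholder, so check ea.Tags() if yours differ:

```python
# Minimal sketch: read loss curves directly from the TensorBoard event files
# instead of the web UI.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

logdir = "melo/logs/your_checkpoint_name"   # placeholder path to the event files
ea = EventAccumulator(logdir)
ea.Reload()

print(ea.Tags()["scalars"])                 # list every scalar tag that was logged

# Print the last few points of the generator total loss (tag name from my logs).
for event in ea.Scalars("loss/g/total")[-5:]:
    print(event.step, event.value)
```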

jeremy110 avatar Nov 24 '24 13:11 jeremy110

@jeremy110 What do you think about combining the audio files in a small dataset to create a larger one? For example, say we initially have two audio clips, A and B. We concatenate them as follows: Audio A + Audio B = Audio C, which gives us three clips: A, B, and C. Do you think this would significantly affect the training results compared to the original small dataset? Thank you.

manhcuong17072002 avatar Nov 25 '24 09:11 manhcuong17072002

@manhcuong17072002

This approach can indeed enhance the data and may provide a slight improvement, but several points need to be considered. Since MeloTTS uses BERT to extract feature vectors, if we randomly concatenate the text of two audio files and then extract feature vectors, can it still effectively represent the prosody of the text? You could try it out and see how it performs.

Additionally, you can refer to the Emilia-Dataset (https://huggingface.co/datasets/amphion/Emilia-Dataset) used by F5-TTS. It has a process for generating the final training data, which you might consider using as a method to collect data.

jeremy110 avatar Nov 25 '24 11:11 jeremy110

How do I get the TensorBoard metrics once I've started training?

yukiarimo avatar Jan 21 '25 21:01 yukiarimo