information on 24khz model

Open acul3 opened this issue 1 year ago • 16 comments

hey @adelacvg, thanks for sharing the code.

After reading the code, I want to ask you a few questions about the new 24k model, if you don't mind:

  1. What makes this model different from the previous one (https://huggingface.co/adelacvg/Detail/tree/main) besides the sample rate?

  2. Did you not use a speech encoder in the 24k model? (I see there are speech encoders in utils, like HuBERT, Whisper, etc., but I think those are from the previous model.) Do you also still use ContentVec768L12.py?

  3. I see train_target in (https://github.com/adelacvg/detail_tts/blob/master/vqvae/configs/config_24k.json), so I assume training has multiple stages. If I want to train from scratch, do I need to change it, say to "gpt" first, then flowvae and diff? (Is that correct?)

  4. If I want to train from scratch, I just remove (https://github.com/adelacvg/detail_tts/blob/7e2466855f401637fe94f39c185121990f679f31/train.py#L461), right?

Sorry for the many questions; thanks in advance.

acul3 avatar Aug 19 '24 15:08 acul3

  1. There are many differences, such as the method of adding speaker information in diffusion, the approach to normalization, and which latent feature to use, among others. All these changes were made to create a more stable and hi-fi model.
  2. No SSL features like Whisper or ContentVec were used; those files were merely copied over from other projects.
  3. Yes, the training sequence is "flowvae" -> "vqvae" -> "gpt" -> "diff". The reason for adding step-by-step training is that it allows for better gradient accumulation, which is crucial for training the VQ-VAE and GPT.
  4. Yes, to train from scratch, you just need to remove the load code. Please make sure to pre-process the data into a text-audiopath pair format in advance. The datasets part is written very simply, so it should be easy to modify.
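As a minimal sketch of the text-audiopath pair pre-processing mentioned in point 4 (the `|` separator and field order are my assumptions; check the repo's dataset code for the exact format it expects):

```python
# Hypothetical sketch: write a text-audiopath pair list for training.
# The "audiopath|text" line format is an assumption; adjust it to match
# the dataset code in the repo.
from pathlib import Path

def write_pair_list(pairs, out_path):
    """pairs: iterable of (transcript, audio_path) tuples."""
    lines = [f"{audio}|{text}" for text, audio in pairs]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)

n = write_pair_list(
    [("hello world", "wavs/0001.wav"), ("selamat pagi", "wavs/0002.wav")],
    "train_list.txt",
)
```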

adelacvg avatar Aug 19 '24 15:08 adelacvg

Thank you for your quick answer @adelacvg.

One last question, if you don't mind: for point 3, is there a specific config (target layers, dimensions, etc.), especially for flowvae? I see there are specific configs for gpt and diff.

Thanks once again.

I am planning to reproduce your results, but with multilingual data (English and Malay), so I need to train a BPE tokenizer first.

acul3 avatar Aug 19 '24 16:08 acul3

For the vqvae and flowvae specific configs, check the vaegan part of config_24k.json. For multilingual training, you can use voice_tokenizer.py to train your custom BPE tokenizer.
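For readers unfamiliar with BPE training, here is a toy, self-contained illustration of what a BPE trainer does (this is not the repo's voice_tokenizer.py, which presumably wraps a real tokenizer library): it repeatedly merges the most frequent adjacent symbol pair until a merge budget is spent.

```python
# Toy BPE training: start from character symbols and greedily merge the
# most frequent adjacent pair, recording each merge rule in order.
from collections import Counter

def train_bpe(corpus, num_merges):
    words = [list(w) for w in corpus]  # character-level symbols per word
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

merges = train_bpe(["lower", "lowest", "low", "low"], num_merges=3)
```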

adelacvg avatar Aug 20 '24 02:08 adelacvg

Just finished 50% of the flowvae steps (13M samples, 300k of 600k steps).

For the next training stage (vqvae), I need to load the flowvae model .pt and then continue with the next train_target, right? @adelacvg

Here is a sample from flowvae: https://github.com/user-attachments/assets/a0b5151e-e13a-4f5f-86bc-e38edb4ead2a

acul3 avatar Aug 20 '24 08:08 acul3

Yes, just use the results from the previous step for the next step of the training.

adelacvg avatar Aug 21 '24 18:08 adelacvg

Hmm, it seems my vqvae training loss is stuck; after 2 days it stays the same, and the samples are also not intelligible compared to the ground truth.

acul3 avatar Aug 24 '24 18:08 acul3

It's normal; VQ-VAE only needs to capture the semantics approximately.

adelacvg avatar Aug 28 '24 15:08 adelacvg

OK, I am at the gpt stage now. After training for 2 days, the result sounds close to the ground truth, but it's not quite there yet.

ground truth: https://github.com/user-attachments/assets/5c27cf96-7921-4ca1-af1f-dc8d2050bfe2

sample:

https://github.com/user-attachments/assets/6ecc76d3-1727-40cf-80c4-8fe353bf3ce6

acul3 avatar Aug 30 '24 05:08 acul3

@adelacvg

btw, I changed my GPT vocab size to 512 due to multilinguality.

I just changed the config:

  "gpt":{
    "model_dim":768,
    "max_mel_tokens":1600,
    "max_text_tokens":800,
    "heads":16,
    "mel_length_compression":1024,
    "use_mel_codes_as_input":true,
    "layers":10,
    "number_text_tokens":513,
    "number_mel_codes":8194,
    "start_mel_token":8192,
    "stop_mel_token":8193,
    "start_text_token":512,
    "train_solo_embeddings":false,
    "spec_channels":128

(I changed number_text_tokens and start_text_token.)

Is this correct?
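My reading of the token-count pattern (mirroring the mel side, where number_mel_codes = 8192 codes plus a start and a stop token) suggests the text side needs vocab_size + 1 slots, with the start token id equal to vocab_size. A quick sanity check under that assumption:

```python
# Sanity check for the text-token config fields, assuming the start token
# id sits immediately after the BPE vocab (my reading of the config, not
# a documented rule of the repo).
def check_text_token_config(vocab_size, number_text_tokens, start_text_token):
    assert start_text_token == vocab_size, "start token id should follow the vocab"
    assert number_text_tokens == vocab_size + 1, "embedding covers vocab + start token"
    return True

ok = check_text_token_config(512, number_text_tokens=513, start_text_token=512)
```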

thank you again

acul3 avatar Aug 30 '24 06:08 acul3

In the GPT step, the inference results are close to those of the VQ-VAE. You just need to ensure that the semantics are correct; after diffusion, they will become high quality.

adelacvg avatar Aug 31 '24 15:08 adelacvg

Make sure the reference mel is a short segment of audio to avoid GPT overfitting to the speaker condition. I have updated some parameters of the VQ-VAE, resulting in higher codebook utilization, which should lead to better results.
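The short-reference advice above could be implemented roughly like this (a hedged sketch; the crop length in frames is an illustrative assumption, not a value from the repo):

```python
# Condition GPT on a short random crop of the reference mel rather than
# the full utterance, so the model cannot overfit to a fixed speaker
# context. max_frames is an illustrative assumption.
import random

def crop_reference(mel_frames, max_frames=300, seed=None):
    """mel_frames: sequence of frames; returns a contiguous short crop."""
    rng = random.Random(seed)
    if len(mel_frames) <= max_frames:
        return mel_frames
    start = rng.randrange(len(mel_frames) - max_frames + 1)
    return mel_frames[start:start + max_frames]

crop = crop_reference(list(range(1000)), max_frames=300, seed=0)
```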

adelacvg avatar Aug 31 '24 15:08 adelacvg

@adelacvg btw, how can I run inference for the diffusion part? It seems api.py only provides vqvae and gpt (old commit).

Finished the GPT training and continuing with diff now.

acul3 avatar Sep 07 '24 18:09 acul3

The infer_diffusion function works the same way as the infer function; the do_spectrogram_diffusion part does the sampling process.

adelacvg avatar Sep 11 '24 13:09 adelacvg

@adelacvg have you gotten good results?

After training diff for 2 days, I get the same result as gpt (robotic sound, but the semantics are there).

acul3 avatar Sep 16 '24 05:09 acul3

After using the last commit, I finally got a good result. Thank you!

Any tips on how to make inference faster @adelacvg? (maybe tortoise-style)

acul3 avatar Sep 18 '24 16:09 acul3

For the GPT part, you can use acceleration frameworks similar to vLLM, which also support GPT-2. For the diffusion part, you can adopt faster sampling methods with fewer sampling steps. Alternatively, like XTTS, you can use a GAN instead of diffusion; although quality may decrease, it can be very fast for the timbres in the training dataset.
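The "fewer sampling steps" idea can be sketched as choosing an evenly strided subset of the training timesteps at inference, in the spirit of DDIM-style accelerated samplers (step counts here are illustrative):

```python
# Pick an evenly strided subset of the diffusion training timesteps so
# inference runs num_sample_steps iterations instead of num_train_steps.
def strided_timesteps(num_train_steps, num_sample_steps):
    """Return num_sample_steps timesteps, ordered from noisiest down to 0."""
    stride = num_train_steps // num_sample_steps
    ts = list(range(0, num_train_steps, stride))[:num_sample_steps]
    return ts[::-1]  # sample from high noise to low noise

ts = strided_timesteps(1000, 50)
```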

adelacvg avatar Sep 18 '24 17:09 adelacvg