VoiceCraft
Tips to improve the quality of text to speech
Thanks for the great model!
Do you have any tips when using the model to clone voices for text to speech?
I'm converting the reference wav files to 16000 sample rate and the same format as the example wav file in the repo.
However, the performance of the model doesn't seem that great. It often can only mimic the general tone and gender of the reference and often has pauses or slurring.
I'm calling it like this:
def generate(self, wav_audio_file: pathlib.Path, audio_file_transcript: str, target_transcript: str) -> bytes:
    # take a look at demo/temp/mfa_alignments to decide which part of the audio to use as the prompt
    target_transcript = f"{audio_file_transcript} {target_transcript}"
    print(target_transcript)
    # NOTE: 3 sec of reference is generally enough for high-quality voice cloning, but longer is generally better; try e.g. 3~6 sec.
    audio_file_path = str(wav_audio_file)
    info = torchaudio.info(audio_file_path)
    audio_dur = info.num_frames / info.sample_rate
    # cut_off_sec = 4.01  # NOTE: according to the forced-alignment file demo/temp/mfa_alignments/84_121550_000074_000000.csv, the word "common" stops at 3.01 sec; this will differ for different audio
    # assert cut_off_sec < audio_dur, f"cut_off_sec {cut_off_sec} is larger than the audio duration {audio_dur}"
    # prompt_end_frame = int(cut_off_sec * info.sample_rate)
    prompt_end_frame = -1

    # run the model to get the output
    # hyperparameters for inference
    codec_audio_sr = 16000
    codec_sr = 50
    top_k = 0
    top_p = 0.8
    temperature = 1
    silence_tokens = [1388, 1898, 131]
    kvcache = 1  # NOTE: if OOM, change this to 0, or try the 330M model
    # NOTE: adjust the three arguments below if the generation is not good
    stop_repetition = 3  # NOTE: if the model generates long silences, reduce stop_repetition to 3, 2, or even 1
    sample_batch_size = 4  # NOTE: if there are long silences or unnaturally stretched words, increase sample_batch_size to 5 or higher. The model will then generate sample_batch_size samples for the same input and pick the shortest one, so if the generated speech is too fast, change this to a smaller number.
    seed = 1  # change the seed if you are still unhappy with the result
    seed_everything(seed)
Am I missing something? Thank you!
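(For anyone trying the commented-out cut_off_sec path: the end time of the last prompt word can be read from the forced-alignment CSV in demo/temp/mfa_alignments. The sketch below assumes the CSV has Begin, End, and Label columns and uses a hypothetical helper name; check the headers of your own alignment file before relying on it.)

```python
import csv
import io

def prompt_end_frame_from_alignment(csv_text: str, last_prompt_word: str, sample_rate: int) -> int:
    """Return the frame index just after `last_prompt_word` ends.

    Hypothetical helper: assumes Begin/End/Label columns in the
    forced-alignment CSV, with times in seconds.
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Label"].lower() == last_prompt_word.lower():
            return int(float(row["End"]) * sample_rate)
    raise ValueError(f"word {last_prompt_word!r} not found in alignment")

# Example with a made-up alignment:
alignment = "Begin,End,Label\n0.00,0.41,this\n0.41,0.92,is\n0.92,1.55,common\n"
print(prompt_end_frame_from_alignment(alignment, "common", 16000))  # 24800
```

The returned value would replace the hard-coded prompt_end_frame = -1 above, so the model only conditions on the aligned prompt span rather than the whole file.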
:+1: same question here!
I was able to produce some sound, but the quality is... mediocre. How can we improve it?
edit: changing the seed parameter and keeping the target transcript to only 1-2 sentences helped a bit (longer target transcripts cause the pitch to change for some reason).
How long is your target transcript? The model is trained on short utterances (average length 5 sec, although the longest training data goes up to 20 sec), so you might want to finetune it on long utterances if that's your testing scenario.
Without finetuning, you could try increasing sample_batch_size and decreasing stop_repetition.
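One simple way to act on the short-utterance advice is to split a long target into sentences, generate each one separately with the same reference prompt, and concatenate the waveforms afterwards. This is only a sketch: split_sentences below is a hypothetical helper based on a naive regex, not part of VoiceCraft.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., !, or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

target = "First sentence here. Second one follows! A third, shorter one?"
for chunk in split_sentences(target):
    print(chunk)
```

Each chunk would then go through generate() with the same reference audio, and the resulting waveforms could be joined along the time axis (e.g. with torch.cat).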
In general, the current model is not trained to do TTS - it's trained to do speech editing, but it happens to generalize to TTS. I'm finetuning the model on a TTS objective and will release that model soon.
Thank you! I was using reference audio up to 12 seconds long + target transcript which is about 4 seconds long.
I’ll try using a reference which is about 4 seconds + target of 4 seconds? Does that sound ok?
Also, when doing text to speech, I just concatenate the reference transcript and target transcript together and set prompt_end_frame to -1. Is that the correct thing to do?
Those all sound good.
Sometimes the speaker similarity can be a bit off - it's like the model uses a different voice than the prompt.
One thing I found that can improve speaker similarity in those situations is to make sure the prompt is not an entire sentence; it should instead be an unfinished sentence, so the model follows the voice better.
Due to the noisy nature of Gigaspeech, some of the training utterances have a speaker switch, i.e. two speakers take turns speaking in the same training utterance.
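One way to apply the "unfinished sentence" advice is to trim the last word or two (plus any trailing punctuation) off the prompt transcript, then cut the prompt audio at the matching word boundary via prompt_end_frame. unfinished_prompt below is a hypothetical helper sketch, not part of VoiceCraft:

```python
def unfinished_prompt(transcript: str, drop_words: int = 1) -> str:
    """Make a prompt transcript end mid-sentence by dropping the last
    few words and stripping trailing punctuation."""
    words = transcript.strip().split()
    if len(words) <= drop_words:
        return transcript
    trimmed = " ".join(words[:-drop_words])
    return trimmed.rstrip(".,!?;:")

print(unfinished_prompt("I think this is a common problem."))
# -> I think this is a common
```

Remember that the audio prompt has to be cut at the same word boundary (using the forced alignment) so the spoken prompt and the text prompt still match.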
The TTS finetuned 330M model is up, should be better than the 830M one
Thank you for the release of the fine-tuned 330M TTS model. Its performance and efficiency are impressive. Your work is greatly appreciated, and I'm keen to see how it evolves to further support real-time use cases. Are there plans to develop future models with an emphasis on optimizing for real-time TTS applications?