VALL-E-X
Generated audio quality is low when making custom prompt
When I generate a prompt using a wav file, the generated audio is garbled.
I generated the prompt:
from utils.prompt_making import make_prompt
transcript = "I don't oppose war in all circumstances. And when I look over this crowd today I know there is no shortage of patriots or patriotism. What I do oppose is a dumb war."
make_prompt(name="obama_1", audio_prompt_path="DATA/barackobamafederalplaza.wav", transcript=transcript)
I generated audio samples:
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
import torch
# Download and load the model weights
preload_models()
# Sample 1: new text spoken with the custom prompt
text_prompt = """
Hello, my name is Nose. And uh, and I like hamburger. Hahaha... But I also have other interests such as playing tactic toast.
"""
audio_array = generate_audio(text_prompt, 'obama_1')
write_wav("DATA/obama_1_nose.wav", SAMPLE_RATE, audio_array)
# Sample 2: the same text as the prompt transcript
text_prompt = "I don't oppose war in all circumstances. And when I look over this crowd today I know there is no shortage of patriots or patriotism. What I do oppose is a dumb war."
audio_array = generate_audio(text_prompt, 'obama_1')
write_wav("DATA/obama_1_oracle.wav", SAMPLE_RATE, audio_array)
Original audio: https://www.dropbox.com/scl/fi/tylm6c2ffaw8qbqfqsblc/barackobamafederalplaza.wav?rlkey=wvvq0yi22t4dgynx4ehn6lcc8&dl=0
Generated audio:
https://www.dropbox.com/scl/fi/wqwgtfs4ao008kmc2e5np/obama_1_nose.wav?rlkey=yce5dt3gurpxt3sxxq4dadfr0&dl=0
https://www.dropbox.com/scl/fi/p05ns8jgsje3rt6l3wkuz/obama_1_oracle.wav?rlkey=x8ghpl4tcrb1fe001qqxt4j3r&dl=0
Prompt is too long, see FAQ for this issue
The wav clip the prompt is generated from is 12.66 seconds long. I see in the FAQ that training audio is kept under 22 seconds. Moreover, the make_prompt function does not complain about the prompt duration (which it did when I tried a wav file that was too long).
Hello. Do you have any follow-up advice for improving the quality of cloned voice samples? This is not the only prompt I used, just one example.
The TOTAL duration (prompt audio plus generated audio) should be less than 22 s, not each individual piece. Using a 3 s prompt usually delivers better results.
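For anyone who wants to check this before calling make_prompt, here is a minimal sketch using only Python's standard wave module. The helper names (wav_duration, trim_wav, generation_budget) are hypothetical and not part of VALL-E-X, and the sketch assumes an uncompressed PCM wav file; other encodings would need a different reader.

```python
import wave

MAX_TOTAL_SECONDS = 22.0  # total limit quoted from the FAQ in this thread
PROMPT_SECONDS = 3.0      # prompt length suggested in this thread


def wav_duration(path):
    """Return the duration of a PCM wav file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def trim_wav(src_path, dst_path, seconds=PROMPT_SECONDS):
    """Copy the first `seconds` of src_path into dst_path; return the new duration."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        n_frames = min(src.getnframes(), int(seconds * src.getframerate()))
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # the frame count in the header is fixed up on close
        dst.writeframes(frames)
    return n_frames / params.framerate


def generation_budget(prompt_seconds, limit=MAX_TOTAL_SECONDS):
    """Seconds left for generated audio once the prompt is counted."""
    return max(0.0, limit - prompt_seconds)
```

With the 12.66 s clip mentioned earlier, generation_budget(12.66) leaves roughly 9.3 s for generated speech, while a 3 s prompt leaves 19 s. The trimmed file still needs a transcript that matches only the words in the trimmed segment.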
I created a prompt using 3 seconds of audio. The generated audio has some problems.
https://www.dropbox.com/scl/fi/ewj55pxg9lpgtsf7e6ie7/barackobamafederalplaza_3s.wav?rlkey=py7e9vd88r3fxdl4m4nxok2ii&dl=0
https://www.dropbox.com/scl/fi/kcqn4zqmr8en5a9eqdo30/obama_2_nose.wav?rlkey=cgzcx79u89xbydbsmmofnhjrg&dl=0
C:\Users\profi\anaconda3\envs\valle\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Traceback (most recent call last):
File "c:\ai\VALL-E-X\launch-ui.py", line 629, in
Help!
Your prompt audio has very strong background noise. If you denoise it, you should get better results.
By the way, in my very brief experience with this repo, only prompt audio of TTS-level quality produces output of acceptable quality.
Even then, the pronunciation is often far from "natural". I am not sure whether that is related to the g2p module or simply due to the small amount of training data. My guess is that data quantity is the most important factor.
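To get a rough idea of how noisy a prompt clip is before using it, something like the following could help. It is only a crude heuristic sketch, not a denoiser and not part of VALL-E-X: it assumes 16-bit PCM wav input, the function names are hypothetical, and it simply compares the loudest frames with the quietest ones. It is most useful for comparing clips against each other, since no specific threshold is given anywhere in this thread.

```python
import array
import math
import sys
import wave


def frame_rms(path, frame_ms=20):
    """Return RMS values of consecutive frames of a 16-bit PCM wav file."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wf.getnchannels()
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # wav sample data is little-endian
    step = max(1, int(rate * frame_ms / 1000)) * channels
    rms = []
    for start in range(0, len(samples) - step + 1, step):
        chunk = samples[start:start + step]
        rms.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    return rms


def rough_snr_db(path, quiet_fraction=0.1):
    """Crude SNR in dB: level of the loudest frames over that of the quietest."""
    rms = sorted(frame_rms(path))
    if not rms:
        raise ValueError("clip is shorter than one frame")
    k = max(1, int(len(rms) * quiet_fraction))
    noise = sum(rms[:k]) / k     # quietest frames approximate the noise floor
    speech = sum(rms[-k:]) / k   # loudest frames approximate the speech level
    if noise == 0:
        return float("inf")      # digital silence in the quiet frames
    return 20 * math.log10(speech / noise)
```

A clip with steady background noise will score lower than a clean studio recording of the same speaker, because the quietest frames never drop near silence. The estimate breaks down for clips with no pauses at all, where even the quietest frames contain speech.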