VALL-E-X

Generated audio quality is low when making custom prompt

Status: Open. dchaws opened this issue 1 year ago • 7 comments

When I generate a prompt using a wav file, the generated audio is garbled.

I generated the prompt:

from utils.prompt_making import make_prompt
  
transcript = "I don't oppose war in all circumstances. And when I look over this crowd today I know there is no shortage of patriots or patriotism. What I do oppose is a dumb war."
make_prompt(name="obama_1", audio_prompt_path="DATA/barackobamafederalplaza.wav", transcript=transcript)

I generated audio samples:

from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
import torch
preload_models()
text_prompt = """
Hello, my name is Nose. And uh, and I like hamburger. Hahaha... But I also have other interests such as playing tactic toast.
"""
audio_array = generate_audio(text_prompt,'obama_1')
write_wav("DATA/obama_1_nose.wav",SAMPLE_RATE, audio_array)

text_prompt = "I don't oppose war in all circumstances. And when I look over this crowd today I know there is no shortage of patriots or patriotism. What I do oppose is a dumb war."
audio_array = generate_audio(text_prompt,'obama_1')
write_wav("DATA/obama_1_oracle.wav",SAMPLE_RATE, audio_array)

Original audio: https://www.dropbox.com/scl/fi/tylm6c2ffaw8qbqfqsblc/barackobamafederalplaza.wav?rlkey=wvvq0yi22t4dgynx4ehn6lcc8&dl=0

Generated audio:
https://www.dropbox.com/scl/fi/wqwgtfs4ao008kmc2e5np/obama_1_nose.wav?rlkey=yce5dt3gurpxt3sxxq4dadfr0&dl=0
https://www.dropbox.com/scl/fi/p05ns8jgsje3rt6l3wkuz/obama_1_oracle.wav?rlkey=x8ghpl4tcrb1fe001qqxt4j3r&dl=0

dchaws, Oct 16 '23 17:10

The prompt is too long; see the FAQ for this issue.

Plachtaa, Oct 17 '23 08:10

The wav clip the prompt is generated from is 12.66 seconds long. I see in the FAQ that training is kept under 22 seconds. Moreover, the make_prompt function does not complain about the prompt duration (which it did when I tried a wav file that was too long).

dchaws, Oct 17 '23 14:10
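A clip's length can be checked before it is passed to make_prompt. The following is a minimal sketch using only Python's standard `wave` module; the helper names and the `max_seconds` default are assumptions for illustration (the default simply mirrors the 22-second figure quoted from the FAQ) and are not part of VALL-E-X. The `wave` module reads PCM WAV files, so other encodings may need a different reader.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_prompt_duration(path, max_seconds=22.0):
    """Return (duration, is_short_enough) for a WAV file.

    max_seconds is a hypothetical threshold taken from the 22 s figure
    mentioned in the FAQ; adjust it to whatever limit applies.
    """
    duration = wav_duration_seconds(path)
    return duration, duration < max_seconds


# Example call (path taken from the report above):
#   duration, ok = check_prompt_duration("DATA/barackobamafederalplaza.wav")
#   print(f"{duration:.2f} s, within limit: {ok}")
```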

Hello. Do you have any follow-up advice for improving the quality of cloned voice samples? This is not the only prompt I used, just one example.

dchaws, Oct 18 '23 18:10

> The wav clip the prompt is generated from is 12.66 seconds long. I see in the FAQ that training is kept under 22 seconds. Moreover, the make_prompt function does not complain about the prompt duration (which it did when I tried a wav file that was too long).

The TOTAL duration should be less than 22s, not each individual piece. Using a 3s prompt usually delivers better results.

Plachtaa, Oct 18 '23 18:10
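One way to try the 3-second suggestion is to cut a short excerpt out of the original clip before building the prompt. This is a rough sketch with the standard `wave` module; `trim_wav` and its parameters are hypothetical names, not a VALL-E-X utility, and it assumes a PCM WAV input. The transcript passed to make_prompt would presumably need to cover only the words spoken in the excerpt.

```python
import wave


def trim_wav(src_path, dst_path, seconds=3.0, start=0.0):
    """Copy `seconds` of audio, beginning at `start` seconds, to dst_path.

    Returns the duration (in seconds) actually written, which is shorter
    than requested if the source ends early.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Clamp the start position so it never runs past the end of the file.
        src.setpos(min(int(start * rate), src.getnframes()))
        frames = src.readframes(int(seconds * rate))

    with wave.open(dst_path, "wb") as dst:
        # Reuse the source format; the frame count in the header is
        # corrected automatically when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    bytes_per_frame = params.sampwidth * params.nchannels
    return len(frames) / (bytes_per_frame * rate)


# Example call (output file name is illustrative):
#   trim_wav("DATA/barackobamafederalplaza.wav", "DATA/prompt_3s.wav", seconds=3.0)
```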

I created a prompt using 3 seconds of audio. The generated audio has some problems.

https://www.dropbox.com/scl/fi/ewj55pxg9lpgtsf7e6ie7/barackobamafederalplaza_3s.wav?rlkey=py7e9vd88r3fxdl4m4nxok2ii&dl=0

https://www.dropbox.com/scl/fi/kcqn4zqmr8en5a9eqdo30/obama_2_nose.wav?rlkey=cgzcx79u89xbydbsmmofnhjrg&dl=0

dchaws, Oct 19 '23 17:10

C:\Users\profi\anaconda3\envs\valle\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Traceback (most recent call last):
  File "c:\ai\VALL-E-X\launch-ui.py", line 629, in <module>
    main()
  File "c:\ai\VALL-E-X\launch-ui.py", line 528, in main
    upload_audio_prompt = gr.Audio(label='uploaded audio prompt', source='upload', interactive=True)
  File "C:\Users\profi\anaconda3\envs\valle\lib\site-packages\gradio\component_meta.py", line 146, in wrapper
    return fn(self, **kwargs)
TypeError: Audio.__init__() got an unexpected keyword argument 'source'

Help!

Snezhana92, Nov 01 '23 18:11
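The TypeError above is unrelated to prompt quality. It is consistent with running launch-ui.py against Gradio 4.x, where, to the best of my knowledge, `gr.Audio` accepts a `sources` list instead of the older `source` string. The sketch below picks the keyword based on the installed version; `audio_upload_kwargs` is a hypothetical helper, not part of VALL-E-X or Gradio, and the version cutoff is an assumption to verify against the Gradio changelog.

```python
def audio_upload_kwargs(gradio_version):
    """Build keyword arguments for the upload widget from launch-ui.py.

    Assumption: Gradio releases with major version >= 4 expect
    sources=["upload"], while earlier releases expect source="upload".
    """
    major = int(gradio_version.split(".")[0])
    kwargs = {"label": "uploaded audio prompt", "interactive": True}
    if major >= 4:
        kwargs["sources"] = ["upload"]
    else:
        kwargs["source"] = "upload"
    return kwargs


# Possible use at the failing line in launch-ui.py (requires gradio):
#   import gradio as gr
#   upload_audio_prompt = gr.Audio(**audio_upload_kwargs(gr.__version__))
```

An alternative that avoids code changes may be to install a Gradio release older than 4.0 in the environment, if the repository's requirements allow it.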

Your prompt audio has very strong background noise. If you denoise it, you should get a better result.

By the way, in my very brief experience with this repo, only prompt audio of TTS-level quality produces output of acceptable quality.

Even so, the pronunciation is often far from "natural". I am not sure whether that is related to the g2p module or merely due to the small amount of training data; I guess data quantity is the most important factor.

treya-lin, Nov 02 '23 10:11