MeloTTS
Coldstart... how to make faster?
Hi there :)
First of all, thanks a lot! I finally managed to install it, run it, and generate speech successfully.
My current code is:
from MeloTTS.melo.api import TTS
from pydub import AudioSegment
from pydub.playback import play
# Speed is adjustable
speed = 1.0
# CPU is sufficient for real-time inference.
# You can set it manually to 'cpu' or 'cuda' or 'cuda:0' or 'mps'
device = 'auto' # Will automatically use GPU if available
# English
text = "Did you ever hear a folk tale about a giant turtle?"
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id
output_path = 'en-default.wav'
while True:
print("Enter message:")
text = input()
model.tts_to_file(text, speaker_ids['EN-Default'], output_path, speed=speed)
print("________")
sound = AudioSegment.from_wav(output_path)
play(sound)
The thing is... sometimes the generation is INSTANT, but most of the time there's a short lag.
I've noticed that this lag happens for both one-word generations and long-text generations, and they seem to take around the same time!
Hence: a cold start!
- Meaning that something is loaded over and over in the code, and perhaps there's a way to keep it loaded, for example by holding it in a local/global variable so that as little work as possible is redone each time a new text comes in to generate. A rough sketch of what I mean is below.
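To make it concrete, something like this warm-up right after loading the model is what I have in mind (just a rough, untested sketch; the throwaway text and the 'warmup.wav' file name are mine, and I'm only guessing that the lag comes from components that are lazily initialized on the first call):

from MeloTTS.melo.api import TTS

speed = 1.0
device = 'auto'

# Load the model once, outside the interactive loop (I already do this).
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id

# Hypothetical warm-up: run one short, throwaway generation right after
# construction, so anything that is only loaded on first use (weights moved
# to the device, text-processing models, etc.) is paid for once here instead
# of on the first real request.
model.tts_to_file("warm up", speaker_ids['EN-Default'], 'warmup.wav', speed=speed, quiet=True)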
My assumption is that somewhere in this code, a few fetches that have already been done before could be skipped:
def tts_to_file(self, text, speaker_id, output_path=None, sdp_ratio=0.2, noise_scale=0.6, noise_scale_w=0.8, speed=1.0, pbar=None, format=None, position=None, quiet=False,):
    ...
    for t in tx:
        if language in ['EN', 'ZH_MIX_EN']:
            t = re.sub(r'([a-z])([A-Z])', r'\1 \2', t)
        device = self.device
        bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
        with torch.no_grad():
            x_tst = phones.to(device).unsqueeze(0)
            tones = tones.to(device).unsqueeze(0)
            lang_ids = lang_ids.to(device).unsqueeze(0)
            bert = bert.to(device).unsqueeze(0)
            ja_bert = ja_bert.to(device).unsqueeze(0)
            x_tst_lengths = torch.LongTensor([phones.size(0)]).to(device)
            del phones
            speakers = torch.LongTensor([speaker_id]).to(device)
            audio = self.model.infer(
                x_tst,
                x_tst_lengths,
                speakers,
                tones,
                lang_ids,
                bert,
                ja_bert,
                sdp_ratio=sdp_ratio,
                noise_scale=noise_scale,
                noise_scale_w=noise_scale_w,
                length_scale=1. / speed,
            )[0][0, 0].data.cpu().float().numpy()
            del x_tst, tones, lang_ids, bert, ja_bert, x_tst_lengths, speakers
        audio_list.append(audio)
        torch.cuda.empty_cache()
    audio = self.audio_numpy_concat(audio_list, sr=self.hps.data.sampling_rate, speed=speed)
{my thoughts}:
Maybe you can somehow skip something like [torch.cuda.empty_cache() or torch.LongTensor([phones.size(0)]).to(device)].
Maybe some of those variables stay the same between calls, or there's a way to build part of them beforehand, reuse them from a previous call, etc. (see the sketch below).
I wouldn't complain, but I have a feeling you can get this to generate much faster. Please let me know if you get what I'm suggesting, and/or if there's a better/recommended way to generate faster and avoid the noticeable cold start of generations.
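To show the kind of reuse I mean, here is a rough, untested sketch (the names _speaker_cache and cached_speaker_tensor are mine, not part of MeloTTS): the speaker tensor only depends on speaker_id and the device, so it could be built once and handed back on later calls instead of being recreated for every sentence:

import torch

# Hypothetical module-level cache, keyed by (speaker_id, device).
_speaker_cache = {}

def cached_speaker_tensor(speaker_id, device):
    # Build the one-element LongTensor for this speaker only once per device,
    # then return the same tensor on every subsequent call.
    key = (speaker_id, str(device))
    if key not in _speaker_cache:
        _speaker_cache[key] = torch.LongTensor([speaker_id]).to(device)
    return _speaker_cache[key]

# Inside the loop above, the line
#     speakers = torch.LongTensor([speaker_id]).to(device)
# could then become
#     speakers = cached_speaker_tensor(speaker_id, device)
# and torch.cuda.empty_cache() could probably be skipped entirely when running on CPU.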
Thanks a lot and all the best!
I am also interested in this. There doesn't seem to have been any activity lately; is the unok commit irrelevant?
@unok, is this fixed?
Is there a solution for this?
@unok, is this fixed?
Sorry, I don't remember what I was doing at that time, so could you please disregard it?