MeloTTS
Coldstart... how to make faster?
Hi there :)
First of all, thanks a lot! I finally managed to install it, run it, and generate speech successfully.
My current code is:
from MeloTTS.melo.api import TTS
from pydub import AudioSegment
from pydub.playback import play
# Speed is adjustable
speed = 1.0
# CPU is sufficient for real-time inference.
# You can set it manually to 'cpu' or 'cuda' or 'cuda:0' or 'mps'
device = 'auto' # Will automatically use GPU if available
# English
text = "Did you ever hear a folk tale about a giant turtle?"
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id
output_path = 'en-default.wav'
while True:
print("Enter message:")
text = input()
model.tts_to_file(text, speaker_ids['EN-Default'], output_path, speed=speed)
print("________")
sound = AudioSegment.from_wav(output_path)
play(sound)
The thing is... sometimes the generation is INSTANT, but most of the time there's a short lag.
I've noticed that this lag happens for both one-word generations and long-text generations, and they seem to take around the same time!
Hence: a cold start!
- Meaning that something is loaded over and over in the code, and perhaps there's a way to keep it loaded, for example by holding it in a local/global variable so that as little work as possible is redone each time a new text comes in to generate. A rough sketch of what I mean is below.
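To make it concrete, something like this warm-up right after loading the model is what I have in mind (just a rough, untested sketch; the throwaway text and the 'warmup.wav' file name are mine, and I'm only guessing that the lag comes from components that are lazily initialized on the first call):

from MeloTTS.melo.api import TTS

speed = 1.0
device = 'auto'

# Load the model once, outside the interactive loop (I already do this).
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id

# Hypothetical warm-up: run one short, throwaway generation right after
# construction, so anything that is only loaded on first use (weights moved
# to the device, text-processing models, etc.) is paid for once here instead
# of on the first real request.
model.tts_to_file("warm up", speaker_ids['EN-Default'], 'warmup.wav', speed=speed, quiet=True)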
My assumption is that somewhere in this code, a few fetches that have already been done before could be skipped:
def tts_to_file(self, text, speaker_id, output_path=None, sdp_ratio=0.2, noise_scale=0.6, noise_scale_w=0.8, speed=1.0, pbar=None, format=None, position=None, quiet=False,):
    ...
    for t in tx:
        if language in ['EN', 'ZH_MIX_EN']:
            t = re.sub(r'([a-z])([A-Z])', r'\1 \2', t)
        device = self.device
        bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
        with torch.no_grad():
            x_tst = phones.to(device).unsqueeze(0)
            tones = tones.to(device).unsqueeze(0)
            lang_ids = lang_ids.to(device).unsqueeze(0)
            bert = bert.to(device).unsqueeze(0)
            ja_bert = ja_bert.to(device).unsqueeze(0)
            x_tst_lengths = torch.LongTensor([phones.size(0)]).to(device)
            del phones
            speakers = torch.LongTensor([speaker_id]).to(device)
            audio = self.model.infer(
                x_tst,
                x_tst_lengths,
                speakers,
                tones,
                lang_ids,
                bert,
                ja_bert,
                sdp_ratio=sdp_ratio,
                noise_scale=noise_scale,
                noise_scale_w=noise_scale_w,
                length_scale=1. / speed,
            )[0][0, 0].data.cpu().float().numpy()
            del x_tst, tones, lang_ids, bert, ja_bert, x_tst_lengths, speakers
        audio_list.append(audio)
        torch.cuda.empty_cache()
    audio = self.audio_numpy_concat(audio_list, sr=self.hps.data.sampling_rate, speed=speed)
{my thoughts}:
Maybe you can somehow skip something like [torch.cuda.empty_cache() or torch.LongTensor([phones.size(0)]).to(device)].
Maybe some of those variables stay the same between calls, or there's a way to build part of them beforehand, reuse them from a previous call, etc. (see the sketch below).
I wouldn't complain, but I have a feeling you can get this to generate much faster. Please let me know if you get what I'm suggesting, and/or if there's a better/recommended way to generate faster and avoid the noticeable cold start of generations.
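To show the kind of reuse I mean, here is a rough, untested sketch (the names _speaker_cache and cached_speaker_tensor are mine, not part of MeloTTS): the speaker tensor only depends on speaker_id and the device, so it could be built once and handed back on later calls instead of being recreated for every sentence:

import torch

# Hypothetical module-level cache, keyed by (speaker_id, device).
_speaker_cache = {}

def cached_speaker_tensor(speaker_id, device):
    # Build the one-element LongTensor for this speaker only once per device,
    # then return the same tensor on every subsequent call.
    key = (speaker_id, str(device))
    if key not in _speaker_cache:
        _speaker_cache[key] = torch.LongTensor([speaker_id]).to(device)
    return _speaker_cache[key]

# Inside the loop above, the line
#     speakers = torch.LongTensor([speaker_id]).to(device)
# could then become
#     speakers = cached_speaker_tensor(speaker_id, device)
# and torch.cuda.empty_cache() could probably be skipped entirely when running on CPU.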
Thanks a lot and all the best!
I am also interested in this. There doesn't seem to have been any activity lately; is the unok commit irrelevant?
@unok, is this fixed?
Is there a solution for this?
@unok, is this fixed?
Sorry, I don't remember what I was doing at that time, so could you please disregard it?