
Coqui AI and Tortoise TTS

Open da3dsoul opened this issue 1 year ago • 16 comments

As per #885, Coqui has pros and cons. It's already written, so I'm PRing it. More soon (tm). When this is accepted, I'll try to write up some more info in the wiki... and figure out how to do that.

da3dsoul avatar Apr 12 '23 20:04 da3dsoul

Tortoise needs more testing by not me. I don't have the hardware to run it alongside any of the models I have. As per discussion in #885, this has 3 Tortoise implementations: official, fast, and MRQ

da3dsoul avatar Apr 15 '23 02:04 da3dsoul

@da3dsoul I used Coqui AI for some time but I'm not super impressed with the quality so far. Maybe I didn't use it in the right way.

Tortoise TTS' quality is unmatched so far. Even by 11Labs.

I have an M2 Max with 96GB RAM and I'd be happy to test anything on this hardware as long as support for Mac is improved. (I use Tortoise TTS fast, but only in a Colab instance because on a Mac, even with my SOTA hardware, it's excruciatingly slow).

> Tortoise needs more testing by not me. I don't have the hardware to run it alongside any of the models I have. As per discussion in #885, this has 3 Tortoise implementations: official, fast, and MRQ

system1system2 avatar Apr 17 '23 12:04 system1system2

I can't do anything about Mac support, since a lot of this stuff uses CUDA (Nvidia GPU acceleration). I found the quality of Coqui varies quite a lot depending on which model you use, as it has several choices. The speed and minimum requirements vary as well.

da3dsoul avatar Apr 17 '23 12:04 da3dsoul

Coqui is good enough: it generates an ok-ish voice in a few seconds. Tortoise would need its own GPU; it's not even about memory, it just takes a while.

Ph0rk0z avatar Apr 17 '23 14:04 Ph0rk0z

Coqui is about the same quality as Silero, which we already have. It's fine to support it, but I don't think it brings much new to the table. Tortoise does. And it looks like Bark will too.

St33lMouse avatar Apr 22 '23 03:04 St33lMouse

Coqui actually does have things it brings to the table. It has contextual text normalization of numbers and much better cadence and stressing of syllables in English. This leads to a much more natural-sounding sample, even if the audio quality is about the same. It's also faster than Tortoise, and it supports custom voices, unlike Silero, even if it isn't as high quality as Tortoise. I think it's actually a good in-between in terms of speed, spec requirements (GPU), and quality.
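To illustrate what "contextual text normalization of numbers" means in practice, here is a toy sketch (not Coqui's actual implementation): a normalizer expands digits into words before synthesis, so the engine reads "3" as "three" instead of spelling it out.

```python
# Toy illustration of TTS text normalization; NOT Coqui's real code.
# Expands standalone single digits into words before synthesis.
import re

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def normalize_numbers(text: str) -> str:
    # A real normalizer also handles years, ordinals, currency, etc.,
    # choosing the reading based on context.
    return re.sub(r"\b(\d)\b", lambda m: ONES[int(m.group(1))], text)

print(normalize_numbers("I have 3 cats"))  # -> "I have three cats"
```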

da3dsoul avatar Apr 22 '23 03:04 da3dsoul

Coqui can clone, silero can't. The cloning isn't that great but it's fast.

Ph0rk0z avatar Apr 22 '23 11:04 Ph0rk0z

Hi, many thanks for the work on this; voice cloning is super interesting. I am trying to test the Tortoise MRQ version in the 4-bit Colab on the free tier but failing to load the extension. I have added the extension flag, but it is missing things. Any ideas on what to add would be much appreciated:

Traceback (most recent call last):
  File "/content/text-generation-webui/modules/extensions.py", line 34, in load_extensions
    exec(f"import extensions.{name}.script")
  File "<string>", line 1, in <module>
  File "/content/text-generation-webui/extensions/tortoise_tts_mrq/script.py", line 11, in <module>
    from modules import chat, shared, tts_preprocessor
ImportError: cannot import name 'tts_preprocessor' from 'modules' (unknown location)

G-force78 avatar May 07 '23 09:05 G-force78

It's in modules/tts_preprocessor.py
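That import only resolves when the extension runs inside a webui checkout that actually ships modules/tts_preprocessor.py. A defensive sketch of what script.py could do instead of crashing on older checkouts (the `preprocess` helper name here is an assumption for illustration, not the confirmed API):

```python
# Hypothetical defensive import for an extension's script.py:
# fall back to a no-op when the webui's tts_preprocessor is absent.
try:
    from modules import tts_preprocessor
except ImportError:
    tts_preprocessor = None  # e.g. running on an older webui checkout

def clean_for_tts(text: str) -> str:
    if tts_preprocessor is None:
        return text  # no-op fallback: pass the text through unchanged
    return tts_preprocessor.preprocess(text)  # assumed helper name
```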

da3dsoul avatar May 07 '23 12:05 da3dsoul

Took a bit of a break. @oobabooga aside from merging and fixing the conflicts, what do you want done with this? I think it's pretty much ready

da3dsoul avatar May 08 '23 14:05 da3dsoul

@da3dsoul two questions:

  1. Can you give me a quick rundown on the conclusions of your testing? Have you adopted any of these as your daily driver instead of silero?
  2. Could it be possible to create a general "tts" extension with an output modifier and several submodules, one for each engine, like extensions/tts/silero.py, extensions/tts/elevenlabs.py, etc? Or is that too ambitious?

oobabooga avatar May 09 '23 02:05 oobabooga

  1. Yes, my thoughts on it from before hold up. Coqui is generally faster and less resource intensive. Tortoise (standard) is most likely to get updates and improve. Tortoise Fast is actually fast enough for me to consider using it. Tortoise MRQ has the most features. I have been running Tortoise Fast, as it's enough of an improvement over Coqui to take the speed hit imo.
  2. I think it's possible, but should be part of a more specialized and generified project. You would have more of an extension pipeline.

Pipeline:

  1. registration (tell the system if the extension is able to be set up and what the extension does. Good for first time users and UI)
  2. UI setup (mostly as we have it. It should be lightweight loading of things like dropdowns and defaults)
  3. pre-LLM (input modifiers)
  4. model loading (this step should be separate and can be ran at startup or on-demand with universal model swapping)
  5. LLM (we could theoretically generify the LLM systems and make them extensions)
  6. Post-LLM (output modifiers, pre-tts, could allow tts_preprocessors as extensions)
  7. TTS (takes the text as input and returns the audio file path. The audio player will be added by the core code)
  8. Model unloading (like the loading step. Could just not be ran or ran after every query for model swapping)
  9. Raw output modifiers (new step. For modifying the resulting html if desired)

Because of the registration step, each extension should have a separate file, register.py. In it, there should be methods for determining whether the extension's prerequisites are installed, as well as for defining the next part. Each extension would have metadata set in register.py, such as: the order to run in the pipeline (priority), a type (TTS, LLM, input/output processor, etc.), a variable storing its state and whether it errored (good for UI), and some basic UI info to display to the user, like a name and description.
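A minimal sketch of what that register.py contract could look like. Every name below is hypothetical; none of this exists in the webui today:

```python
# Hypothetical sketch of the proposed registration metadata.
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional

class ExtensionType(Enum):
    TTS = auto()
    LLM = auto()
    INPUT_PROCESSOR = auto()
    OUTPUT_PROCESSOR = auto()

@dataclass
class ExtensionInfo:
    name: str
    description: str
    ext_type: ExtensionType
    priority: int = 100           # lower runs earlier in the pipeline
    error: Optional[str] = None   # surfaced in the UI when set

    def prerequisites_met(self) -> bool:
        # e.g. check that optional dependencies are importable
        return True

REGISTRY: List[ExtensionInfo] = []

def register(info: ExtensionInfo) -> None:
    """Add an extension and keep the pipeline sorted by priority."""
    REGISTRY.append(info)
    REGISTRY.sort(key=lambda e: e.priority)

register(ExtensionInfo("coqui_tts", "Coqui TTS engine", ExtensionType.TTS, priority=50))
register(ExtensionInfo("silero_tts", "Silero TTS engine", ExtensionType.TTS, priority=60))
print([e.name for e in REGISTRY])  # -> ['coqui_tts', 'silero_tts']
```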

da3dsoul avatar May 09 '23 13:05 da3dsoul

I was able to test out your extensions this afternoon. Unfortunately the install procedures have been very tricky, as installing each extension results in dependency conflicts with each other. I was only able to get coqui_tts and tortoise_tts_mrq running, however tortoise refused to perform inference after the models loaded; no errors and no audio.

The primary conflicts I saw were with numpy, numba, librosa, similar to this issue here.

My suggestion would be to have an install similar to bark_tts as a separate repo, where the extension can be installed on its own.

Thanks for sharing your work, looking forward to testing new TTS implementations.

BuffMcBigHuge avatar May 10 '23 01:05 BuffMcBigHuge

Did you use the install scripts? They have some commands that edit the requirements.txt so it doesn't conflict with the base repo. I'm not sure about Coqui and Tortoise together, as Coqui is installed via pip, so its requirements are handled separately.
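As a rough sketch of what those edits amount to (hypothetical; the real install scripts may do this differently): drop pins for the packages reported above as clashing with the base repo, such as numpy, numba, and librosa, before handing the file to pip.

```python
# Hypothetical sketch: filter a requirements list so it doesn't re-pin
# packages known to clash with the base repo (per the reports above).
import re

CONFLICTING = {"numpy", "numba", "librosa"}

def filter_requirements(lines):
    """Drop requirement lines that would re-pin conflicting packages."""
    kept = []
    for line in lines:
        # Strip version specifiers / extras to get the bare package name.
        name = re.split(r"[=<>!~\[;]", line.strip(), maxsplit=1)[0].strip().lower()
        if name not in CONFLICTING:
            kept.append(line)
    return kept

reqs = ["TTS==0.13.3", "numpy>=1.22", "librosa==0.9.2", "tqdm"]
print(filter_requirements(reqs))  # -> ['TTS==0.13.3', 'tqdm']
```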

da3dsoul avatar May 10 '23 01:05 da3dsoul

Yup I used the install script provided.

The primary issue is that coqui_tts and its single pip dependency, 'TTS', has major conflicts with the Tortoise installs. I had to manually run pip commands to solve these conflicts. I also had to manually download vocoder.pth and put it in /models/tortoise, as it wasn't downloading on its own.

Just now, I was able to get tortoise_tts_fast working with some finagling. It's great!

BuffMcBigHuge avatar May 10 '23 01:05 BuffMcBigHuge

I can say, having used Silero more regularly now in Tavern, it is fast but very meh. On one GPU all of this was rather slow, but now that I have two, I'm going to see if the situation improves when dedicating a card specifically to TTS.

If you're going to have a TTS, might as well have a good one.

Ph0rk0z avatar May 10 '23 12:05 Ph0rk0z

@BuffMcBigHuge can you resolve the conflicts here so I can checkout this PR and test it? Excited about using Coqui

ksylvan avatar May 21 '23 08:05 ksylvan

> @BuffMcBigHuge can you resolve the conflicts here so I can checkout this PR and test it? Excited about using Coqui

Generally, it's the person at the top that you would tag. Sure I can update it

da3dsoul avatar May 21 '23 12:05 da3dsoul

Fixed those conflicts, finally. @ksylvan

da3dsoul avatar Jun 18 '23 23:06 da3dsoul

Hey, sorry if I'm annoying here; I'm a bit of a GitHub noob. This seems awesome and I want to try out these extensions. Can someone help me out? How on earth do I download this pull request on its own so I just have the files and can place them where they need to be? Very confused.

Thank you!

Urammar avatar Jun 28 '23 17:06 Urammar

> Hey, sorry if I'm annoying here; I'm a bit of a GitHub noob. This seems awesome and I want to try out these extensions. Can someone help me out? How on earth do I download this pull request on its own so I just have the files and can place them where they need to be? Very confused.
>
> Thank you!

Okay, I straight up asked GPT, learned git, and pulled it. Time to break things! Thanks for the massive effort you guys, will let you know how it goes!

Urammar avatar Jun 28 '23 18:06 Urammar

[screenshot]

So I ran the scripts in the extension directory (for Coqui I think it's just pip install -r requirements.txt in that directory?).

I'm guessing this means I don't have a model for this; tts_fast does a very similar thing, but there's no information on where to get one or where to put it once I have it. A little lost now, I have to say, sorry all.

Urammar avatar Jun 28 '23 18:06 Urammar

I've never seen that one before. Coqui TTS downloads its models by itself. I can maybe try to reproduce it today.

Tortoise Fast just needs the models copied from normal Tortoise. I'm not sure why they don't include them in the repo.

da3dsoul avatar Jun 28 '23 19:06 da3dsoul

[screenshot]

Okay, so... I was trying to run test.py as... you know... a test?

Seems like the issue is that tts can't be found, but if I type tts on its own, it gives me the syntax for using it?
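One likely culprit (an assumption, since the screenshot isn't visible here): the tts command being on the shell PATH doesn't mean the TTS package is importable from the Python interpreter running test.py, e.g. if the CLI and the script live in different environments. A quick diagnostic:

```python
# Quick diagnostic: are the CLI and the package visible from the SAME
# environment as the interpreter running this script?
import importlib.util
import shutil
import sys

print("python interpreter:", sys.executable)
print("tts CLI on PATH:", shutil.which("tts"))
print("TTS package importable:", importlib.util.find_spec("TTS") is not None)
```

If the CLI is found but the package isn't importable, installing TTS into the interpreter shown on the first line (or running test.py from the environment that owns the CLI) would be the thing to try.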

[screenshot]

Urammar avatar Jun 28 '23 19:06 Urammar

I wonder if Coqui updated and moved something

da3dsoul avatar Jun 28 '23 19:06 da3dsoul

Also, as far as Tortoise goes:

[screenshot]

I am really having no luck at all!

Urammar avatar Jun 28 '23 19:06 Urammar

Huh, which model is that? It does tell you what to check: BigVGAN is missing. I've never heard of that, but that's what it says.

da3dsoul avatar Jun 28 '23 19:06 da3dsoul

> I wonder if Coqui updated and moved something

Can confirm this is not the case. I installed fresh and it worked

da3dsoul avatar Jun 28 '23 19:06 da3dsoul

> Huh, which model is that? It does tell you what to check: BigVGAN is missing. I've never heard of that, but that's what it says.

That's just what it says out of the box trying to load the addon in the webui. I have no idea what it's referencing, but it's hardcoded in vocoder.py:

import torch
import torch.nn as nn
import torch.nn.functional as F

import json
from enum import Enum
from typing import Optional, Callable
from dataclasses import dataclass
try:
    from BigVGAN.models import BigVGAN as BVGModel
    from BigVGAN.env import AttrDict
except ImportError:
    raise ImportError(
        "BigVGAN not installed, can't use BigVGAN vocoder\n"
        "Please see the installation instructions on README."
    )

MAX_WAV_VALUE = 32768.0

Urammar avatar Jun 28 '23 19:06 Urammar

I downloaded pytorch_model.bin from here and threw it in the models folder, but that didn't change anything either.

Urammar avatar Jun 28 '23 19:06 Urammar

[screenshot]

Literally all I did was click this, apply, and restart the interface.

Also, ideally, shouldn't there be some kind of model loader box, dropdown, or something, instead of just erroring out in the terminal? I get that this is a work in progress.

Urammar avatar Jun 28 '23 19:06 Urammar

There should, yes, and that's why I'm confused

da3dsoul avatar Jun 28 '23 21:06 da3dsoul

[screenshot]

Okay, so I've done a complete reinstall of ooba, and it's working much better now. It even grabbed the model by itself, which is nice, and correctly installed those dependencies.

I'm now getting the following error when I generate text, though, despite the extension actually loading in the UI and everything else seeming like it's playing nice.

It does indeed generate text, but no audio.

Urammar avatar Jun 30 '23 16:06 Urammar