glados-tts
RuntimeError: Unknown qengine
First of all, before the actual issue: I'd suggest adding a requirements.txt file so pip can install everything up front, instead of having to install each library only after hitting an error that it's missing. Now, the actual issue: I get a RuntimeError from torch when it tries to load glados.pt, raised from serialization.py line 162. I'll try to copy the error below.
```
Traceback (most recent call last):
  File "C:\Users\ilyes\Desktop\glados-tts\glados.py", line 9, in
```
I hope I can fix it and try the model out on my CPU. Also, please add instructions for training our own models with our own voices/datasets. Thanks, and I'm excited to see this thing grow to be super fast even on a junky CPU like the Celeron 3060 that I have.
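From what I can tell, the "Unknown qengine" RuntimeError means the quantized model expects a quantization backend ("qengine") that my PyTorch build doesn't recognize. Here's a rough sketch of what I mean — the engine names and the models/glados.pt path are guesses on my part, and upgrading PyTorch may well be the real fix:

```python
import torch

# Quantized TorchScript models need a quantization backend ("qengine").
# "Unknown qengine" usually means the model was saved for an engine that
# this PyTorch build doesn't provide, so first see what is available.
print(torch.backends.quantized.supported_engines)  # e.g. ['none', 'fbgemm']

# fbgemm is the usual x86 CPU backend; qnnpack is the ARM one.
# Which of these your build supports is an assumption here.
if "fbgemm" in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = "fbgemm"
elif "qnnpack" in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = "qnnpack"

# The path is a guess; adjust it to wherever glados.pt lives in your checkout.
glados = torch.jit.load("models/glados.pt", map_location="cpu")
```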
Wow, I haven't checked on this repo in a looong time. So sorry I never saw this.

The "vocoder-cpu-hq.pt" and "vocoder-cpu-lq.pt" models are optimized for CPU inference. Note that the low-quality model sounds considerably worse. I can run this model on my CPU without issues; perhaps you are running an older version of PyTorch?

Unfortunately, if you are running a Celeron 3060, I don't think you'll be able to train your own model :(. You need a fast CPU and a GPU with at least 12 or so GB of VRAM. I personally used a laptop with an RX 6800M (back before that GPU was even officially capable of training NNs!), which has 12 GB of VRAM.

Regardless, Forward Tacotron is trained in two stages. First, a normal Tacotron model is trained, and then features from that model are used to teach a second model which does not use traditional "attention". This is what allows this TTS to produce passages of pretty much infinite length without becoming word salad.

Basically, I trained the normal Tacotron model on pure LJSpeech until it sounded coherent, and then swapped the dataset for a set made of every GLaDOS voice line from Portal 2. Eventually, it began sounding like GLaDOS. I then used this data to train Forward Tacotron. Even with just 600 audio clips, the model turned out quite good. Since Forward Tacotron is not expected to "learn" the rhythm and spacing of the syllables from nothing (Tacotron provides them), it can work with much smaller datasets. It basically just has to figure out the rhythm and timbre of a speaker rather than comprehend the whole English language first. (Fun fact: the first Tacotron model was not even given phonemes! It had to learn pronunciation all by itself.)

With even smaller datasets, one possible method would be to transfer-learn with the first Tacotron, generate all of LJSpeech using this TTS, train Forward Tacotron on that data until the speech is coherent, and then finetune it with the original recordings.

Forward Tacotron is extremely fast. Most of the slowdown is actually in the vocoder.
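To make the CPU-inference part concrete, something like this should work — treat it as a sketch, since the call signatures below are assumptions rather than the repo's actual code:

```python
import torch

# Sketch only: load the TTS model plus the CPU-optimized vocoder mentioned
# above.  The file names come from this thread; the models/ directory and
# the forward-call signatures are assumptions.
device = "cpu"
glados = torch.jit.load("models/glados.pt", map_location=device).eval()
vocoder = torch.jit.load("models/vocoder-cpu-hq.pt", map_location=device).eval()
# Swap in "models/vocoder-cpu-lq.pt" for the faster, lower-quality model.

with torch.no_grad():
    # 'tokens' stands in for whatever phoneme/token tensor the repo's text
    # front end produces -- this part is hypothetical.
    tokens = torch.zeros(1, 10, dtype=torch.long)
    mel = glados(tokens)    # assumed: the TTS model returns a mel spectrogram
    audio = vocoder(mel)    # assumed: the vocoder turns mel frames into waveform samples
```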
Another important thing to note when training your own models is that punctuation is actually extremely important. The stock GLaDOS voice dataset does not have punctuation. I had to add it manually. Makes a HUGE difference.
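For example, here is roughly the difference between a raw transcript line and what I'd actually feed the model — the clip ID and the pipe-delimited, LJSpeech-style layout are just illustrative:

```python
# One "clip_id|transcript" entry per line, roughly LJSpeech-style (illustrative).
before = "glados_0412|well done here come the test results you are a horrible person"
after  = "glados_0412|Well done. Here come the test results: you are a horrible person."

# Punctuation gives the model explicit phrase boundaries and sentence-final
# intonation targets, which is why it makes such a big difference to prosody.
```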