
Text to speech models in GGML?

Open simplejackcoder opened this issue 2 years ago • 94 comments

@ggerganov do you have any interest in producing more models in GGML format?

I'm now convinced your zero-dependency, no-memory-allocation, CPU-first ideology will make these models accessible to everyone.

You've got LLaMA and Whisper; what remains is the reverse of Whisper: text to speech.

What are your thoughts?

simplejackcoder avatar Apr 01 '23 15:04 simplejackcoder

How about using vall-e?

simplejackcoder avatar Apr 01 '23 16:04 simplejackcoder

> How about using vall-e?

AFAIK Microsoft has not released the weights of VALL-E. They just uploaded the paper to arXiv and set up a demo website with some generated samples.

noe avatar Apr 01 '23 16:04 noe

@ggerganov I hope you make a text-to-speech example in C++.

gavsidua avatar Apr 04 '23 12:04 gavsidua

Here is a TTS PyTorch model with available weights: https://github.com/r9y9/deepvoice3_pytorch. I would be particularly interested in the implemented "nyanko" model (described in https://aclanthology.org/2020.lrec-1.789.pdf). There are several stages of pre-processing in Python, but if the model can be ported, porting those to C/C++ could be done afterwards. @ggerganov, what's your assessment of the level of difficulty?

Martin-Laclaustra avatar Apr 10 '23 23:04 Martin-Laclaustra

UP

flosserblossom avatar Apr 16 '23 08:04 flosserblossom

I'm interested in implementing a TTS using ggml, but don't have capacity atm - there are other priorities. Also, I don't think it is worth implementing a model from 3-4 years ago. It should be SOTA. What is SOTA atm?

VALL-E looks like a good candidate - but no weights.

ggerganov avatar Apr 16 '23 09:04 ggerganov

> VALL-E looks like a good candidate - but no weights.

It seems quite demanding in terms of required training data (60k hours). Aiming for VALL-E X (multilingual) would be the natural choice (this requires 70k hours), but apparently (per the paper) it has only been tested on 2 languages so far. I think it is very unlikely that they release the model, and it would be difficult to train a community-based one (at least for a breadth of languages). It might also be quite demanding for inference (I know ggml is reaching unbelievable achievements through quantization, etc., but still...).

On the contrary, the one I proposed (nyanko) gets acceptable quality with only ~20 hours (yes, hours!) of training data, and it can be trained per language in just 3 days on a single GPU (single speaker). I trained models for 3 speakers (1 non-English language). Let me know if you would like to listen to the samples or test the Python implementation. Besides, Python inference on CPU is already real-time on modern systems. It would have outstanding performance implemented in C.

I believe a desirable TTS would be a "universal language" direct Unicode-text-to-WAV converter, but I have not been able to spot such a model.

Martin-Laclaustra avatar Apr 19 '23 00:04 Martin-Laclaustra

With respect to VALL-E, there are 2 unofficial PyTorch implementations; neither implements VALL-E X (multilingual), and neither has released weights (due to ethical concerns?): https://github.com/enhuiz/vall-e https://github.com/lifeiteng/vall-e I do not have details on the weight sizes or training/inference requirements.

Compare that to a multilingual TTS with lots of available languages: Larynx, https://github.com/rhasspy/larynx. The quality seems a bit lower, but the training work is done. One may wonder what the real advantage of using ggml would be in this case.

Martin-Laclaustra avatar Apr 19 '23 05:04 Martin-Laclaustra

they don't provide any code, but

https://speechresearch.github.io/naturalspeech2/ https://arxiv.org/abs/2304.09116

> We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.

still more diffusion models ...

Green-Sky avatar Apr 19 '23 17:04 Green-Sky

This looks like the best candidate now: https://github.com/suno-ai/bark

ggerganov avatar Apr 20 '23 19:04 ggerganov

> This looks like the best candidate now: https://github.com/suno-ai/bark

By far, since they provide the models (a bit over 12 GB).

Green-Sky avatar Apr 20 '23 22:04 Green-Sky

> This looks like the best candidate now: https://github.com/suno-ai/bark

Their voice creation got reverse-engineered: https://github.com/serp-ai/bark-with-voice-clone

Green-Sky avatar Apr 22 '23 11:04 Green-Sky

What about https://github.com/snakers4/silero-models ?

x066it avatar Apr 26 '23 01:04 x066it

A new paper came out called Tango looks pretty good, also using LLMs

mattkanwisher avatar Apr 28 '23 10:04 mattkanwisher

While Tango ~~looks~~ sounds cool, it's a text-to-audio model, not a text-to-speech model.

Green-Sky avatar Apr 28 '23 16:04 Green-Sky

@ggerganov is there any possibility that Bark, ported to C++, would be feasible to run on constrained devices like iPhones? E.g. a device with 4 GB RAM and a tolerable model-size limit in the low hundreds of MB.

dennislysenko avatar May 09 '23 23:05 dennislysenko

@dennislysenko

> by far. since they provide the models. (a bit over 12gig)

even with the best "4bit" quantization imaginable, you still end up with ~3gigs for model files
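As a back-of-the-envelope check (assuming the ~12 GB figure above refers to fp16 weights, and ignoring the per-block scale overhead that real 4-bit formats add), the arithmetic behind the ~3 GB estimate is just a bit-width ratio:

```python
def quantized_size_gb(fp16_size_gb: float, bits_per_weight: float) -> float:
    """Rough lower bound: scale 16-bit weight storage down to the target bit width.

    Ignores quantization block metadata and non-weight tensors, so real
    files come out somewhat larger.
    """
    return fp16_size_gb * bits_per_weight / 16.0

# Bark's full model set is reported above as "a bit over 12gig".
print(quantized_size_gb(12.0, 4.0))  # 3.0 (GB), matching the "~3gigs" estimate
```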

Green-Sky avatar May 09 '23 23:05 Green-Sky

@Green-Sky

> even with the best "4bit" quantization imaginable, you still end up with ~3gigs for model files

Is the quantization referring to the "smaller model" released 05-01?

dennislysenko avatar May 10 '23 00:05 dennislysenko

@dennislysenko No, I was talking about ggml; not sure what changes they made in 1.5.

Green-Sky avatar May 10 '23 00:05 Green-Sky

@Green-Sky Seems like they refer to smaller model cards as low as 2GB in their README now:

> The full version of Bark requires around 12Gb of memory to hold everything on GPU at the same time. However, even smaller cards down to ~2Gb work with some additional settings.

05-01 release notes mention:

> We also added an option for a smaller version of Bark, which offers additional speed-up with the trade-off of slightly lower quality.

In theory, could this mean with 4x quantization, it's possible to target ~500MB VRAM?
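For a rough sanity check (assuming, hypothetically, that the ~2 GB "smaller cards" figure corresponds to fp16 weights), a 4x reduction from ideal 4-bit quantization does land near that target:

```python
# Hypothetical back-of-the-envelope; real GGML 4-bit formats store extra
# per-block scale factors, so actual files would be somewhat larger.
smaller_model_gb = 2.0   # Bark README's "smaller cards down to ~2Gb" figure
bits_per_weight = 4.0    # idealized 4-bit quantization vs. 16-bit fp16
estimate_gb = smaller_model_gb * bits_per_weight / 16.0
print(f"{estimate_gb * 1024:.0f} MB")  # 512 MB, i.e. roughly the ~500 MB target
```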

dennislysenko avatar May 10 '23 00:05 dennislysenko

> it's possible to target ~500MB VRAM?

@dennislysenko ggml using VRAM is very optional; by default ggml only uses RAM and the CPU. :)

> In theory, could this mean with 4x quantization,

Their description is very obscure and I don't have the time to look at the code, so maybe.

Green-Sky avatar May 10 '23 12:05 Green-Sky

Is there any update on text to speech?

afyacnkep avatar Jun 05 '23 13:06 afyacnkep

It's in the roadmap now: https://github.com/ggerganov/llama.cpp/discussions/1729

gut4 avatar Jun 07 '23 07:06 gut4

I personally will look into TTS after finishing the SAM implementation. Maybe someone else is already working on TTS inference

ggerganov avatar Jun 18 '23 07:06 ggerganov

It seems that the unlocked Bark with voice cloning is here now: https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer This is a necessary step for a complete system. It is worth having a look at the open and closed issues there to get an overview of the multiple models needed.

Martin-Laclaustra avatar Jun 18 '23 16:06 Martin-Laclaustra

Also, in the llama.cpp May 2023 roadmap, a recent comment suggests a "drop-in replacement for EnCodec", which may (or may not) be easier to implement.

Martin-Laclaustra avatar Jun 18 '23 16:06 Martin-Laclaustra

The original Bark sounded artificial to me. The voice cloning repos (both of them) sound amazing already!

Combine that with LLM text generation and the inference speed we already see, and we have real-time generative speech output.

cmp-nct avatar Jun 25 '23 15:06 cmp-nct

> it's in roadmap now ggerganov/llama.cpp#1729

That's great news! My only complaint about Bark is its speed... your magic touch would be ✨✨✨

kskelm avatar Jul 14 '23 07:07 kskelm

There is now a tracking issue for Bark (https://github.com/ggerganov/ggml/issues/388), which links to https://github.com/PABannier/bark.cpp (WIP) :partying_face:

Green-Sky avatar Jul 15 '23 08:07 Green-Sky

While I'm not an expert by any means, VITS in Coqui TTS is almost real-time on CPU (I tested on a mid-range laptop CPU). With ggml and a good quantization where possible, it could almost certainly be real-time, maybe even playing to speakers in real time too. Just a thought.
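The usual way to make an "almost real-time" claim like this concrete is the real-time factor (RTF): synthesis wall-clock time divided by the duration of the audio produced, with RTF < 1.0 meaning faster than playback. A minimal sketch (the sample timings below are illustrative, not measurements):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the model generates audio faster than it plays back."""
    return synthesis_seconds / audio_seconds

# Illustrative numbers only: say it took 4 s of CPU time to synthesize a 5 s clip.
rtf = real_time_factor(4.0, 5.0)
print(rtf, "real-time capable" if rtf < 1.0 else "slower than real time")
```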

TechnotechGit avatar Jul 19 '23 09:07 TechnotechGit