Feature Request: Add support for Kokoro TTS
Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Devs, can you add support for Kokoro TTS? It's awesome in terms of accents and natural tone, considering its size. It is currently one of the most popular models in Pendrokar's TTS Arena space on Hugging Face. Thanks! https://huggingface.co/hexgrad/Kokoro-82M
Motivation
Many people, including me, want to deploy it on CPU/edge devices.
Possible Implementation
No response
+1
+1. The claim is that it's faster than realtime on the Mac.
+1
+1
+1
+1
+1
+1 🎯
+1
+1
+1
+1
+2
+1
+1
+1
+1 Would be cool to see more tts options in llama.cpp
These can be reproduced at https://hf.co/spaces/hexgrad/Kokoro-TTS without installing anything.
I'm sorry Dave, I'm afraid I can't do that. https://github.com/ggerganov/llama.cpp/pull/10784#issue-2733486635
ˌIm sˈɔɹi dˈAv, ˌIm əfɹˈAd ˌI kˈænt dˈu ðˈæt.
https://github.com/user-attachments/assets/d80f9d68-d7d4-4b84-bd7b-26c6ae87ad38
TTS requires 2 models to be provided: an LLM and a Vocoder. The first one generates audio codes (tokens) from the provided input text, based on some voice settings. The second one converts the audio codes into a spectrogram. The spectrogram is then converted back to audio with inverse FFT. https://github.com/ggerganov/llama.cpp/pull/10784#issuecomment-2536969458
tˌitˌiˈɛs ɹəkwˈIəɹz tˈu mˈɑdᵊlz tə bi pɹəvˈIdᵻd: ɐn ˌɛlˌɛlˈɛm ænd ɐ vˈOkˌOdəɹ. ðə fˈɜɹst wˈʌn ʤˈɛnəɹˌAts ˈɔdiO kˈOdz (tˈOkᵊnz) fɹʌm ðə pɹəvˈIdᵻd ˈɪnpˌʊt tˈɛkst, bˈAst ˌɔn sˌʌm vˈYs sˈɛTɪŋz. ðə sˈɛkənd wˈʌn kənvˈɜɹts ði ˈɔdiO kˈOdz ˈɪntu ɐ spˈɛktɹəɡɹˌæm. ðə spˈɛktɹəɡɹˌæm ɪz ðˈɛn kənvˈɜɹTᵻd bˈæk tʊ ˈɔdiO wɪð ˈɪnvˌɜɹs ˌɛfˌɛftˈi.
https://github.com/user-attachments/assets/7189a07a-2144-4815-a41c-aa0679bdefff
Not sure how to pass punctuation yet. Or even if this model supports it. https://github.com/ggerganov/llama.cpp/pull/10784#issuecomment-2536969458
nˌɑt ʃˈʊɹ hˌW tə pˈæs pˌʌŋkʧəwˈAʃən jˈɛt. ˌɔɹ ˈivən ɪf ðɪs mˈɑdᵊl səpˈɔɹts ɪt.
https://github.com/user-attachments/assets/4f3de736-7af5-4b07-bd3e-852478cc847e
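For anyone unfamiliar with the last step in the pipeline description quoted above (spectrogram back to audio with an inverse FFT), here is a minimal sketch. It is not taken from llama.cpp or Kokoro: the function names (`stft`, `istft`, `hann`), the frame length, and the hop size are illustrative assumptions, and a naive DFT stands in for a real FFT so the example needs only the standard library.

```python
import cmath
import math


def dft(frame):
    """Naive forward DFT of one frame (O(n^2), fine for a small demo)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; keeps only the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def hann(n):
    """Periodic Hann window; copies shifted by n // 2 sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft(signal, frame_len, hop):
    """Cut the signal into overlapping windowed frames and transform each one."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append(dft(chunk))
    return frames


def istft(frames, frame_len, hop):
    """Inverse-transform each frame, then overlap-add the frames into one waveform."""
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for idx, spectrum in enumerate(frames):
        for i, sample in enumerate(idft(spectrum)):
            out[idx * hop + i] += sample
    return out


# Round trip on a toy sine wave (assumed sizes: 16-sample frames, 50% overlap).
signal = [math.sin(2 * math.pi * 3 * t / 64) for t in range(64)]
frames = stft(signal, frame_len=16, hop=8)
audio = istft(frames, frame_len=16, hop=8)
```

With a Hann window and 50% overlap the interior samples are reconstructed to within floating-point error; the first and last `hop` samples are covered by only one frame, so they come back attenuated. A real vocoder would produce the spectrogram frames from audio codes rather than from an existing waveform.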
@hexgrad, were those reproduced with a C++ implementation?
@namhkoh No, it's Python & PyTorch, as I mentioned https://github.com/ggerganov/llama.cpp/issues/11050#issuecomment-2628700821
These can be reproduced at https://hf.co/spaces/hexgrad/Kokoro-TTS without installing anything.
There is an ONNX/C# implementation of Kokoro here: https://github.com/Lyrcaxis/KokoroSharp
But I think (not sure) it's using espeak as the phonemiser, which is different from how the Python & PyTorch version works. Doesn't that one use G2P?
Am I correct here, @hexgrad?
I am currently seeking a c++ implementation.
You need G2P to make the whole thing work, but llama.cpp can probably disregard that piece for now. The C++ scope for llama.cpp would likely just be porting the modeling code in these 3 files:
- https://github.com/hexgrad/kokoro/blob/1145c0b7f6f3c781d35b1b67a283a32580bc5acd/kokoro/model.py
- https://github.com/hexgrad/kokoro/blob/1145c0b7f6f3c781d35b1b67a283a32580bc5acd/kokoro/modules.py
- https://github.com/hexgrad/kokoro/blob/1145c0b7f6f3c781d35b1b67a283a32580bc5acd/kokoro/istftnet.py
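To illustrate where that boundary might sit, here is a small sketch of the hand-off between G2P and the modeling code. It assumes the model consumes one integer ID per phoneme symbol, with a pad token on each side. The vocabulary below is a made-up toy table and `phonemes_to_ids` is a hypothetical helper, not a function from the Kokoro repository; the real symbol table ships with the model.

```python
# Hypothetical toy vocabulary: the IDs here are invented for illustration only.
TOY_VOCAB = {"h": 1, "ə": 2, "l": 3, "ˈ": 4, "O": 5, " ": 6}


def phonemes_to_ids(phonemes, vocab, pad_id=0):
    """Map each phoneme symbol to its ID, drop unknown symbols, and pad both ends."""
    ids = [vocab[p] for p in phonemes if p in vocab]
    return [pad_id, *ids, pad_id]


# A G2P front end (out of scope here) would produce the phoneme string.
ids = phonemes_to_ids("həlˈO", TOY_VOCAB)
```

Under this split, a C++ port would only need to accept a phoneme string (or the ID sequence directly) and could leave text normalization and G2P to an external tool.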
I am currently seeking a c++ implementation.
@namhkoh
We added Kokoro support to sherpa-onnx a long time ago.
It provides C++ APIs for Kokoro v0.19 and Kokoro 1.0, and it also supports 11 other programming languages, e.g., C, Java, Kotlin, Swift, Dart, C#, Go, JavaScript, Object Pascal, Python.
You can find the usage doc at https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/kokoro.html
Is there any update on this?
+1
+1
+1
+1
+1