Support for modern TTS models for various languages
Expected Behavior
Support for modern TTS models for various languages without the need for external TTS APIs.
Actual Behavior
Proposal
Consider giving Silero TTS models a go. These are published under an open license assuming non-commercial / personal usage. Please see our TTS models here - https://github.com/snakers4/silero-models#text-to-speech (corresponding article https://habr.com/ru/post/549482/).
Most importantly, our TTS models can run decently on one CPU thread / core and depend mostly on PyTorch.
Just let me repost some of the benchmarks here:
- RTF (Real Time Factor) - time the synthesis takes divided by the audio duration;
- RTS = 1 / RTF (Real Time Speed) - how much faster than real time the synthesis is;
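For clarity, the two metrics are simple reciprocals of each other; a minimal sketch (the numbers are illustrative, not taken from the benchmark):

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real Time Factor: synthesis time divided by audio duration."""
    return synthesis_seconds / audio_seconds

def rts(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real Time Speed: how much faster than real time the synthesis is."""
    return audio_seconds / synthesis_seconds

# Example: 0.7 s to synthesize 1.0 s of audio
print(rtf(0.7, 1.0))  # 0.7
print(rts(0.7, 1.0))  # ~1.43, i.e. ~1.4x faster than real time
```

An RTF below 1 (equivalently, RTS above 1) means the model synthesizes speech faster than it takes to play it back.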
We benchmarked the models on two devices using PyTorch 1.8 utils:
- CPU - Intel i7-6800K CPU @ 3.40GHz;
- GPU - 1080 Ti;
- When measuring CPU performance, we also limited the number of threads used.
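The measurement loop behind such numbers might look like the following sketch. Here `synthesize` is a hypothetical stand-in for the actual model call (with PyTorch you would additionally call `torch.set_num_threads(n)` to limit the CPU threads used):

```python
import time

def synthesize(text: str) -> float:
    """Hypothetical stand-in for a TTS model call; returns the duration
    (in seconds) of the audio it would produce."""
    time.sleep(0.01)                  # pretend the model takes some time
    return 0.05 * len(text.split())   # pretend ~50 ms of audio per word

def measure_rtf(texts: list[str]) -> float:
    """Time the synthesis of a batch of texts and return the RTF."""
    start = time.perf_counter()
    audio_seconds = sum(synthesize(t) for t in texts)
    synthesis_seconds = time.perf_counter() - start
    return synthesis_seconds / audio_seconds

print(measure_rtf(["the quick brown fox jumps over the lazy dog"]))
```

Averaging this over many utterances, per device and per thread count, yields the tables below.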
For the 16KHz models we got the following metrics:
| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.7   | 1.4   |
| 1         | CPU 2 threads | 0.4   | 2.3   |
| 1         | CPU 4 threads | 0.3   | 3.1   |
| 4         | CPU 1 thread  | 0.5   | 2.0   |
| 4         | CPU 2 threads | 0.3   | 3.2   |
| 4         | CPU 4 threads | 0.2   | 4.9   |
| 1         | GPU           | 0.06  | 16.9  |
| 4         | GPU           | 0.02  | 51.7  |
| 8         | GPU           | 0.01  | 79.4  |
| 16        | GPU           | 0.008 | 122.9 |
| 32        | GPU           | 0.006 | 161.2 |
For the 8KHz models we got the following metrics:
| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.5   | 1.9   |
| 1         | CPU 2 threads | 0.3   | 3.0   |
| 1         | CPU 4 threads | 0.2   | 4.2   |
| 4         | CPU 1 thread  | 0.4   | 2.8   |
| 4         | CPU 2 threads | 0.2   | 4.4   |
| 4         | CPU 4 threads | 0.1   | 6.6   |
| 1         | GPU           | 0.06  | 17.5  |
| 4         | GPU           | 0.02  | 55.0  |
| 8         | GPU           | 0.01  | 92.1  |
| 16        | GPU           | 0.007 | 147.7 |
| 32        | GPU           | 0.004 | 227.5 |
Also, please note that this is just a V1 release; the models will be much faster in the future.
Hello @snakers4 :wave:,
Thanks for the suggestion, it looks promising!
May I know if you have any Node.js binding? Leon's core is built on top of Node.js.
We base our models on PyTorch and / or ONNX. As far as I know, there are no actively maintained Node.js bindings for PyTorch. There are some for ONNX, but we have not yet been able to port our TTS models to ONNX.
Internally, in such cases (where the controlling app and the inference engine are not the same), we just use RabbitMQ communication with the model in a separate container.
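The request / reply pattern described above can be sketched as follows. This is only an illustration: in production the two queues would live in a broker such as RabbitMQ (e.g. via the `pika` client), with the TTS worker in its own container; here stdlib queues and a thread stand in for both, and the "model" is faked.

```python
import queue
import threading

# Stand-ins for two RabbitMQ queues (requests in, replies out).
requests: queue.Queue = queue.Queue()
replies: queue.Queue = queue.Queue()

def tts_worker() -> None:
    """Model container: consume texts, publish synthesized 'audio'."""
    while True:
        text = requests.get()
        if text is None:  # shutdown sentinel
            break
        # A real worker would run the TTS model here; we fake the result.
        replies.put(f"<audio for: {text}>")

worker = threading.Thread(target=tts_worker, daemon=True)
worker.start()

# Controlling app (e.g. Leon's core) publishes a request, awaits the reply.
requests.put("Hello world")
reply = replies.get(timeout=5)
print(reply)  # <audio for: Hello world>

requests.put(None)  # stop the worker
worker.join()
```

The advantage of this design is that the inference engine's language and runtime become irrelevant to the controlling app; only the message schema is shared.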
I see. For the moment, Leon does not rely on a broker for such operations but directly on Node.js bindings. However, it can be a good path to explore. I'll add it to the roadmap.
For reference, do you have any online demo of the output that you can share?
Please see this article, which has plenty of audio samples: https://habr.com/ru/post/549482/. Or just use the Colab notebook: https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb
Since we do not have web developers, we do not actively develop fully online web demos. The Colab can be considered "online" since it runs in real time in a notebook.
@snakers4 is it possible to have these voices rendered as browser-compatible voices? Then there would be no issue with which backend had been used to create the voices, right?
Hi,
I could not really understand from their example where / how the actual speech synthesis is run / stored.
There is example code here - https://github.com/mdn/web-speech-api/tree/master/speak-easy-synthesis - but I do not see any models there.