Support for modern TTS models for various languages
Expected Behavior
Support for modern TTS models for various languages without the need for external TTS APIs.
Actual Behavior
Proposal
Consider giving Silero TTS models a go. These are published under an open license assuming non-commercial / personal usage. Please see our TTS models here - https://github.com/snakers4/silero-models#text-to-speech (corresponding article https://habr.com/ru/post/549482/).
Most importantly, our TTS models can run decently on one CPU thread / core and depend mostly on PyTorch.
Just let me repost some of the benchmarks here:
- RTF (Real Time Factor) - time the synthesis takes divided by the audio duration;
- RTS = 1 / RTF (Real Time Speed) - how much faster than real time the synthesis is;
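For clarity, the two metrics are simple reciprocals of each other; a minimal sketch (the numbers are illustrative, not taken from the benchmark):

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real Time Factor: synthesis time divided by audio duration."""
    return synthesis_seconds / audio_seconds

def rts(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real Time Speed: how much faster than real time the synthesis is."""
    return audio_seconds / synthesis_seconds

# Example: 0.7 s to synthesize 1.0 s of audio
print(rtf(0.7, 1.0))  # 0.7
print(rts(0.7, 1.0))  # ~1.43, i.e. ~1.4x faster than real time
```

An RTF below 1 (equivalently, RTS above 1) means the model synthesizes speech faster than it takes to play it back.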
We benchmarked the models on two devices using PyTorch 1.8 utils:
- CPU - Intel i7-6800K CPU @ 3.40GHz;
- GPU - 1080 Ti;
- When measuring CPU performance, we also limited the number of threads used.
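The measurement loop behind such numbers might look like the following sketch. Here `synthesize` is a hypothetical stand-in for the actual model call (with PyTorch you would additionally call `torch.set_num_threads(n)` to limit the CPU threads used):

```python
import time

def synthesize(text: str) -> float:
    """Hypothetical stand-in for a TTS model call; returns the duration
    (in seconds) of the audio it would produce."""
    time.sleep(0.01)                  # pretend the model takes some time
    return 0.05 * len(text.split())   # pretend ~50 ms of audio per word

def measure_rtf(texts: list[str]) -> float:
    """Time the synthesis of a batch of texts and return the RTF."""
    start = time.perf_counter()
    audio_seconds = sum(synthesize(t) for t in texts)
    synthesis_seconds = time.perf_counter() - start
    return synthesis_seconds / audio_seconds

print(measure_rtf(["the quick brown fox jumps over the lazy dog"]))
```

Averaging this over many utterances, per device and per thread count, yields the tables below.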
For the 16KHz models we got the following metrics:
| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.7   | 1.4   |
| 1         | CPU 2 threads | 0.4   | 2.3   |
| 1         | CPU 4 threads | 0.3   | 3.1   |
| 4         | CPU 1 thread  | 0.5   | 2.0   |
| 4         | CPU 2 threads | 0.3   | 3.2   |
| 4         | CPU 4 threads | 0.2   | 4.9   |
| 1         | GPU           | 0.06  | 16.9  |
| 4         | GPU           | 0.02  | 51.7  |
| 8         | GPU           | 0.01  | 79.4  |
| 16        | GPU           | 0.008 | 122.9 |
| 32        | GPU           | 0.006 | 161.2 |
For the 8KHz models we got the following metrics:
| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.5   | 1.9   |
| 1         | CPU 2 threads | 0.3   | 3.0   |
| 1         | CPU 4 threads | 0.2   | 4.2   |
| 4         | CPU 1 thread  | 0.4   | 2.8   |
| 4         | CPU 2 threads | 0.2   | 4.4   |
| 4         | CPU 4 threads | 0.1   | 6.6   |
| 1         | GPU           | 0.06  | 17.5  |
| 4         | GPU           | 0.02  | 55.0  |
| 8         | GPU           | 0.01  | 92.1  |
| 16        | GPU           | 0.007 | 147.7 |
| 32        | GPU           | 0.004 | 227.5 |
Also, please note that this is just a V1 release; the models will be much faster in the future.
Hello @snakers4 :wave:,
Thanks for the suggestion, it looks promising!
May I know if you have any Node.js binding? Leon's core is built on top of Node.js.
We base our models on PyTorch and / or ONNX. As far as I know, there are no actively maintained Node.js bindings for PyTorch. There are some for ONNX, but we have not yet been able to port our TTS models to ONNX.
Internally, in such cases (where the controlling app and the inference engine are not the same), we just use RabbitMQ communication with the model in a separate container.
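The request / reply pattern described above can be sketched as follows. This is only an illustration: in production the two queues would live in a broker such as RabbitMQ (e.g. via the `pika` client), with the TTS worker in its own container; here stdlib queues and a thread stand in for both, and the "model" is faked.

```python
import queue
import threading

# Stand-ins for two RabbitMQ queues (requests in, replies out).
requests: queue.Queue = queue.Queue()
replies: queue.Queue = queue.Queue()

def tts_worker() -> None:
    """Model container: consume texts, publish synthesized 'audio'."""
    while True:
        text = requests.get()
        if text is None:  # shutdown sentinel
            break
        # A real worker would run the TTS model here; we fake the result.
        replies.put(f"<audio for: {text}>")

worker = threading.Thread(target=tts_worker, daemon=True)
worker.start()

# Controlling app (e.g. Leon's core) publishes a request, awaits the reply.
requests.put("Hello world")
reply = replies.get(timeout=5)
print(reply)  # <audio for: Hello world>

requests.put(None)  # stop the worker
worker.join()
```

The advantage of this design is that the inference engine's language and runtime become irrelevant to the controlling app; only the message schema is shared.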
I see. For the moment, Leon does not rely on a broker for such operations but directly on Node.js bindings. However, it can be a good path to explore. I'll add it to the roadmap.
For reference, do you have any online demo of the output that you can share?
Please see this article, which has plenty of audio samples: https://habr.com/ru/post/549482/. Or just use the Colab notebook: https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb
Since we do not have web developers, we do not actively develop fully online web demos. The Colab can be considered "online" since it runs in real time in a notebook.
@snakers4 is it possible to have these voices rendered as browser-compatible voices? Then there would be no issue with which backend had been used to create the voices, right?
Hi,
I could not really understand from their example where / how the actual speech synthesis is run / stored.
There is example code here - https://github.com/mdn/web-speech-api/tree/master/speak-easy-synthesis - but I do not see any models there.