
Support for modern TTS models for various languages

Open snakers4 opened this issue 4 years ago • 10 comments

Expected Behavior

Support for modern TTS models for various languages without the need for external TTS APIs.

Actual Behavior

Link

Proposal

Consider giving the Silero TTS models a go. They are published under an open license that permits non-commercial / personal usage. Please see our TTS models here - https://github.com/snakers4/silero-models#text-to-speech (corresponding article: https://habr.com/ru/post/549482/).

Most importantly, our TTS models run decently on a single CPU thread / core and depend mostly on PyTorch alone.

Let me repost some of the benchmarks here:

  • RTF (Real Time Factor) - the time the synthesis takes divided by the audio duration;

  • RTS = 1 / RTF (Real Time Speed) - how much faster than real time the synthesis is;
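The two metrics above are straightforward to derive from a timed synthesis run. Here is a minimal sketch in plain Python; the function names and the stand-in synthesizer are illustrative, not part of the Silero API:

```python
import time

def measure_rtf(synthesize, text, sample_rate=16000):
    """Time a synthesis call and derive RTF / RTS.

    `synthesize` is any callable that returns audio samples for `text`.
    """
    start = time.perf_counter()
    audio = synthesize(text)          # list/array of audio samples
    elapsed = time.perf_counter() - start
    audio_duration = len(audio) / sample_rate
    rtf = elapsed / audio_duration    # < 1.0 means faster than real time
    rts = 1.0 / rtf                   # speed-up factor over real time
    return rtf, rts
```

For example, a run that takes 0.7 s to produce 1 s of audio gives RTF = 0.7 and RTS ≈ 1.4, matching the first CPU row in the tables below.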

We benchmarked the models on two devices using PyTorch 1.8 utilities:

  • CPU - Intel i7-6800K CPU @ 3.40GHz;

  • GPU - 1080 Ti;

  • When measuring CPU performance, we also limited the number of threads used;

For the 16 kHz models we got the following metrics:

| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.7   | 1.4   |
| 1         | CPU 2 threads | 0.4   | 2.3   |
| 1         | CPU 4 threads | 0.3   | 3.1   |
| 4         | CPU 1 thread  | 0.5   | 2.0   |
| 4         | CPU 2 threads | 0.3   | 3.2   |
| 4         | CPU 4 threads | 0.2   | 4.9   |
| 1         | GPU           | 0.06  | 16.9  |
| 4         | GPU           | 0.02  | 51.7  |
| 8         | GPU           | 0.01  | 79.4  |
| 16        | GPU           | 0.008 | 122.9 |
| 32        | GPU           | 0.006 | 161.2 |

For the 8 kHz models we got the following metrics:

| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.5   | 1.9   |
| 1         | CPU 2 threads | 0.3   | 3.0   |
| 1         | CPU 4 threads | 0.2   | 4.2   |
| 4         | CPU 1 thread  | 0.4   | 2.8   |
| 4         | CPU 2 threads | 0.2   | 4.4   |
| 4         | CPU 4 threads | 0.1   | 6.6   |
| 1         | GPU           | 0.06  | 17.5  |
| 4         | GPU           | 0.02  | 55.0  |
| 8         | GPU           | 0.01  | 92.1  |
| 16        | GPU           | 0.007 | 147.7 |
| 32        | GPU           | 0.004 | 227.5 |

snakers4 avatar Apr 02 '21 06:04 snakers4

Also, please note that this is just a V1 release; the models will be much faster in the future.

snakers4 avatar Apr 02 '21 06:04 snakers4

Hello @snakers4 :wave:,

Thanks for the suggestion, it looks promising!

May I know if you have any Node.js bindings? Leon's core is built on top of Node.js.

louistiti avatar Apr 02 '21 08:04 louistiti

We base our models on PyTorch and / or ONNX. As far as I know, there are no actively maintained Node.js bindings for PyTorch. There are some for ONNX, but we have not yet been able to port our TTS models to ONNX.

snakers4 avatar Apr 02 '21 08:04 snakers4

Internally, in such cases (where the controlling app and the inference engine are not the same), we just use RabbitMQ communication with the model in a separate container.
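For illustration, the split described above could look like this: the controlling app serializes a TTS request to JSON and publishes it to a queue, and the model container parses it, runs synthesis, and replies with the rendered audio. The sketch below covers only the payload layer, using Python's standard library; the field names and speaker value are hypothetical (not taken from Silero or Leon), and in practice the messages would be published via a RabbitMQ client such as pika:

```python
import json

# Hypothetical payload the controlling app publishes to a
# "tts_requests" queue (names are illustrative only).
def build_tts_request(text, speaker="kseniya_16khz", sample_rate=16000):
    return json.dumps({
        "text": text,
        "speaker": speaker,
        "sample_rate": sample_rate,
    })

# The model container would decode the message before synthesis,
# then publish the resulting audio on a reply queue.
def parse_tts_request(raw):
    msg = json.loads(raw)
    return msg["text"], msg["speaker"], msg["sample_rate"]
```

The upside of this design is that the Node.js core never needs a PyTorch binding; it only needs a broker client and an agreed message schema.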

snakers4 avatar Apr 02 '21 08:04 snakers4

I see. For the moment, Leon does not rely on a broker for such operations but directly on Node.js bindings. However, it can be a good path to explore; I'll add it to the roadmap.

louistiti avatar Apr 02 '21 08:04 louistiti

For reference, do you have any online demo of the output that you can share?

louistiti avatar Apr 02 '21 08:04 louistiti

Please see this article - it has plenty of audio samples - https://habr.com/ru/post/549482/. Or just use the Colab - https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb

Since we do not have web developers, we do not actively develop fully online web demos. The Colab can be considered "online" since it works in real time in a notebook.

snakers4 avatar Apr 02 '21 08:04 snakers4

@snakers4 is it possible to have these voices rendered as browser-compatible voices? Then there would be no issue with which backend had been used to create the voices, right?

jankapunkt avatar May 04 '21 10:05 jankapunkt

Hi,

I could not really understand from that example where / how the actual speech synthesis is run / stored.

There is example code here - https://github.com/mdn/web-speech-api/tree/master/speak-easy-synthesis - but I do not see any models there.

snakers4 avatar May 04 '21 10:05 snakers4