Dragonfire
Support for modern TTS models for various languages
Proposal
Consider giving Silero TTS models a go. They are published under an open license for non-commercial / personal usage. You can find our TTS models here: https://github.com/snakers4/silero-models#text-to-speech (corresponding article: https://habr.com/ru/post/549482/).
Most importantly, our TTS models run decently on a single CPU thread / core and depend mostly on PyTorch alone.
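For concreteness, loading a model through `torch.hub` looks roughly like this. This is a hedged sketch based on the silero-models README: the hub entry-point name (`silero_tts`), the speaker ids (`v3_en`, `en_0`), and the supported sample rates differ between model versions, so treat those specifics as assumptions to check against the repository.

```python
def synthesize_with_silero(text):
    """Sketch: synthesize `text` with a Silero TTS model on one CPU thread.

    The model name ('silero_tts'), speaker ids ('v3_en', 'en_0'), and the
    48000 Hz sample rate follow the silero-models README and may change
    between releases.
    """
    import torch

    torch.set_num_threads(1)  # limit inference to a single CPU thread

    # torch.hub downloads the model on first call and caches it locally.
    model, example_text = torch.hub.load(
        repo_or_dir="snakers4/silero-models",
        model="silero_tts",
        language="en",
        speaker="v3_en",
    )
    model.to(torch.device("cpu"))

    # Returns a 1-D tensor with the synthesized waveform.
    return model.apply_tts(text=text, speaker="en_0", sample_rate=48000)
```

Usage would simply be `audio = synthesize_with_silero("Hello!")`; no dependencies beyond PyTorch (plus the small packages the hub entry point pulls in) are needed.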
Just let me repost some of the benchmarks here:
- RTF (Real Time Factor): the time the synthesis takes divided by the audio duration;
- RTS = 1 / RTF (Real Time Speed): how much "faster" than realtime the synthesis is.
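As a small illustration of these two metrics, a timing helper along these lines could be used (a minimal sketch; `synthesize` stands in for any TTS callable that returns audio at a known sample rate):

```python
import time


def measure_rtf(synthesize, text, sample_rate):
    """Time a TTS callable and return (RTF, RTS).

    `synthesize` is any function mapping text to a 1-D audio sequence
    sampled at `sample_rate` Hz (a stand-in for a real TTS model).
    """
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start

    audio_duration = len(audio) / sample_rate  # seconds of audio produced
    rtf = elapsed / audio_duration             # synthesis time / audio duration
    rts = 1.0 / rtf                            # how much faster than realtime
    return rtf, rts
```

For example, an RTF of 0.7 (RTS 1.4) means the model produces roughly 1.4 seconds of audio per second of compute.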
We benchmarked the models on two devices using PyTorch 1.8 utils:
- CPU: Intel i7-6800K @ 3.40GHz;
- GPU: 1080 Ti;
- when measuring CPU performance, we also limited the number of threads used.
For the 16KHz models we got the following metrics:
| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.7   | 1.4   |
| 1         | CPU 2 threads | 0.4   | 2.3   |
| 1         | CPU 4 threads | 0.3   | 3.1   |
| 4         | CPU 1 thread  | 0.5   | 2.0   |
| 4         | CPU 2 threads | 0.3   | 3.2   |
| 4         | CPU 4 threads | 0.2   | 4.9   |
| 1         | GPU           | 0.06  | 16.9  |
| 4         | GPU           | 0.02  | 51.7  |
| 8         | GPU           | 0.01  | 79.4  |
| 16        | GPU           | 0.008 | 122.9 |
| 32        | GPU           | 0.006 | 161.2 |
For the 8KHz models we got the following metrics:
| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.5   | 1.9   |
| 1         | CPU 2 threads | 0.3   | 3.0   |
| 1         | CPU 4 threads | 0.2   | 4.2   |
| 4         | CPU 1 thread  | 0.4   | 2.8   |
| 4         | CPU 2 threads | 0.2   | 4.4   |
| 4         | CPU 4 threads | 0.1   | 6.6   |
| 1         | GPU           | 0.06  | 17.5  |
| 4         | GPU           | 0.02  | 55.0  |
| 8         | GPU           | 0.01  | 92.1  |
| 16        | GPU           | 0.007 | 147.7 |
| 32        | GPU           | 0.004 | 227.5 |