
int8 quantized TTS model slower than fp32

Open martinshkreli opened this issue 1 year ago • 10 comments

fp32 model:

(myenv) ubuntu@152:~/sherpa-onnx/python_api_examples$ python3 test.py
Elapsed: 0.080
Saved sentence_0.wav.
Elapsed: 0.085
Saved sentence_1.wav.
Elapsed: 0.080
Saved sentence_2.wav.
Elapsed: 0.074
Saved sentence_3.wav.
Elapsed: 0.054
Saved sentence_4.wav.
Elapsed: 0.081
Saved sentence_5.wav.
Elapsed: 0.067

int8 model:

(myenv) ubuntu@152-69-195-75:~/sherpa-onnx/python_api_examples$ python3 test.py
Elapsed: 19.561
Saved sentence_0.wav.
Elapsed: 26.432
Saved sentence_1.wav.
Elapsed: 27.989
Saved sentence_2.wav.
Elapsed: 23.956
Saved sentence_3.wav.
Elapsed: 11.361
Saved sentence_4.wav.
Elapsed: 27.825
Saved sentence_5.wav.
Elapsed: 19.567

Is there any special flag to set to use int8?
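
For context, a minimal sketch of the kind of timing loop that produces output like the above, assuming the standard sherpa-onnx Python TTS API; the model paths and test sentences are placeholders, not the contents of the actual test.py:

```python
import time

import sherpa_onnx
import soundfile as sf

# Build an offline TTS engine from the vits-ljs files (paths are placeholders).
config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        vits=sherpa_onnx.OfflineTtsVitsModelConfig(
            model="./vits-ljs/vits-ljs.onnx",  # or ./vits-ljs/vits-ljs.int8.onnx
            lexicon="./vits-ljs/lexicon.txt",
            tokens="./vits-ljs/tokens.txt",
        ),
        num_threads=1,
    ),
)
tts = sherpa_onnx.OfflineTts(config)

sentences = ["The first test sentence.", "The second test sentence."]
for i, text in enumerate(sentences):
    start = time.time()
    audio = tts.generate(text)  # synthesize; returns samples and a sample rate
    print(f"Elapsed: {time.time() - start:.3f}")
    sf.write(f"sentence_{i}.wav", audio.samples, samplerate=audio.sample_rate)
    print(f"Saved sentence_{i}.wav.")
```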

martinshkreli avatar Feb 07 '24 00:02 martinshkreli

Hi, Martin Shkreli! Fangjun will get back to you about this, but we might need more hardware info and details about what differed between those two runs.

danpovey avatar Feb 07 '24 02:02 danpovey

@martinshkreli

Could you describe how you get the int8 models?

csukuangfj avatar Feb 07 '24 02:02 csukuangfj

Hi guys, thanks again for the wonderful repo. I followed this link to download the model: https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html#download-the-model

Then, I used that file (vits-ljs.int8.onnx) for inference in the Python script (offline-tts.py). This was on an 8xA100 instance.

martinshkreli avatar Feb 12 '24 14:02 martinshkreli

> @martinshkreli
>
> Could you describe how you get the int8 models?

Hi Fangjun, I just wanted to try to get your attention one more time. Sorry if I am being annoying!

martinshkreli avatar Feb 16 '24 01:02 martinshkreli

The int8 model is obtained via the following code https://github.com/k2-fsa/sherpa-onnx/blob/d7717628689b051b4c9bffd8d43f3e074388e2d7/scripts/vits/export-onnx-ljs.py#L204-L208

Note that it uses https://github.com/k2-fsa/sherpa-onnx/blob/d7717628689b051b4c9bffd8d43f3e074388e2d7/scripts/vits/export-onnx-ljs.py#L207
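
The referenced lines amount to a dynamic-quantization call along these lines (a sketch of the linked code, with file names as placeholders):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Weight-only dynamic quantization of the fp32 export,
# using unsigned int8 (quint8) weights as in the linked script.
quantize_dynamic(
    model_input="vits-ljs.onnx",
    model_output="vits-ljs.int8.onnx",
    weight_type=QuantType.QUInt8,
)
```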

It is a known issue with onnxruntime that quint8 is slower.

For instance, if you search on Google, you can find similar issues:

  • https://github.com/microsoft/onnxruntime/issues/12854
  • https://github.com/microsoft/onnxruntime/issues/6732
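
One workaround discussed in threads like those is to re-quantize with signed int8 weights instead of quint8; a hypothetical sketch (file names are placeholders, and any speedup depends on the CPU):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Hypothetical alternative: signed int8 (qint8) weights instead of quint8.
quantize_dynamic(
    model_input="vits-ljs.onnx",
    model_output="vits-ljs.qint8.onnx",
    weight_type=QuantType.QInt8,
)
```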

csukuangfj avatar Feb 16 '24 12:02 csukuangfj

Fangjun, is the int8 model intended for different applications or devices, then?


danpovey avatar Feb 17 '24 05:02 danpovey

The int8 model mentioned in this issue is about 4x smaller in file size than the float32 model.

If memory matters, the int8 model is preferred.
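
A quick way to check the size difference locally (a sketch; the paths assume the vits-ljs download linked above):

```python
import os

# Print the on-disk sizes of the fp32 and int8 models in MB.
for name in ["vits-ljs/vits-ljs.onnx", "vits-ljs/vits-ljs.int8.onnx"]:
    print(f"{name}: {os.path.getsize(name) / 2**20:.1f} MB")
```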

csukuangfj avatar Feb 17 '24 05:02 csukuangfj

Hi @csukuangfj, do you know how to optimize the speed of an int8 model? I was experimenting with it several months ago, but I was not able to convert to qint8, and quint8 is really slow on CPU.

beqabeqa473 avatar Apr 03 '24 07:04 beqabeqa473

You don't need to optimize speed; you need to pick an MB-iSTFT VITS model. They are an order of magnitude faster than raw VITS with the same quality.

nshmyrev avatar Apr 09 '24 19:04 nshmyrev

> You don't need to optimize speed; you need to pick an MB-iSTFT VITS model. They are an order of magnitude faster than raw VITS with the same quality.

Where can we find these models?

smallbraineng avatar Jul 08 '24 19:07 smallbraineng