
New local audio transcription implementation using the wav2vec 2.0 algorithm

Open · lfcnassif opened this issue 2 years ago · 26 comments

Awesome work published in this paper: https://arxiv.org/pdf/2107.11414.pdf

Scripts, dataset references, and models are in this repo: https://github.com/lucasgris/wav2vec4bp

400 hours of pt-BR audio were used for training!

Average WER for pt-BR is between 10.5% and 12.4% on the tested datasets!

lfcnassif avatar Jul 08 '22 00:07 lfcnassif

It is possible to test the transcription by sending audios to these sites: https://huggingface.co/lgris/bp_400h_xlsr2_300M https://huggingface.co/lgris/bp400-xlsr

edited: IMHO the first uses a language model for Portuguese, while the second uses no LM, so it tends to transcribe more phonetically (possibly returning words that don't exist in pt-BR, but I think it can also find words outside the pt-BR language model).
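For a quick local test, something like this minimal sketch using the HuggingFace transformers pipeline should also work (model id taken from the first link above; "sample.wav" is a hypothetical 16kHz mono file you would supply yourself):

```python
# Minimal local test of a wav2vec2 model via the transformers pipeline.
# Assumes: pip install transformers torch; "sample.wav" is a placeholder.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="lgris/bp_400h_xlsr2_300M")
result = asr("sample.wav")  # path to a local audio file
print(result["text"])
```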

lfcnassif avatar Jul 08 '22 01:07 lfcnassif

Just found this possibly better model, using 1B params for the Portuguese language (the first above uses 300M); no WER reported for now: https://huggingface.co/lgris/bp_400_xlsr2_1B

PS: all models seem to be Apache licensed :-)

lfcnassif avatar Jul 08 '22 05:07 lfcnassif

The author just added an MIT license to his repo after I kindly asked him to clarify it :-)

lfcnassif avatar Jul 09 '22 04:07 lfcnassif

Transcription time running on an i5-8350U CPU (8 logical cores, 1.7-1.9 GHz) over 80 small WAV audios (2s-4s) from the VoxForge test set: 4m25s.

Roughly 1s of processing per second of audio.

CPU usage was about 50% and RAM usage was about 1.6GB.

lfcnassif avatar Jul 20 '22 02:07 lfcnassif

> CPU usage was about 50%

My fault, I had left my notebook power cable unplugged :-). After plugging it in and setting "max performance" in the energy settings, CPU usage was about 90%-95% and running time dropped almost by half: 2m17s. RAM usage increased to 2-3GB.

Testing on the same 301-audio (~5500s total duration) data set used here (https://github.com/sepinf-inc/IPED/issues/248#issuecomment-1176868838) with the same 48-thread dual-CPU machine, and checking a few transcriptions, accuracy is much better!

But running time increased from 95s with our current Vosk implementation to 1650s - 17 times slower - although just 50% of the dual CPU was used; maybe only one processor was detected by PyTorch... Using 100% of the CPUs might cut the running time in half, but that would still be 8.5x slower, and Vosk is already slow. Not sure if running this new algorithm on CPUs will be acceptable in practice...
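If only one socket is being picked up, it may be worth checking PyTorch's intra-op thread count explicitly; a minimal sketch using PyTorch's standard threading API (48 is the logical core count of the machine above):

```python
import torch

# How many intra-op threads PyTorch detected for CPU inference
print(torch.get_num_threads())

# Force use of all 48 logical cores of the dual-CPU machine
torch.set_num_threads(48)
```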

lfcnassif avatar Jul 20 '22 04:07 lfcnassif

Just found a ranking of models; the first place is another one using 1B params + LM: https://huggingface.co/spaces/speech-recognition-community-v2/FinalLeaderboard

@tc-wleite, the ranking above made me remember you, heh

lfcnassif avatar Jul 20 '22 18:07 lfcnassif

> https://huggingface.co/spaces/speech-recognition-community-v2/FinalLeaderboard

Just executed the current top pt-BR model in that ranking (https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-portuguese) on the 301-audio data set (~5500s) using the 48-thread dual CPU: 2700s running time, again with just about 50% overall CPU usage.

lfcnassif avatar Jul 20 '22 22:07 lfcnassif

I'm waiting for remote access to an RTX 3090 GPU to measure the inference performance on GPU.

lfcnassif avatar Jul 20 '22 22:07 lfcnassif

@jonatasgrosman also fine-tuned great models for English, Spanish, German, Italian, French and other languages: https://huggingface.co/jonatasgrosman

lfcnassif avatar Jul 21 '22 04:07 lfcnassif

I forgot to mention this awesome repo I found: https://github.com/huggingface/transformers

lfcnassif avatar Jul 22 '22 01:07 lfcnassif

> But running time increased from 95s with our current Vosk implementation to 1650s - 17 times slower - although just 50% of the dual CPU was used

Running the transcription on that 301-audio data set from the new IPED task took 1280s using 1 CPU (I think the conversion to WAV was parallelized) and 750s using both CPUs.

lfcnassif avatar Jul 22 '22 19:07 lfcnassif

> It is possible to test the transcription by sending audios to these sites: https://huggingface.co/lgris/bp_400h_xlsr2_300M https://huggingface.co/lgris/bp400-xlsr
>
> edited: IMHO the first uses a language model for Portuguese, while the second uses no LM, so it tends to transcribe more phonetically (possibly returning words that don't exist in pt-BR, but I think it can also find words outside the pt-BR language model).

Actually, that is wrong. The first model was fine-tuned from this Facebook base model, pretrained on 128 languages, 300M params: https://huggingface.co/facebook/wav2vec2-xls-r-300m

And the second used this model, pretrained on 53 languages (I think it has a similar number of params, given the model size): https://huggingface.co/facebook/wav2vec2-large-xlsr-53

I also found this one, based on the same pretrained model above: https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-portuguese but it was fine-tuned using just one Portuguese data set, while lgris's models quoted at the top of this message used several Portuguese data sets.

And the top-ranked model below used the Facebook 128-language pretrained model with 1B params: https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-portuguese It also used 4 different Portuguese data sets for fine-tuning.

So, I "think" the best choices for portuguese are the smaller lgris/bp_400h_xlsr2_300M and the bigger (slower and more memory hungry) jonatasgrosman/wav2vec2-xls-r-1b-portuguese

lfcnassif avatar Jul 23 '22 15:07 lfcnassif

All models above can optionally be used with or without a language model.
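For reference, a hedged sketch of decoding with an n-gram LM through transformers' Wav2Vec2ProcessorWithLM; it assumes the model repo actually ships LM files and that pyctcdecode and kenlm are installed ("sample.wav" is a hypothetical 16kHz mono file):

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model_id = "lgris/bp_400h_xlsr2_300M"  # assumption: repo includes LM files
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

speech, sr = sf.read("sample.wav")  # hypothetical 16kHz mono file
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# batch_decode runs the beam search with the LM over the raw logits
print(processor.batch_decode(logits.numpy()).text[0])
```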

lfcnassif avatar Jul 23 '22 15:07 lfcnassif

Another model (and paper), trained on more spontaneous (interviews, conversations...) and noisy audios according to the authors (in contrast with read or prepared speech): https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese

Although its WER on Common Voice is higher compared to the others, it might generalize better to other domains.

lfcnassif avatar Jul 25 '22 00:07 lfcnassif

> Running the transcription on that 301-audio data set from the new IPED task took 1280s using 1 CPU (I think the conversion to WAV was parallelized) and 750s using both CPUs.

@arisjr just ran the lgris/bp_400h_xlsr2_300M model on the 301-audio dataset using our RTX 3090 GPU, and running time dropped to just 48s! So about 15x faster than running on our 2 CPUs.

lfcnassif avatar Jul 27 '22 14:07 lfcnassif

> Running the transcription on that 301-audio data set from the new IPED task took 1280s using 1 CPU (I think the conversion to WAV was parallelized) and 750s using both CPUs.
>
> @arisjr just ran the lgris/bp_400h_xlsr2_300M model on the 301-audio dataset using our RTX 3090 GPU, and running time dropped to just 48s! So about 15x faster than running on our 2 CPUs.

And the jonatasgrosman/wav2vec2-xls-r-1b-portuguese model took 71s on the RTX 3090. Thank you, @arisjr!
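For reference, a minimal sketch of how inference can be moved to the GPU with the transformers pipeline (the device index and file name are illustrative):

```python
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1  # 0 = first CUDA GPU, -1 = CPU
asr = pipeline("automatic-speech-recognition",
               model="jonatasgrosman/wav2vec2-xls-r-1b-portuguese",
               device=device)
print(asr("sample.wav")["text"])  # "sample.wav" is a placeholder file
```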

lfcnassif avatar Jul 27 '22 14:07 lfcnassif

For some days I've been running several evaluation tests using the WER metric with the models above + Vosk on some public pt-BR data sets I downloaded, without using a language model for now. The results so far (lower is better):

| Model \ Dataset | Lapsbm | Voxforge | SID | MLS | TEDx | CORAA | Average | Weighted Average |
|---|---|---|---|---|---|---|---|---|
| Test set duration (h) | 0.1 | 0.1 | 1 | 3.6 | 1.8 | 10.7 | 2.883 | - |
| vosk-model-small-pt-0.3 | 0.195 | 0.325 | 0.264 | 0.370 | 0.445 | 0.655 | 0.376 | 0.547 |
| jonatasgrosman/wav2vec2-large-xlsr-53-portuguese | 0.161 | 0.234 | 0.214 | 0.205 | 0.439 | 0.619 | 0.312 | 0.486 |
| jonatasgrosman/wav2vec2-xls-r-1b-portuguese | 0.065 | 0.111 | 0.119 | 0.103 | 0.234 | 0.293 | 0.154 | 0.235 |
| lgris/bp_400h_xlsr2_300M | 0.074 | 0.119 | 0.122 | 0.111 | 0.247 | 0.401 | 0.179 | 0.305 |
| Edresson/wav2vec2-large-xlsr-coraa-portuguese | 0.110 | 0.189 | 0.168 | 0.162 | 0.321 | 0.251 | 0.200 | 0.234 |

I'm still running the evaluation on the Common Voice test set; it will take hours. I'll update the results when finished.
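For reproducibility, WER numbers like these can be computed with the jiwer library; a minimal sketch (the reference/hypothesis pair is illustrative):

```python
import jiwer

reference = "o rato roeu a roupa do rei de roma"
hypothesis = "o rato roeu a ropa do rei de roma"  # one substituted word
print(jiwer.wer(reference, hypothesis))  # 1 error / 9 words ~ 0.111
```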

lfcnassif avatar Aug 03 '22 20:08 lfcnassif

Results updated with the Common Voice test set. I also fixed the TEDx test set duration, since I used a version larger than the one reported by the other project:

[image: updated WER results table, including Common Voice]

I also painted yellow the cells of those sets whose train/dev subsets were used to train each model.

PS1: I don't know which data sets were used to train vosk-model-small-pt-0.3.

PS2: lgris/bp_400h_xlsr2_300M also used other data sets in training, which I considered too small or "very easy" to transcribe. Actually, Lapsbm and Voxforge are very small, but since I used them in the initial tests, I decided to keep them in the final report.

Given the results, and since the Edresson/wav2vec2-large-xlsr-coraa-portuguese model was trained using just one data set (CORAA, a difficult one, together with TEDx), I think the best models among those tested are:

- lgris/bp_400h_xlsr2_300M (smaller)
- jonatasgrosman/wav2vec2-xls-r-1b-portuguese (larger)

as guessed initially :-)

lfcnassif avatar Aug 04 '22 22:08 lfcnassif

Good news: running the MS Azure transcription on the TEDx test set resulted in WER = 0.226. So it seems we have comparable models, and we aren't using any language model yet :-). From what I have seen, an LM could decrease WER by about 0.02-0.03 on those data sets.

I'll run the Azure implementation on the other test data sets and report here.

lfcnassif avatar Aug 05 '22 01:08 lfcnassif

PS: [edited] the Azure model inserts useful punctuation marks like periods, question marks and commas, and uses uppercase letters after periods and question marks, but I had to remove them and convert the output to lowercase, since the expected texts don't have them.
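A minimal sketch of that normalization step, assuming simple regex-based stripping is enough for these test sets:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so hypotheses match the references."""
    text = text.lower()
    return re.sub(r"[.,;:?!]", "", text).strip()

print(normalize("Tudo bem? Sim, tudo certo."))  # -> "tudo bem sim tudo certo"
```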

lfcnassif avatar Aug 05 '22 01:08 lfcnassif

Including the MS Azure pt-BR model results (standard model, as it is today):

[image: results table including MS Azure]

PS: I took a look at some Azure transcriptions on the (easy) SID test set because its WER seemed a bit high to me: Azure outputs digits (0 1 2...) where the "expected" text spells numbers out (zero um dois...). On the other hand, this or a similar transformation could be improving Azure's results on VoxForge, and a similar situation could be happening with other models on other data sets. As I didn't check all results of all models on all data sets, I left the WER results as they are...

lfcnassif avatar Aug 05 '22 03:08 lfcnassif

Results with the current Google standard pt-BR model:

[image: results table including Google standard]

edited: I'll try their enhanced model for phone calls too.

lfcnassif avatar Aug 05 '22 17:08 lfcnassif

Results including Google phone_call pt-BR model:

[image: results table including Google phone_call]

lfcnassif avatar Aug 05 '22 21:08 lfcnassif

I'll try to gather some real-case audios/transcriptions to build an internal test data set to evaluate those models on, so we can check accuracy on a corpus/domain that was surely not used for training (I think there is a lot of bias in some of those models...).

lfcnassif avatar Aug 05 '22 21:08 lfcnassif

Current list of Google's models:

[image: table of available Google speech-to-text models]

Maybe I'll try latest_long or video too...

lfcnassif avatar Aug 05 '22 21:08 lfcnassif

Results including Google's latest_long pt-BR model; it seems there is no video model for pt-BR today:

[image: results table including Google latest_long]

lfcnassif avatar Aug 07 '22 19:08 lfcnassif

I just finished the wiki manual section about how to enable this new local or remote implementation: https://github.com/sepinf-inc/IPED/wiki/User-Manual#wav2vec2

Please let me know if it needs a better explanation.

lfcnassif avatar Sep 06 '22 22:09 lfcnassif

@lfcnassif, just some quick feedback here: I downloaded 4.1.0 yesterday, used it to process a new case I am working on, and set audio transcription to use this new wav2vec2 algorithm.

Results were really impressive, but as you warned in the configuration file comments and in the Wiki, it is much slower than Vosk when using only the CPU. For my particular case, the total processing time was still fine, as there weren't that many audios.

Setup (on Windows) was pretty straightforward; I just followed the IPED Wiki's instructions. One minor detail: I got an error message "Error testing FFmpeg, is it on path? Audios longer than 1min need it to be transcribed" that I don't remember seeing before (in 4.0.x). It was trivial to fix, though (I just downloaded FFmpeg for Windows and placed it on the path). Maybe this could be included in the setup instructions in the Wiki. Wouldn't it be possible to include an FFmpeg Windows executable in IPED's distribution?

wladimirleite avatar Feb 18 '23 14:02 wladimirleite

Thank you for trying this out so quickly! What model did you use? Jonatasgrosman's large one is better, but of course slower.

We can update the wiki for sure. It is possible to embed FFmpeg; I think its license is OK, but AFAIK it is 40-50MB in size. Actually, I just use FFmpeg to split WAV files; I didn't manage to do that with MPlayer. Do you know if it is possible?

PS: audio splitting is needed only by this new algorithm and by the Google implementation.
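For reference, a hedged sketch of that splitting step using FFmpeg's segment muxer via subprocess (60s chunks; file names are illustrative, and FFmpeg must be on the PATH, which is exactly what the error message above checks):

```python
import subprocess

# Split a long WAV into 60-second chunks without re-encoding.
subprocess.run([
    "ffmpeg", "-i", "long_audio.wav",
    "-f", "segment", "-segment_time", "60",
    "-c", "copy", "chunk_%03d.wav",
], check=True)
```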

lfcnassif avatar Feb 18 '23 15:02 lfcnassif

> What model did you use? Jonatasgrosman's large one is better, but of course slower.

I used the large one. As I said, results were very good, considering that the audios were not easy to transcribe (noisy, lots of slang, and so on).

> We can update the wiki for sure. It is possible to embed FFmpeg; I think its license is OK, but AFAIK it is 40-50MB in size. Actually, I just use FFmpeg to split WAV files; I didn't manage to do that with MPlayer. Do you know if it is possible?

Yes, it would add some extra size to the IPED release. I downloaded a "complete" version, which is even larger (~120 MB). I am not sure, but maybe it is possible to use MPlayer, which would be nice, as it is already used. I am going to check and will let you know if I find a way of using MPlayer instead.

wladimirleite avatar Feb 18 '23 15:02 wladimirleite