IPED
New local audio transcription implementation using the wav2vec 2.0 algorithm
Awesome work published in this paper: https://arxiv.org/pdf/2107.11414.pdf
Scripts, data set references and models are in this repo: https://github.com/lucasgris/wav2vec4bp
400 hours of pt-BR audio were used for training!
Average WER for pt-BR is between 10.5% and 12.4% on the tested datasets!
It is possible to test the transcription by sending audios to these sites: https://huggingface.co/lgris/bp_400h_xlsr2_300M https://huggingface.co/lgris/bp400-xlsr
edited: IMHO the first uses a language model for Portuguese, the second uses no LM, so it tends to transcribe more phonetically (possibly returning non-existent words in pt-BR, but I think it can also find words outside the pt-BR language model used).
Just found this possibly better model, using 1B params, for the Portuguese language (the first above uses 300M); no WER reported for now: https://huggingface.co/lgris/bp_400_xlsr2_1B
PS: all models seem to be Apache licensed :-)
The author just added an MIT license to his repo after I kindly asked him to clarify it :-)
Transcription time running on an i5-8350U CPU (8 logical cores, 1.7-1.9GHz) over 80 small WAV audios (2s-4s) from the Voxforge test set: 4m25s
Roughly 1s of processing per second of audio (those 80 audios add up to roughly 4 minutes of speech, transcribed in 4m25s).
CPU usage was about 50% and RAM usage was about 1.6GB.
> CPU usage was about 50%

My fault, I had left my notebook power cable unplugged :-). After plugging it in and setting "max performance" in the energy settings, CPU usage was about 90%-95% and the running time dropped to almost half: 2m17s. RAM usage increased to 2GB-3GB.
Testing on the same 301-audio (~5500s total) data set used here (https://github.com/sepinf-inc/IPED/issues/248#issuecomment-1176868838), with the same 48-thread dual-CPU machine, and checking a few transcriptions, accuracy is much better!
But running time increased from 95s with our current Vosk implementation to 1650s, 17 times slower, although just 50% of the dual CPU was used; maybe only 1 processor was detected by PyTorch... Using 100% of the CPUs might cut the running time in half, but that would still be 8.5x slower, and Vosk is already slow. Not sure if running this new algorithm on CPUs will be acceptable in practice...
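A minimal sketch of what I mean about the thread count (just a guess on my side that PyTorch's intra-op thread pool is not covering both sockets; the numbers are from this test machine):

```python
import os
import torch

# Check how many intra-op threads PyTorch is actually using.
print("intra-op threads:", torch.get_num_threads())

# Force it to use all logical cores (48 on the dual-CPU machine above).
torch.set_num_threads(os.cpu_count() or 1)
```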
Just found a ranking of models; the first place is another model using 1B params + LM: https://huggingface.co/spaces/speech-recognition-community-v2/FinalLeaderboard
@tc-wleite the ranking above made me remember you :-)
> https://huggingface.co/spaces/speech-recognition-community-v2/FinalLeaderboard
Just executed the current top pt-BR model from that ranking (https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-portuguese) on the 301-audio data set (~5500s) using the 48-thread dual CPU: 2700s running time, again with just about 50% overall CPU usage.
I'm waiting for remote access to an RTX 3090 GPU to measure the inference performance on GPU.
@jonatasgrosman also fine-tuned great models for English, Spanish, German, Italian, French and other languages: https://huggingface.co/jonatasgrosman
I forgot to mention this awesome repo I found: https://github.com/huggingface/transformers
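For anyone wanting to try one of these models locally, a minimal sketch using the transformers ASR pipeline (the audio file name is just an example; 16kHz mono WAV works best):

```python
# Requires: torch, transformers, and ffmpeg on the PATH for audio decoding.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="lgris/bp_400h_xlsr2_300M")
result = asr("some_audio.wav")  # hypothetical local file
print(result["text"])
```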
> But running time increased from 95s with our current Vosk implementation to 1650s, 17 times slower, although just 50% of the dual CPU was used

Running the transcription on that 301-audio data set with the new IPED task took 1280s using 1 CPU (I think the conversion to WAV was parallelized) and 750s using both CPUs.
> It is possible to test the transcription by sending audios to these sites: https://huggingface.co/lgris/bp_400h_xlsr2_300M https://huggingface.co/lgris/bp400-xlsr
> edited: IMHO the first uses a language model for Portuguese, the second uses no LM, so it tends to transcribe more phonetically (possibly returning non-existent words in pt-BR, but I think it can also find words outside the pt-BR language model used).

Actually, that is wrong. The first model used this Facebook base model, pretrained on 128 languages, 300M params, for fine-tuning: https://huggingface.co/facebook/wav2vec2-xls-r-300m
And the second used this model, pretrained on 53 languages (I think it has a similar number of params, given the model size): https://huggingface.co/facebook/wav2vec2-large-xlsr-53
I also found this one, using the same pretrained model above: https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-portuguese but it was fine-tuned using just 1 Portuguese data set, while lgris's models quoted at the top of this message used several Portuguese data sets.
And the top-ranked model below used the Facebook 128-language pre-trained model with 1B params: https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-portuguese It also used 4 different Portuguese data sets for fine-tuning.
So, I "think" the best choices for portuguese are the smaller lgris/bp_400h_xlsr2_300M
and the bigger (slower and more memory hungry) jonatasgrosman/wav2vec2-xls-r-1b-portuguese
All models above can optionally use a language model or not.
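Just to illustrate the LM option: transformers supports CTC decoding with a KenLM language model through pyctcdecode. The sketch below is only illustrative; "some-repo-with-lm" is a placeholder for a model repo that actually ships a language model alongside the acoustic model:

```python
# Requires: torch, transformers, soundfile, pyctcdecode and kenlm installed.
import torch
import soundfile as sf
from transformers import Wav2Vec2ProcessorWithLM, Wav2Vec2ForCTC

processor = Wav2Vec2ProcessorWithLM.from_pretrained("some-repo-with-lm")  # placeholder repo
model = Wav2Vec2ForCTC.from_pretrained("some-repo-with-lm")

speech, sr = sf.read("some_audio.wav")  # hypothetical 16kHz mono file
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The LM-aware processor decodes the logits with beam search + KenLM rescoring.
print(processor.batch_decode(logits.numpy()).text[0])
```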
Another model (and paper), trained on more spontaneous (interviews, conversations...) and noisy audio according to the authors, in contrast with read or prepared speech: https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese
Although its WER on Common Voice is higher compared to the others, it might generalize better to other domains.
> Running the transcription on that 301-audio data set with the new IPED task took 1280s using 1 CPU (I think the conversion to WAV was parallelized) and 750s using both CPUs.

@arisjr just ran the lgris/bp_400h_xlsr2_300M model on the 301-audio dataset using our RTX 3090 GPU, and the running time dropped to just 48s! So 15x faster than running on our 2 x CPUs.
> Running the transcription on that 301-audio data set with the new IPED task took 1280s using 1 CPU (I think the conversion to WAV was parallelized) and 750s using both CPUs.
>
> @arisjr just ran the lgris/bp_400h_xlsr2_300M model on the 301-audio dataset using our RTX 3090 GPU, and the running time dropped to just 48s! So 15x faster than running on our 2 x CPUs.

And the jonatasgrosman/wav2vec2-xls-r-1b-portuguese model took 71s on the RTX 3090. Thank you, @arisjr!
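For reference, moving inference to the GPU with the transformers pipeline is just a matter of passing a device index (a sketch, assuming CUDA and the GPU drivers are available):

```python
from transformers import pipeline

# device=0 selects the first CUDA GPU (e.g. the RTX 3090); device=-1 is CPU.
asr = pipeline("automatic-speech-recognition",
               model="jonatasgrosman/wav2vec2-xls-r-1b-portuguese",
               device=0)
print(asr("some_audio.wav")["text"])  # hypothetical local file
```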
I've been running several evaluation tests for some days now, using the WER metric, with the models above + Vosk on some public pt-BR data sets I downloaded, without using a language model for now. The results so far are (lower is better):
Dataset | Lapsbm | Voxforge | SID | MLS | TEDx | CORAA | Average | Weighted Average |
---|---|---|---|---|---|---|---|---|
Test Set duration (h) | 0.1 | 0.1 | 1 | 3.6 | 1.8 | 10.7 | 2.883 | - |
Models: | ||||||||
vosk-model-small-pt-0.3 | 0.195 | 0.325 | 0.264 | 0.37 | 0.445 | 0.655 | 0.376 | 0.546676301 |
jonatasgrosman/wav2vec2-large-xlsr-53-portuguese | 0.161 | 0.234 | 0.214 | 0.205 | 0.439 | 0.619 | 0.312 | 0.48583815 |
jonatasgrosman/wav2vec2-xls-r-1b-portuguese | 0.065 | 0.111 | 0.119 | 0.103 | 0.234 | 0.293 | 0.154 | 0.234895954 |
lgris/bp_400h_xlsr2_300M | 0.074 | 0.119 | 0.122 | 0.111 | 0.247 | 0.401 | 0.179 | 0.304982659 |
Edresson/wav2vec2-large-xlsr-coraa-portuguese | 0.11 | 0.189 | 0.168 | 0.162 | 0.321 | 0.251 | 0.200 | 0.233791908 |
I'm still running the evaluation on the Common Voice test set; it will take hours, and I'll update the results when finished.
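Just to make the metric explicit, WER here is the usual word error rate (substitutions + deletions + insertions divided by the number of words in the reference). A minimal sketch with the jiwer library, only as an illustration of the metric, not the evaluation script used here:

```python
import jiwer

reference = "o rato roeu a roupa do rei de roma"
hypothesis = "o rato roeu a ropa do rei"

# 1 substitution ("roupa" -> "ropa") + 2 deletions ("de roma") over 9 reference words
print(jiwer.wer(reference, hypothesis))  # ~0.333
```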
Results updated with the CommonVoice test set. I also fixed the TEDx test set duration, since I had used a version larger than the one reported by the other project:
I also painted yellow the cells of those sets whose train/dev subsets were used to train each model.
PS1: I don't know which data sets were used to train vosk-model-small-pt-0.3.
PS2: lgris/bp_400h_xlsr2_300M also used other data sets in training, which I considered too small or "very easy" to transcribe. Actually Lapsbm and Voxforge are very small, but since I used them in initial tests, I decided to keep them in the final report.
Given the results, and since the Edresson/wav2vec2-large-xlsr-coraa-portuguese model was trained using just one data set (CORAA, a difficult one, together with TEDx), I think the best models among those tested are:
- lgris/bp_400h_xlsr2_300M (smaller)
- jonatasgrosman/wav2vec2-xls-r-1b-portuguese (larger)

as guessed initially :-)
Good news: running the MS Azure transcription on the TEDx test set resulted in WER = 0.226. So it seems we have comparable models, and we aren't using any language model yet :-). From what I have seen, a language model could decrease WER by about 0.02-0.03 on those data sets.
I'll run the Azure implementation on the other test data sets and report here.
PS: [edited] The Azure model inserts useful punctuation marks like periods, question marks and commas, and uses uppercase letters after periods and question marks, but I had to remove them and convert the text to lowercase since the expected texts don't have them.
Including MS Azure pt-BR model results (the standard model, as it is today):
PS: I took a look at some Azure transcriptions on the (easy) SID test set because its WER seemed a bit high to me: Azure outputs digits (0 1 2...) when the "expected" text spells the numbers out (zero um dois...). On the other hand, this or a similar transformation could be improving Azure's results on VoxForge, and a similar situation could be happening with other models on other data sets. As I didn't check all results of all models on all data sets, I left the WER results as they are...
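Just to illustrate the kind of text normalization mentioned above (stripping punctuation and lowercasing the transcriptions before computing WER); a sketch only, the exact rules applied may differ:

```python
import re

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so the model output matches the reference style.
    text = text.lower()
    text = re.sub(r"[.,;:!?]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Zero, um, dois. Três?"))  # "zero um dois três"
```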
Results with current Google standard pt-BR model:
edited: I'll try their enhanced model for phone calls too.
Results including Google phone_call pt-BR model:
I'll try to gather some real-case audios/transcriptions to build an internal test data set to evaluate those models on, so we can check accuracy on a corpus/domain that was definitely not used for training (I think there is a lot of bias in some of those models...).
Current list of Google's models:
Maybe I'll try latest_long or video too...
Results including Google's latest_long pt-BR model; it seems there is no video model for pt-BR today:
I just finished the section on the wiki manual about how to enable this new local or remote implementation: https://github.com/sepinf-inc/IPED/wiki/User-Manual#wav2vec2
Please let me know if it needs a better explanation.
@lfcnassif, just some quick feedback here: I downloaded 4.1.0 yesterday, used it to process a new case I am working on, and set audio transcription to use this new wav2vec2 algorithm.
Results were really impressive, but as you warned in the configuration file comments and in the Wiki, it is much slower than Vosk when using only the CPU. For my particular case, the total processing time was still fine, as there weren't that many audios.
Setup (on Windows) was pretty straightforward, I just followed the IPED Wiki's instructions. One minor detail: I got an error message "Error testing FFmpeg, is it on path? Audios longer than 1min need it to be transcribed" that I don't remember seeing before (in 4.0.x). It was trivial to fix, though (I just downloaded FFmpeg for Windows and placed it on the path). Maybe this could be included in the setup instructions in the Wiki. Isn't it possible to include an FFmpeg Windows executable in IPED's distribution?
Thank you for trying this out so quickly! What model did you use? Jonatasgrosman's large one is better, but of course slower.
We can update the wiki for sure. It is possible to embed ffmpeg, I think its license is OK, but AFAIK it is 40-50MB in size. Actually I just use ffmpeg to split WAV files. I didn't manage to do it with MPlayer, do you know if it is possible?
PS: audio splitting is needed only by this new algorithm and by the Google implementation.
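For reference, a minimal sketch of the kind of splitting I mean, using ffmpeg's segment muxer to break a long WAV into ~60s chunks before transcription (file names are just examples, not necessarily how the IPED task does it):

```python
import subprocess

# Split long_audio.wav into part_000.wav, part_001.wav, ... of ~60s each,
# copying the stream without re-encoding.
subprocess.run([
    "ffmpeg", "-i", "long_audio.wav",
    "-f", "segment", "-segment_time", "60",
    "-c", "copy", "part_%03d.wav",
], check=True)
```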
> What model did you use? Jonatasgrosman's large one is better, but of course slower.

I used that large one. As I said, results were very good, considering that the audios were not easy to transcribe (noisy, a lot of slang and so on).
> We can update the wiki for sure. It is possible to embed ffmpeg, I think its license is OK, but AFAIK it is 40-50MB in size. Actually I just use ffmpeg to split WAV files. I didn't manage to do it with MPlayer, do you know if it is possible?

Yes, it would add some extra size to the IPED release. I downloaded a "complete" version which is even larger (~120 MB). I am not sure, but maybe it is possible to use MPlayer, which would be nice as it is already used. I am going to check and let you know if I find a way of using MPlayer instead.