IPED
#1214 Wav2vec2 audio transcription
When finished, this will close #1214.
Running directly from a python task (using jep) didn't work, model loading freezes... so I'm using an external python process to do the hard work. Anyway, this approach would be needed for the final implementation as an optional remote service anyway. TODOs:
- [x] break large audios into pieces to avoid possible OOMs. I didn't get any, but it seems they may occur: https://github.com/jonatasgrosman/huggingsound/issues/13
- [x] improve inter process communication error control
- [ ] check whether the suggested models use a Language Model (it generally improves accuracy) or not and create a configuration option to enable/disable that
- [x] test performance on our RTX 3090 GPU
- [ ] change IPC to use TCP sockets to allow running on another machine (with GPUs) as a service
- [x] investigate how to use all available processors (not to be confused with cores)
- [x] compare the lgris/bp_400h_xlsr2_300M and jonatasgrosman/wav2vec2-large-xlsr-53-portuguese models to decide which will be the suggested pt-BR default
Using the huggingsound library made this much easier than expected :-). For those interested in testing this, you just need to run
pip install huggingsound
into IPED's embedded python.
PS: The first run will download the models from the HuggingFace hub. They are 1.2GB-3.5GB in size and the download is often aborted by our office network provider.
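For anyone who wants to sanity-check the install, a minimal huggingsound usage sketch (the model name is the pt-BR one compared above; the audio path is illustrative):

```python
# Minimal huggingsound sanity check. The audio path below is illustrative;
# the first call downloads the model from the HuggingFace hub (see PS above).
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-portuguese")
result = model.transcribe(["/path/to/some_audio.wav"])
print(result[0]["transcription"])
```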
Now it uses all available processor sockets and robustness was improved. I think this can be used in real-world cases by those interested.
I'll take care of the other TODOs, mainly the remote service implementation, in 10 days when I return to the office.
I just pushed an initial, simple implementation for running the wav2vec2 algorithm as a service. It uses a very simple client-side load balancing approach for now. It has 3 components:
- a "naming" central node: it must be started initially and listen for workers registration and client queries for registered workers
- worker nodes: register itself at start up in the naming node and listen for client direct requests
- client node: query the naming node for registered workers and send requests directly to them using a simple circular queue. If the worker is busy, try the next one;
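Roughly, the client side works like the sketch below. This is only an illustration of the circular-queue idea, assuming a line-based TCP protocol, a "BUSY" reply and made-up host/port values; it is not the actual IPED code:

```python
# Illustrative sketch of the client-side round-robin balancing described above.
# The wire protocol, the "BUSY" reply and the host/port values are assumptions.
import socket
import time
from itertools import cycle

def get_workers(naming_host, naming_port):
    # Ask the naming node for the currently registered workers.
    with socket.create_connection((naming_host, naming_port)) as s:
        s.sendall(b"LIST_WORKERS\n")
        reply = s.makefile().readline().strip()
    # Assumed reply format: "host1:port1 host2:port2 ..."
    return [w.split(":") for w in reply.split()]

def transcribe(audio_path, naming_host="naming-node", naming_port=11111):
    workers = cycle(get_workers(naming_host, naming_port))
    while True:
        host, port = next(workers)
        try:
            with socket.create_connection((host, int(port)), timeout=10) as s:
                s.sendall(audio_path.encode() + b"\n")
                reply = s.makefile().readline().strip()
        except OSError:
            continue              # worker unreachable, try the next one
        if reply == "BUSY":
            time.sleep(0.1)       # fixed 100 ms sleep before the next attempt
            continue
        return reply              # the transcription text
```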
I'll do more robustness testing in the next days...
Maybe this very simple client-side load balancing could cause some starvation for clients with high latency: their requests take more time to travel, resulting in fewer requests per second than near clients, so the near ones would have a better chance of picking up an idle worker... Not sure whether that hypothesis is a real problem. If it is, we can change to an approach where the central node coordinates the client request queue and the idle workers using a fairness policy.
Possible improvements:
- configure an email address to receive alerts when the naming node or some worker node goes down ~~(this I may implement soon)~~
- encrypt the audios and transcriptions being sent and received (is this really needed? I have an implementation idea)
- ~~convert audios to wav on server side instead of client side to decrease (a lot) network usage~~
- ~~implement some kind of authentication to use the service~~, but I think some simple firewall rules can avoid unauthorized service access
- convert audios to wav on server side instead of client side to decrease (a lot) network usage
Just added the TODO above.
About the possible starvation, I got an idea: subtracting the measured latency from the fixed sleep time (currently 100ms) between transcription retries when a server is busy might bring more fairness to the acceptance of requests from near and far clients.
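A rough sketch of that idea, just to make it concrete (the 100 ms value comes from the comment above; the function and variable names are made up):

```python
import time

RETRY_SLEEP = 0.100  # fixed sleep between retries, currently 100 ms

def sleep_before_retry(request_sent_at, busy_reply_at):
    # Subtract the measured round-trip time from the fixed sleep, so clients
    # far from the workers don't pay their latency on top of the full sleep.
    latency = busy_reply_at - request_sent_at
    time.sleep(max(0.0, RETRY_SLEEP - latency))

# usage inside the retry loop (hypothetical helper names):
# t0 = time.monotonic()
# reply = send_request(worker, audio_path)
# if reply == "BUSY":
#     sleep_before_retry(t0, time.monotonic())
```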
Just to write this down somewhere before putting it into the docs: I'm having some headaches on Windows trying to install the libs needed to use language models:
pip install pyctcdecode
pip install https://github.com/kpu/kenlm/archive/master.zip
About the first, something must be missing in our embedded python, because it works after installing the lib in an external python installation.
About the second, I found https://github.com/kpu/kenlm/issues/364. There is a merged PR in a fork that fixed it, and this works:
pip install -e git+https://github.com/kpu/kenlm.git@f01e12d83c7fd03ebe6656e0ad6d73a3e022bd50#egg=kenlm
I'm having some issues while trying to use a language model, I asked for help here: https://github.com/jonatasgrosman/huggingsound/issues/62
As it is now, we are getting accuracy results similar to Microsoft and Google on the tested datasets. I'll leave the language model boost as a future improvement.
I just need to test this on a machine with more than 1 GPU to be sure all of them will be used correctly, then I think this could be merged.
@FelipeFcosta, when you have time, I'd also like to ask for your help testing this, thanks.
You will need to install the huggingsound lib into IPED's embedded python, enable audio transcription in IPEDConfig.txt and change the implementation from Vosk to Wav2Vec2 in conf/AudioTranscriptionConfig.txt.
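Something along these lines, though the exact property and class names may differ between IPED versions, so treat this as an illustration and check the comments inside each config file:

```
# IPEDConfig.txt
enableAudioTranscription = true

# conf/AudioTranscriptionConfig.txt (property and class names are illustrative)
implementationClass = dpf.sp.gpinf.indexer.process.task.transcript.Wav2Vec2TranscriptTask
```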
We are experiencing frequent reboots when using two RTX 3090 GPUs on the same server; it seems to be a hardware issue. For reference: https://github.com/pytorch/pytorch/issues/3022
Decreasing the GPUs' max power from 370W to 200W using
nvidia-smi -pl 200
seems to work around it; 250W is not enough. We will try to replace the 1200W power supply unit...
I think this is finished; the dual GPU hardware issue won't be solved by this software module. For those who could help with testing: https://github.com/sepinf-inc/IPED/wiki/User-Manual#wav2vec2
Just tested this on a 3-node cluster:
- 01 worker node with 05 GPUs
- 01 worker node with 01 GPU enabled (the one restarting when 2 are used)
- 01 worker node with 02 CPUs + the naming service
After the last commits, everything seems fine. The node with 05 GPUs is a rack server and is stable, so the restarts do seem to be a power supply unit issue. Going to merge this. Anyway, additional tests are welcome.