# Fine-Tune Whisper with Transformers and PEFT

Fine-tune Whisper using LoRA for Cantonese and Mandarin.

🤗 HF Repo • 🐱 GitHub Repo
## Get Started

### 1. Set Up the Docker Environment

Switch to the `docker` folder and build the Docker GPU image for training:

```shell
cd docker
docker compose build
```

Once the build completes, run the following commands to start a Docker container and attach to it:

```shell
docker compose up -d
docker exec -it asr bash
```
### 2. Prepare Training Data

See the `dataset_scripts` folder for details.
### 3. Fine-Tune the Pretrained Model

```shell
# Fine-tuning
python finetune.py --model_id base --streaming True --train_batch_size 64 --gradient_accumulation_steps 2 --fp16 True

# LoRA fine-tuning
python finetune_lora.py --model_id large-v2 --streaming True --train_batch_size 64 --gradient_accumulation_steps 2
```
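For intuition on what the LoRA script trains: LoRA keeps the pretrained weight matrix frozen and learns a low-rank update, so only a small fraction of parameters require gradients. The following is a minimal NumPy sketch of that idea (the dimensions and names here are illustrative assumptions, not taken from `finetune_lora.py`, which applies LoRA via the PEFT library):

```python
import numpy as np

# LoRA idea: keep the pretrained weight W frozen and learn a low-rank
# update delta_W = B @ A, where A is (r x d_in) and B is (d_out x r).
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 512, 512, 8, 16  # illustrative sizes, not Whisper's

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, initialised small
B = np.zeros((d_out, r))                   # trainable, initialised to zero

def lora_forward(x):
    # Forward pass with the low-rank update folded in,
    # scaled by alpha / r as in the LoRA paper.
    return x @ (W + (alpha / r) * B @ A).T

# Only A and B are trained:
full_params = W.size            # 262144
lora_params = A.size + B.size   # 8192, about 3% of the full matrix
print(full_params, lora_params)
```

Because `B` starts at zero, the model initially behaves exactly like the frozen pretrained model, and training only moves the small `A`/`B` matrices. This is why the LoRA rows in the tables below list only a few million trainable parameters.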
### 4. Evaluate Performance

```shell
# Evaluation
python eval.py --model_name_or_path Oblivion208/whisper-tiny-cantonese --streaming True --batch_size 64

# LoRA evaluation
python eval_lora.py --peft_model_id Oblivion208/whisper-large-v2-lora-mix --streaming True --batch_size 64
```
Note: Setting `--streaming` to `False` caches acoustic features on local disk, which speeds up fine-tuning but increases disk usage dramatically (to almost three times the size of the raw audio files).
## Approximate Performance Evaluation

The following models were all trained and evaluated on a single RTX 3090 GPU via Vast.ai.

### Cantonese Test Results Comparison

#### MDCC
| Model name | Parameters | Fine-tune Steps | Time Spent | Training Loss | Validation Loss | CER % | Fine-tuned Model |
|---|---|---|---|---|---|---|---|
| whisper-tiny-cantonese | 39 M | 3200 | 4h 34m | 0.0485 | 0.771 | 11.10 | Link |
| whisper-base-cantonese | 74 M | 7200 | 13h 32m | 0.0186 | 0.477 | 7.66 | Link |
| whisper-small-cantonese | 244 M | 3600 | 6h 38m | 0.0266 | 0.137 | 6.16 | Link |
| whisper-small-lora-cantonese | 3.5 M | 8000 | 21h 27m | 0.0687 | 0.382 | 7.40 | Link |
| whisper-large-v2-lora-cantonese | 15 M | 10000 | 33h 40m | 0.0046 | 0.277 | 3.77 | Link |
#### Common Voice Corpus 11.0
| Model name | Original CER % | w/o Fine-tuning CER % | Jointly Fine-tuned CER % |
|---|---|---|---|
| whisper-tiny-cantonese | 124.03 | 66.85 | 35.87 |
| whisper-base-cantonese | 78.24 | 61.42 | 16.73 |
| whisper-small-cantonese | 52.83 | 31.23 | / |
| whisper-small-lora-cantonese | 37.53 | 19.38 | 14.73 |
| whisper-large-v2-lora-cantonese | 37.53 | 19.38 | 9.63 |
## Requirements
- Transformers
- Accelerate
- Datasets
- PEFT
- bitsandbytes
- librosa
## References
- https://github.com/openai/whisper
- https://huggingface.co/blog/fine-tune-whisper
- https://huggingface.co/docs/peft/task_guides/int8-asr
- https://huggingface.co/alvanlii/whisper-largev2-cantonese-peft-lora