Lyrics-to-Audio-Alignment
This project aims to create an automatic alignment between textual lyrics and monophonic singing vocals (audio). Such a system would be very useful in a karaoke setting, where a performer wants to stay in sync with the background track. Traditional Hidden Markov Models are used for phoneme modelling, and a structural segmentation approach has been explored to break the audio (usually 4-5 minutes long) into smaller, structurally meaningful chunks (Intro, Verse, Chorus, etc.) without any implicit assumptions.
Watch the Demo
Pre-requisites
- [HTK tool-kit](http://htk.eng.cam.ac.uk/download.shtml)
- [sph2pipe](https://www.ldc.upenn.edu/language-resources/tools/sphere-conversion-tools)
- [Flite](http://www.speech.cs.cmu.edu/flite/download.html)
- [MSAF](https://github.com/urinieto/msaf/releases)
Training Steps
Training Acoustic Models
TIMIT
- Create initial HMM models (isolated phoneme training)

```
tcsh scripts/model_gen.sh <phonelist> <proto_file>
```
- Create connected HMM models (embedded re-estimation)

```
tcsh scripts/embedded_reestimation.sh <iterations>
```
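The `<proto_file>` argument above is an HTK prototype HMM definition. As a rough illustration only (a 5-state left-to-right prototype, shown with a 4-dimensional `<USER>` parameter kind for brevity; a real prototype would match the feature dimensionality, e.g. 39 for `MFCC_0_D_A`):

```
~o <VecSize> 4 <USER>
~h "proto"
<BeginHMM>
 <NumStates> 5
 <State> 2
  <Mean> 4
   0.0 0.0 0.0 0.0
  <Variance> 4
   1.0 1.0 1.0 1.0
 <State> 3
  <Mean> 4
   0.0 0.0 0.0 0.0
  <Variance> 4
   1.0 1.0 1.0 1.0
 <State> 4
  <Mean> 4
   0.0 0.0 0.0 0.0
  <Variance> 4
   1.0 1.0 1.0 1.0
 <TransP> 5
  0.0 1.0 0.0 0.0 0.0
  0.0 0.6 0.4 0.0 0.0
  0.0 0.0 0.6 0.4 0.0
  0.0 0.0 0.0 0.7 0.3
  0.0 0.0 0.0 0.0 0.0
<EndHMM>
```

States 1 and 5 are the non-emitting entry/exit states that HTK uses to chain phoneme models together during embedded re-estimation.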
DAMP
- Align the DAMP dataset with the generated HMM models using forced Viterbi alignment.
- Perform embedded re-estimation on the DAMP dataset to refine the phoneme models.
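Forced alignment here is done with HTK, but the underlying left-to-right Viterbi pass can be sketched in a few lines (a minimal illustration, not the project's implementation; `log_emis` stands in for the per-frame log-likelihoods that the trained acoustic models would supply, and transition scores are omitted for simplicity):

```python
def force_align(log_emis, n_states):
    """Left-to-right Viterbi forced alignment.

    log_emis[t][s]: log-likelihood of frame t under state s.
    States must be visited in order 0..n_states-1; at each frame the
    path either stays in the current state or advances to the next.
    Returns the best state index for every frame.
    """
    T = len(log_emis)
    NEG = float("-inf")
    # delta[t][s]: best log-score of a path ending in state s at frame t
    delta = [[NEG] * n_states for _ in range(T)]
    back = [[0] * n_states for _ in range(T)]
    delta[0][0] = log_emis[0][0]  # must start in the first state
    for t in range(1, T):
        for s in range(n_states):
            stay = delta[t - 1][s]
            move = delta[t - 1][s - 1] if s > 0 else NEG
            if stay >= move:
                delta[t][s], back[t][s] = stay, s
            else:
                delta[t][s], back[t][s] = move, s - 1
            delta[t][s] += log_emis[t][s]
    # must end in the final state; trace the path backwards
    path = [n_states - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

Because the topology is constrained to the known phoneme sequence, the traceback directly yields the frame-level segmentation of the lyrics.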
Structural Segmentation
- Use the MSAF library to segment the DAMP training data into structural segments

```
python scripts/msaf_segmentation.py <wav_in_dir> <wav_out_dir>
```
- Create MLF files corresponding to the segmented audio
```
python scripts/msaf_to_mlf.py <labfile_list>
```
- Perform embedded re-estimation within these segments to get the final phoneme models
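The conversion from segment boundaries to an HTK master label file can be sketched as follows (an illustrative snippet, not the project's `msaf_to_mlf.py`; it assumes boundary times in seconds and uses HTK's 100 ns time units):

```python
def boundaries_to_mlf(name, boundaries, labels):
    """Render structural segments as a single-entry HTK MLF string.

    boundaries: monotonically increasing times in seconds, with
    len(labels) + 1 entries (each label spans two boundaries).
    """
    def to_htk(sec):
        # HTK label times are expressed in 100 ns units
        return int(round(sec * 1e7))

    lines = ["#!MLF!#", f'"*/{name}.lab"']
    for start, end, lab in zip(boundaries, boundaries[1:], labels):
        lines.append(f"{to_htk(start)} {to_htk(end)} {lab}")
    lines.append(".")  # '.' terminates each entry in an MLF
    return "\n".join(lines)
```

For a whole dataset, the `#!MLF!#` header would be written once and one `"*/name.lab"` entry appended per segmented file.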
Testing
- To test any model, first run the forced Viterbi alignment

```
sh scripts/force_align.sh
```

Set parameters such as the model, features, MLF, dictionary, etc. inside the script.
- To evaluate the performance of the model, use the manually annotated ground truth and compute the overlap.

```
python scripts/lab_to_lrc.py <lyrics_list>
```

Set the ground-truth and output folders inside the script.
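The overlap measure can be sketched as interval intersection between predicted and ground-truth label intervals (an illustrative metric under assumed conventions, not necessarily what the evaluation script computes; intervals are `(start, end, label)` tuples in seconds and are assumed non-overlapping within each list):

```python
def overlap_ratio(pred, truth):
    """Fraction of ground-truth duration covered by predictions
    that carry the same label."""
    def intersect(a0, a1, b0, b1):
        return max(0.0, min(a1, b1) - max(a0, b0))

    total = sum(end - start for start, end, _ in truth)
    covered = sum(
        intersect(ps, pe, ts, te)
        for ps, pe, pl in pred
        for ts, te, tl in truth
        if pl == tl
    )
    return covered / total if total else 0.0
```

A ratio of 1.0 means every ground-truth interval is fully covered by a correctly labelled prediction; misaligned or mislabelled spans lower the score proportionally.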
Authors
- Phoneme Acoustic Modelling - Rupak Vignesh
- Structural Segmentation with MSAF - Benjamin Genchel
Acknowledgments
- Thanks to Alex Lerch for his guidance
- S Aswin Shanmugham's hybrid segmentation framework
- Stanford's DAMP dataset.