supervoice-librilight-preprocessed
supervoice-librilight-preprocessed copied to clipboard
60k hours of phoneme-aligned audio from audio books
Librilight Preprocessed
This is a preprocessed librilight
dataset with ASR using Whisper Large model and then aligned with montreal forced aligner.
[!WARNING]
This attempt of building a good large aligned dataset failed due to too low quality of whisper ASR for such task. Use libriheavy instead.
Dataset Structure
Structure is similar to original dataset. Each file is represented in three formats: .flac
with audio, .txt
with text, .TextGrid
with alignment. Top level folders are speakers, next one is a session and then the files split into up to 30 seconds
with rougtly 15 seconds
on average.
Downloads
This dataset can easily be downloaded using my datasets tool using identifiers: librilight-processed
, librilight-processed@medium
and librilight-processed@large
.
Or it can be downloaded directly from my server.
Reproduction
To reproduce the dataset you need to execute the following steps:
datasets sync # Download source datasets (depends on your network)
./prepare_cut.sh # Cut the audio files (fast)
./prepare_transcribe.sh # Transcribe the audio files using Whisper, this could take days and GPUs are needed
./prepare_align.sh # Align the transcribed files using montreal forced aligner, this could take days
./prepare_final.sh # Prepare the final datasets
License
MIT