Voice Dataset Creation
This repo outlines the steps and scripts necessary to create your own text-to-speech dataset for training a voice model. The final output is in LJSpeech format.

Table of Contents
- Create Your Own Voice Recordings
- Create a Synthetic Voice Dataset
- Create Transcriptions for Existing Voice Recordings
- Other Utilities
Create Your Own Voice Recordings
Requirements
- Voice Recording Software
- Omni-directional head-mounted microphone
- Good quality audio card
Create a Text Corpus of Sentences
- Create sentences that will take about 3-10 seconds when spoken
- Use LJSpeech format
  - Pipe ("|") separated values: the WAV file ID, then the sentence text, e.g. `100|this is an example sentence`
Speak and Record Sentences
- Speak each sentence as written
- Sample rate should be 22,050 Hz or greater
Sentence Lengths
Run `scripts/wavdurations2csv.sh` to chart sentence lengths and verify that you have a good distribution of WAV file lengths; a minimal Python equivalent is sketched below.
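If you just need a quick look at the distribution, this sketch does roughly what `scripts/wavdurations2csv.sh` produces; the `wavs` folder and `durations.csv` output name here are illustrative assumptions, not the shell script's actual interface.

```python
# Minimal sketch: write one CSV row per WAV file with its duration in
# seconds, so the length distribution can be charted in a spreadsheet.
# Folder and output names are assumptions, not the shell script's interface.
import csv
import wave
from pathlib import Path

def wav_durations_to_csv(wav_dir="wavs", out_csv="durations.csv"):
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "seconds"])
        for path in sorted(Path(wav_dir).glob("*.wav")):
            with wave.open(str(path), "rb") as w:
                writer.writerow([path.name, round(w.getnframes() / w.getframerate(), 2)])

if __name__ == "__main__":
    wav_durations_to_csv()
```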
Create a Synthetic Voice Dataset
Requirements
- Google Cloud Platform Compute Engine Instance
  - Under Cloud API access scopes, select `Allow full access to all Cloud APIs`
- Conda
Installation
Create Conda Environment on GCP Instance
    conda create -n tts python=3.7
    conda activate tts
    pip install google-cloud-texttospeech==2.1.0 tqdm pandas
Create a Text Corpus of Sentences
- Create sentences that will take about 3-10 seconds when spoken
- Use LJSpeech format
  - Pipe ("|") separated values: the WAV file ID, then the sentence text, e.g. `100|this is an example sentence`
Generate Synthetic Voice Dataset
    python text_to_wav.py tts_generate
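For reference, the snippet below is a minimal sketch of the kind of Cloud Text-to-Speech call `text_to_wav.py` wraps; the voice name, language code, and output path are illustrative assumptions, and the repo script is the authoritative implementation.

```python
# Sketch of a single synthesis request with google-cloud-texttospeech 2.x.
# Voice name, language code, and output path are illustrative assumptions.
import os
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="this is an example sentence"),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US", name="en-US-Wavenet-D"
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16,  # 16-bit PCM WAV
        sample_rate_hertz=22050,  # matches the target TTS training rate
    ),
)

# The LJSpeech-style ID from the corpus file becomes the WAV filename.
os.makedirs("wavs", exist_ok=True)
with open("wavs/100.wav", "wb") as f:
    f.write(response.audio_content)
```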
Sentence Lengths
Run `scripts/wavdurations2csv.sh` to chart sentence lengths and verify that you have a good distribution of WAV file lengths.
Create Transcriptions for Existing Voice Recordings
Requirements
- Adobe Audition or Audacity
- Google Cloud Platform Compute Engine Instance
  - Under Cloud API access scopes, select `Allow full access to all Cloud APIs`
- Conda
Installation
Create Conda Environment on GCP Instance
    conda create -n stt python=3.7
    conda activate stt
    pip install google-cloud-speech tqdm pandas
Fill out a Datasheet for the Voice Dataset
- Review Datasheets for Datasets by Gebru et al.: https://arxiv.org/pdf/1803.09010.pdf
- Markdown Datasheet: https://github.com/JRMeyer/markdown-datasheet-for-datasets/blob/master/DATASHEET.md
Mark the Speech
In Adobe Audition, open the audio file:
- Select `Diagnostics->Mark Audio`
- Select the `Mark the Speech` preset
- Click `Scan`
- Click `Find Levels`
- Click `Scan` again
- Click `Mark All`
- Adjust the audio and silence signal dB and length settings until clips are between 3 and 10 seconds
Or, in Audacity, open the audio file:
- Select `Analyze->Sound Finder`
- Adjust the audio and silence signal dB and length settings until clips are between 3 and 10 seconds
Adjust Markers or Label Boundaries
In Audition:
- Open the `Markers` tab
- Adjust markers, removing silence and noise, to make each clip between 3 and 10 seconds long
In Audacity:
- Adjust label boundaries, removing silence and noise, to make each clip between 3 and 10 seconds long
Export Markers/Labels and WAVs
In Audition:
- Select all markers in the list
- Select `Export Selected Markers to CSV` and save as Markers.csv
- Select `Preferences->Media & Disk Cache` and untick `Save Peak Files`
- Select `Export Audio of Selected Range Markers` with the following options:
  - Check `Use marker names in filenames`
  - Update Format to `WAV PCM`
  - Update Sample Type to `22050 Hz Mono, 16-bit`
  - Use folder `wavs_export`
Or, in Audacity:
- Select `Export multiple...` with the following options:
  - Format: WAV
  - Options: Signed 16-bit PCM
  - Split files based on Labels
  - Name files using Label/Track Name
  - Use folder `wavs_export`
- Select `Export labels` to Label Track.txt
Analyze WAVs with Signal to Noise Ratio Colab
- Run `colabs/voice_dataset_SNR.ipynb`
- Clean or remove noisy files
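The notebook is the real tool; for intuition, one naive way to approximate per-clip SNR is to treat the quietest analysis frames as the noise floor, as in this sketch (assumes 16-bit mono WAVs; the 10th-percentile noise estimate is an assumption, not necessarily what the Colab does).

```python
# Naive SNR estimate: frame the signal, take the quietest frames'
# power as the noise floor, and compare it to the mean power.
# Assumes 16-bit mono WAV input.
import wave
import numpy as np

def estimate_snr_db(path, frame_len=2048):
    with wave.open(path, "rb") as w:
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float64)
    n = len(samples) // frame_len
    power = (samples[: n * frame_len].reshape(n, frame_len) ** 2).mean(axis=1)
    noise_floor = np.percentile(power, 10) + 1e-12  # quietest 10% of frames
    return 10.0 * np.log10(power.mean() / noise_floor)
```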
Create Initial Transcriptions with STT
For Audition, using the exported Markers.csv and WAV folder, run:

    cd scripts
    python wav_to_text.py audition

The script generates a new file, Markers_STT.csv.
For Audacity, using the exported Label Track.txt and WAV folder, run:

    cd scripts
    python wav_to_text.py audacity

The script generates a new file, Label Track STT.txt.
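Both modes boil down to one synchronous Cloud Speech-to-Text request per clip. A minimal sketch of that call follows; the file path is illustrative, and `wav_to_text.py` handles the batching and CSV/label bookkeeping.

```python
# Sketch of one synchronous recognition request with google-cloud-speech.
# The file path is illustrative; the repo script batches over all clips.
from google.cloud import speech

client = speech.SpeechClient()

with open("wavs_export/100.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=22050,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)  # best hypothesis per segment
```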
Fine-tune Transcriptions
For Audition:
- Delete all markers
- Select `Import Markers from File` and choose the file with the STT transcriptions: Markers_STT.csv
- Fine-tune the Description field of each marker to exactly match the words spoken
For Audacity:
- Open Label Track STT.txt in a text editor
- Fine-tune the label text to exactly match the words spoken
Export Markers (Audition only) and WAVs
For Audition:
- Select all markers in the list
- Select `Export Selected Markers to CSV` and save as Markers.csv
- Select `Export Audio of Selected Range Markers` with the following options:
  - Check `Use marker names in filenames`
  - Update Format to `WAV PCM`
  - Update Sample Type to `22050 Hz Mono, 16-bit`
  - Use folder `wavs_export`
For Audacity:
- Select `Export multiple...` with the following options:
  - Format: WAV
  - Options: Signed 16-bit PCM
  - Split files based on Labels
  - Name files using Label/Track Name
  - Use folder `wavs_export`
Convert Markers (Audition) or Labels (Audacity) into LJSpeech Format
Using the exported Markers.csv (Audition) or Label Track STT.txt (Audacity) and the WAVs in wavs_export, scripts/markersfile_to_metadata.py creates a metadata.csv and a folder of WAVs for training your TTS model.
For Audition:

    python markersfile_to_metadata.py audition

For Audacity:

    python markersfile_to_metadata.py audacity
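As a rough illustration of the Audacity path, the sketch below turns a tab-separated label track into pipe-separated metadata lines; the sequential IDs are an assumption for illustration, and `scripts/markersfile_to_metadata.py` remains the authoritative implementation.

```python
# Illustrative sketch: convert an Audacity label track
# (start<TAB>end<TAB>text per line) into LJSpeech-style metadata
# (id|text per line). The sequential IDs here are an assumption.
def labels_to_metadata(labels_path="Label Track STT.txt",
                       out_path="metadata.csv"):
    with open(labels_path) as f_in, open(out_path, "w") as f_out:
        for i, line in enumerate(f_in, start=1):
            _start, _end, text = line.rstrip("\n").split("\t")
            f_out.write(f"{i:04d}|{text.strip()}\n")

if __name__ == "__main__":
    labels_to_metadata()
```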
Sentence Lengths
Run `scripts/wavdurations2csv.sh` to chart sentence lengths and verify that you have a good distribution of WAV file lengths.
Other Utilities
Upsample WAV file
We tested three methods to upsample WAV files from 16,000 Hz to 22,050 Hz, including ffmpeg and resampy. After reviewing the resulting spectrograms, we selected ffmpeg for upsampling, as it retains roughly 2 kHz more high-end information than resampy. The resampling step is wrapped in scripts/resamplewav.sh; a sketch of the ffmpeg call follows.
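This is a sketch of the kind of loop `scripts/resamplewav.sh` performs, using ffmpeg's `-ar` flag to resample to 22,050 Hz; the input and output folder names are assumptions.

```python
# Resample every WAV in one folder to 22,050 Hz with ffmpeg (-ar sets
# the output sample rate). Folder names are illustrative assumptions.
import subprocess
from pathlib import Path

src_dir, dst_dir = Path("wavs_16k"), Path("wavs_22k")
dst_dir.mkdir(parents=True, exist_ok=True)

for src in sorted(src_dir.glob("*.wav")):
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-ar", "22050", str(dst_dir / src.name)],
        check=True,
    )
```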
References
- Mozilla TTS: https://github.com/mozilla/TTS
- Automating alignment (segmenting audio on silence, Google Speech API, and recognition alignment): https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow#2-2-generate-korean-datasets
- Pretraining on large synthetic corpora and fine-tuning on specific ones: https://twitter.com/garygarywang
- Datasheets for Datasets: https://arxiv.org/abs/1803.09010