Voice Dataset Creation
This repo outlines the steps and scripts necessary to create your own text-to-speech dataset for training a voice model. The final output is in LJSpeech format.

Table of Contents
- Create Your Own Voice Recordings
- Create a Synthetic Voice Dataset
- Create Transcriptions for Existing Voice Recordings
- Other Utilities
Create Your Own Voice Recordings
Requirements
- Voice Recording Software
- Omni-directional head-mounted microphone
- Good quality audio card
Create a Text Corpus of Sentences
- Create sentences that will take about 3-10 seconds when spoken
- Use LJSpeech format
  - Pipe ("|") separated values: the WAV file ID, then the sentence text, e.g. `100|this is an example sentence`
Speak and Record Sentences
- Speak each sentence as written
- Sample rate should be 22,050 Hz or greater
Sentence Lengths
Run `scripts/wavdurations2csv.sh` to chart sentence lengths and verify that you have a good distribution of WAV file lengths; a minimal Python equivalent is sketched below.
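If you just need a quick look at the distribution, this sketch does roughly what `scripts/wavdurations2csv.sh` produces; the `wavs` folder and `durations.csv` output name here are illustrative assumptions, not the shell script's actual interface.

```python
# Minimal sketch: write one CSV row per WAV file with its duration in
# seconds, so the length distribution can be charted in a spreadsheet.
# Folder and output names are assumptions, not the shell script's interface.
import csv
import wave
from pathlib import Path

def wav_durations_to_csv(wav_dir="wavs", out_csv="durations.csv"):
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "seconds"])
        for path in sorted(Path(wav_dir).glob("*.wav")):
            with wave.open(str(path), "rb") as w:
                writer.writerow([path.name, round(w.getnframes() / w.getframerate(), 2)])

if __name__ == "__main__":
    wav_durations_to_csv()
```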
Create a Synthetic Voice Dataset
Requirements
- Google Cloud Platform Compute Engine Instance
  - Under Cloud API access scopes, select `Allow full access to all Cloud APIs`
- Conda
Installation
Create Conda Environment on GCP Instance
    conda create -n tts python=3.7
    conda activate tts
    pip install google-cloud-texttospeech==2.1.0 tqdm pandas
Create a Text Corpus of Sentences
- Create sentences that will take about 3-10 seconds when spoken
- Use LJSpeech format
  - Pipe ("|") separated values: the WAV file ID, then the sentence text, e.g. `100|this is an example sentence`
Generate Synthetic Voice Dataset
    python text_to_wav.py tts_generate
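For reference, the snippet below is a minimal sketch of the kind of Cloud Text-to-Speech call `text_to_wav.py` wraps; the voice name, language code, and output path are illustrative assumptions, and the repo script is the authoritative implementation.

```python
# Sketch of a single synthesis request with google-cloud-texttospeech 2.x.
# Voice name, language code, and output path are illustrative assumptions.
import os
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="this is an example sentence"),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US", name="en-US-Wavenet-D"
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16,  # 16-bit PCM WAV
        sample_rate_hertz=22050,  # matches the target TTS training rate
    ),
)

# The LJSpeech-style ID from the corpus file becomes the WAV filename.
os.makedirs("wavs", exist_ok=True)
with open("wavs/100.wav", "wb") as f:
    f.write(response.audio_content)
```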
Sentence Lengths
Run `scripts/wavdurations2csv.sh` to chart sentence lengths and verify that you have a good distribution of WAV file lengths.
Create Transcriptions for Existing Voice Recordings
Requirements
- Adobe Audition or Audacity
- Google Cloud Platform Compute Engine Instance
  - Under Cloud API access scopes, select `Allow full access to all Cloud APIs`
- Conda
Installation
Create Conda Environment on GCP Instance
    conda create -n stt python=3.7
    conda activate stt
    pip install google-cloud-speech tqdm pandas
Fill out a Datasheet for the Voice Dataset
- Review Datasheets for Datasets by Gebru et al.: https://arxiv.org/pdf/1803.09010.pdf
- Markdown Datasheet: https://github.com/JRMeyer/markdown-datasheet-for-datasets/blob/master/DATASHEET.md
Mark the Speech
In Adobe Audition, open the audio file:
- Select `Diagnostics->Mark Audio`
- Select the `Mark the Speech` preset
- Click `Scan`
- Click `Find Levels`
- Click `Scan` again
- Click `Mark All`
- Adjust the audio and silence signal dB and length settings until clips are between 3 and 10 seconds
Or, in Audacity, open the audio file:
- Select `Analyze->Sound Finder`
- Adjust the audio and silence signal dB and length settings until clips are between 3 and 10 seconds
Adjust Markers or Label Boundaries
In Audition:
- Open the `Markers` tab
- Adjust markers, removing silence and noise, to make each clip between 3 and 10 seconds long
In Audacity:
- Adjust label boundaries, removing silence and noise, to make each clip between 3 and 10 seconds long
Export Markers/Labels and WAVs
In Audition:
- Select all markers in the list
- Select `Export Selected Markers to CSV` and save as Markers.csv
- Select `Preferences->Media & Disk Cache` and untick `Save Peak Files`
- Select `Export Audio of Selected Range Markers` with the following options:
  - Check `Use marker names in filenames`
  - Update Format to `WAV PCM`
  - Update Sample Type to `22050 Hz Mono, 16-bit`
  - Use folder `wavs_export`
Or, in Audacity:
- Select `Export multiple...` with the following options:
  - Format: WAV
  - Options: Signed 16-bit PCM
  - Split files based on Labels
  - Name files using Label/Track Name
  - Use folder `wavs_export`
- Select `Export labels` to Label Track.txt
Analyze WAVs with Signal to Noise Ratio Colab
- Run `colabs/voice_dataset_SNR.ipynb`
- Clean or remove noisy files
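The notebook is the real tool; for intuition, one naive way to approximate per-clip SNR is to treat the quietest analysis frames as the noise floor, as in this sketch (assumes 16-bit mono WAVs; the 10th-percentile noise estimate is an assumption, not necessarily what the Colab does).

```python
# Naive SNR estimate: frame the signal, take the quietest frames'
# power as the noise floor, and compare it to the mean power.
# Assumes 16-bit mono WAV input.
import wave
import numpy as np

def estimate_snr_db(path, frame_len=2048):
    with wave.open(path, "rb") as w:
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float64)
    n = len(samples) // frame_len
    power = (samples[: n * frame_len].reshape(n, frame_len) ** 2).mean(axis=1)
    noise_floor = np.percentile(power, 10) + 1e-12  # quietest 10% of frames
    return 10.0 * np.log10(power.mean() / noise_floor)
```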
Create Initial Transcriptions with STT
For Audition, using the exported Markers.csv and WAV folder, run:

    cd scripts
    python wav_to_text.py audition

The script generates a new file, Markers_STT.csv.
For Audacity, using the exported Label Track.txt and WAV folder, run:

    cd scripts
    python wav_to_text.py audacity

The script generates a new file, Label Track STT.txt.
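Both modes boil down to one synchronous Cloud Speech-to-Text request per clip. A minimal sketch of that call follows; the file path is illustrative, and `wav_to_text.py` handles the batching and CSV/label bookkeeping.

```python
# Sketch of one synchronous recognition request with google-cloud-speech.
# The file path is illustrative; the repo script batches over all clips.
from google.cloud import speech

client = speech.SpeechClient()

with open("wavs_export/100.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=22050,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)  # best hypothesis per segment
```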
Fine-tune Transcriptions
For Audition:
- Delete all markers
- Select `Import Markers from File` and choose the file with the STT transcriptions: Markers_STT.csv
- Fine-tune the Description field of each marker to exactly match the words spoken
For Audacity:
- Open Label Track STT.txt in a text editor
- Fine-tune the label text to exactly match the words spoken
Export Markers (Audition only) and WAVs
For Audition:
- Select all markers in the list
- Select `Export Selected Markers to CSV` and save as Markers.csv
- Select `Export Audio of Selected Range Markers` with the following options:
  - Check `Use marker names in filenames`
  - Update Format to `WAV PCM`
  - Update Sample Type to `22050 Hz Mono, 16-bit`
  - Use folder `wavs_export`
For Audacity:
- Select `Export multiple...` with the following options:
  - Format: WAV
  - Options: Signed 16-bit PCM
  - Split files based on Labels
  - Name files using Label/Track Name
  - Use folder `wavs_export`
Convert Markers (Audition) or Labels (Audacity) into LJSpeech Format
Using the exported Markers.csv (Audition) or Label Track STT.txt (Audacity) and the WAVs in wavs_export, scripts/markersfile_to_metadata.py creates a metadata.csv and a folder of WAVs for training your TTS model.
For Audition:

    python markersfile_to_metadata.py audition

For Audacity:

    python markersfile_to_metadata.py audacity
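As a rough illustration of the Audacity path, the sketch below turns a tab-separated label track into pipe-separated metadata lines; the sequential IDs are an assumption for illustration, and `scripts/markersfile_to_metadata.py` remains the authoritative implementation.

```python
# Illustrative sketch: convert an Audacity label track
# (start<TAB>end<TAB>text per line) into LJSpeech-style metadata
# (id|text per line). The sequential IDs here are an assumption.
def labels_to_metadata(labels_path="Label Track STT.txt",
                       out_path="metadata.csv"):
    with open(labels_path) as f_in, open(out_path, "w") as f_out:
        for i, line in enumerate(f_in, start=1):
            _start, _end, text = line.rstrip("\n").split("\t")
            f_out.write(f"{i:04d}|{text.strip()}\n")

if __name__ == "__main__":
    labels_to_metadata()
```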
Sentence Lengths
Run `scripts/wavdurations2csv.sh` to chart sentence lengths and verify that you have a good distribution of WAV file lengths.
Other Utilities
Upsample WAV file
We tested three methods to upsample WAV files from 16,000 Hz to 22,050 Hz, including ffmpeg and resampy. After reviewing the resulting spectrograms, we selected ffmpeg for upsampling, as it retains roughly 2 kHz more high-end information than resampy. The resampling step is wrapped in scripts/resamplewav.sh; a sketch of the ffmpeg call follows.
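This is a sketch of the kind of loop `scripts/resamplewav.sh` performs, using ffmpeg's `-ar` flag to resample to 22,050 Hz; the input and output folder names are assumptions.

```python
# Resample every WAV in one folder to 22,050 Hz with ffmpeg (-ar sets
# the output sample rate). Folder names are illustrative assumptions.
import subprocess
from pathlib import Path

src_dir, dst_dir = Path("wavs_16k"), Path("wavs_22k")
dst_dir.mkdir(parents=True, exist_ok=True)

for src in sorted(src_dir.glob("*.wav")):
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-ar", "22050", str(dst_dir / src.name)],
        check=True,
    )
```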
References
- Mozilla TTS: https://github.com/mozilla/TTS
- Automating alignment (segmenting audio on silence, Google Speech API, and recognition alignment): https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow#2-2-generate-korean-datasets
- Pretraining on large synthetic corpora and fine-tuning on specific ones: https://twitter.com/garygarywang
- Datasheets for Datasets: https://arxiv.org/abs/1803.09010