speechless
Speech-to-text based on wav2letter built for transfer learning
Speech recognizer based on wav2letter architecture built with Keras.
Supports CTC loss, KenLM and greedy decoding, and transfer learning between different languages. ASG loss is currently not supported.
Training for English with the 1000h LibriSpeech corpus works out of the box, while training for the German language requires downloading data manually.
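Greedy decoding mentioned above picks the most likely label at every frame, collapses repeated labels, and then drops CTC blanks. A minimal sketch of that collapse step (label indices stand in for characters; this is an illustration, not the project's decoder):

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse repeated labels, then drop CTC blanks (greedy best-path decoding)."""
    decoded = []
    previous = None
    for label in frame_labels:
        # a label is emitted only when it differs from its predecessor and is not the blank
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded
```

Note that a blank between two identical labels keeps both, which is how CTC represents doubled letters.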
Installation
Python 3.4+ and TensorFlow are required.
pip3 install git+git@github.com:JuliusKunze/speechless.git
will install speechless together with minimal requirements.
If you want to use the KenLM decoder, this modified version of TensorFlow needs to be installed first.
You need to have an audio backend available, for example ffmpeg (run brew install ffmpeg on macOS).
Training
from speechless.configuration import Configuration
Configuration.minimal_english().train_from_beginning()
will automatically download a small English example corpus (337 MB) and train a net on it while printing updated loss and predictions. With a strong consumer-grade GPU, you should observe training predictions become similar to the input after ~12h, e.g.
Expected: "just thrust and parry and victory to the stronger"
Predicted: "jest thcrus and pary and bettor o the stronter"
Errors: 10 letters (20%), 6 words (67%), loss: 37.19.
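The letter and word error counts above are edit distances between the expected and predicted transcripts. A minimal sketch of how such rates can be computed (this is an illustration, not the project's metric code):

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance; works on strings or word lists."""
    previous_row = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current_row = [i]
        for j, item_b in enumerate(b, 1):
            current_row.append(min(
                previous_row[j] + 1,                       # deletion
                current_row[j - 1] + 1,                    # insertion
                previous_row[j - 1] + (item_a != item_b)   # substitution (free if equal)
            ))
        previous_row = current_row
    return previous_row[-1]

expected = "just thrust and parry and victory to the stronger"
predicted = "jest thcrus and pary and bettor o the stronter"

letter_errors = edit_distance(expected, predicted)
word_errors = edit_distance(expected.split(), predicted.split())
```

Dividing by the length of the expected transcript (in letters or words) yields the percentages shown in the training output.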
All data (corpus, nets, logs) will be stored in ~/speechless-data.
This directory can be changed:
from pathlib import Path
from speechless import configuration
from speechless.configuration import Configuration, DataDirectories
configuration.default_data_directories = DataDirectories(Path("/your/data/path"))
Configuration.minimal_english().train_from_beginning()
To download and train on the full 1000h LibriSpeech corpus, replace minimal_english with english.
main.py contains various other functions that were executed to train and use models.
If you want complete flexibility over where data is saved and loaded from, you should not use Configuration at all, but instead use the code from net, corpus, german_corpus, english_corpus and recording directly.
Loading
By default, all trained models are stored in the ~/speechless-data/nets directory.
To use pretrained models, download them into this folder (keeping the subfolder structure from Google Drive).
To load such a model, use load_best_english_model or load_best_german_model, e.g.
from speechless.configuration import Configuration
wav2letter = Configuration.german().load_best_german_model()
If the model was originally trained with a different character set (e.g. on a corpus of another language), specifying the allowed_characters_for_loaded_model parameter of load_model still allows you to use that model for training, thereby enabling transfer learning.
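The reason this works is that the grapheme inventories of related languages largely overlap, so most output units can keep their learned weights and only the new characters need freshly initialized ones. A toy sketch of that idea (the character sets here are illustrative, not the project's exact ones):

```python
# hypothetical English character inventory: space, apostrophe, a-z
english_chars = list(" 'abcdefghijklmnopqrstuvwxyz")

# a hypothetical German inventory adds umlauts and eszett on top of it
german_chars = english_chars + list("äöüß")

# output units for shared characters can reuse the English model's weights;
# only the units for the new characters must be trained from scratch
shared_chars = [c for c in german_chars if c in english_chars]
new_chars = [c for c in german_chars if c not in english_chars]
```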
Recording
You can record your own audio with a microphone and get a prediction for it:
# ... after loading a model, see above
from speechless.recording import record_plot_and_save
label = record_plot_and_save()
print(wav2letter.predict(label))
Three seconds of silence will end the recording, and the surrounding silence will be truncated.
By default, this will generate a wav file and a spectrogram plot in ~/speechless-data/recordings.
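Truncating the silence can be pictured as trimming leading and trailing low-amplitude samples against a threshold. A simplified sketch (an illustration, not the recorder's actual logic):

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose absolute amplitude is below the threshold."""
    loud_indices = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud_indices:
        return []  # recording contained only silence
    # keep everything between the first and last loud sample
    return samples[loud_indices[0]:loud_indices[-1] + 1]
```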
Testing
Given that you have downloaded the German corpus into the corpus directory, you can evaluate the German model on the test set:
german.test_model_grouped_by_loaded_corpus_name(wav2letter)
Testing will write to the standard output and a log to ~/speechless-data/test-results
by default.
Plotting
Plotting labeled audio examples from the corpus can be done with LabeledExamplePlotter.save_spectrogram.
German & Sections
For some German datasets, it is possible to retrieve which word is spoken at which point in time, making it possible to extract labeled sections, e.g.:
from speechless.configuration import Configuration
german = Configuration.german()
wav2letter = german.load_best_german_model()
example = german.corpus.examples[0]
sections = example.sections()
for section in sections:
    print(wav2letter.test_and_predict(section))
If you only need the section labels (e.g. for filtering for particular words), use example.positional_label.labels (which is faster because no audio data needs to be sliced).
If no positional info is available, sections and positional_label are None.
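Conceptually, a section's audio is just a time-stamped slice of the example's sample array. A hypothetical sketch, assuming a 16 kHz sample rate and start/end times in seconds (names are illustrative, not the project's API):

```python
def slice_section(samples, start_seconds, end_seconds, sample_rate=16000):
    """Cut the samples belonging to one labeled section out of the full recording."""
    # convert time stamps to sample indices, then slice
    start_index = int(start_seconds * sample_rate)
    end_index = int(end_seconds * sample_rate)
    return samples[start_index:end_index]

# e.g. the word spoken between 0.5s and 1.0s of a 2s recording
samples = list(range(32000))
word_samples = slice_section(samples, 0.5, 1.0)
```

This also illustrates why reading only positional_label.labels is cheaper: the label list is available without touching the audio data at all.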