Word-level LSTM text generation, used to generate lyrics from a corpus of a music genre.
Update: it now also works with word embeddings.
How to install
First, clone this repository and cd into it:
git clone
cd lstm_lyrics
If you want to test an experimental branch:
git clone -b <experimental_branch>
Install dependencies
If necessary, install virtualenv
pip install virtualenv
Then create an environment and install the dependencies
virtualenv env --python=python3.6
source env/bin/activate
pip install -r requirements.txt
To start the training:
To train the one-hot encoded version:
python3 corpora/corpus_banda.txt examples.txt vocabulary.txt
- corpora/corpus_banda.txt: points to the corpus you want to train from
- examples.txt: is the file where the example text is going to be written after every epoch
- vocabulary.txt: is a file where all the words used by the network are written, one per line. It is used by
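As a rough illustration of what "one-hot encoded" means for word-level training, the sketch below turns a tiny corpus into input windows and next-word targets. The function names (`build_vocabulary`, `one_hot`, `make_training_pairs`) and the toy corpus are assumptions made for this example; they are not taken from the training script.

```python
def build_vocabulary(words):
    """Map each distinct word to an integer index, in sorted order."""
    return {word: i for i, word in enumerate(sorted(set(words)))}


def one_hot(word, word_to_index):
    """Return a list of zeros with a single 1 at the position of `word`."""
    vector = [0] * len(word_to_index)
    vector[word_to_index[word]] = 1
    return vector


def make_training_pairs(words, sequence_length):
    """Slide a window over the corpus: each window predicts the next word."""
    pairs = []
    for i in range(len(words) - sequence_length):
        window = words[i:i + sequence_length]
        target = words[i + sequence_length]
        pairs.append((window, target))
    return pairs


# Hypothetical toy corpus, only for illustration.
corpus = "dale duro mami dale duro perrea".split()
word_to_index = build_vocabulary(corpus)
pairs = make_training_pairs(corpus, sequence_length=3)

# x has shape (samples, sequence_length, vocabulary_size),
# y has shape (samples, vocabulary_size).
x = [[one_hot(w, word_to_index) for w in window] for window, _ in pairs]
y = [one_hot(target, word_to_index) for _, target in pairs]
```

Each input vector is as long as the vocabulary, which is why this representation grows expensive for large corpora.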
Or the version using Word Embedding (words to vectors):
python3 corpora/corpus_reggeaton.txt examples_reggeaton.txt
- corpora/corpus_reggeaton.txt: points to the corpus you want to train from
- examples_reggeaton.txt: is the file where the example text is going to be written after every epoch
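With word embeddings, each word is typically fed to the network as an integer index that selects a dense vector from a lookup table, instead of a vocabulary-sized one-hot vector. The sketch below shows that lookup with the standard library only. The names `encode`, `make_embedding_table` and `embed` are hypothetical, and the table is filled with random values here; in a Keras model the equivalent table would normally live in an `Embedding` layer and be learned during training.

```python
import random


def encode(words, word_to_index):
    """Replace each word with its integer index."""
    return [word_to_index[w] for w in words]


def make_embedding_table(vocabulary_size, dimensions, seed=0):
    """Create one small random vector per word (a stand-in for learned weights)."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(dimensions)]
        for _ in range(vocabulary_size)
    ]


def embed(indices, table):
    """Look up the dense vector of every index in the sequence."""
    return [table[i] for i in indices]


# Hypothetical vocabulary, only for illustration.
word_to_index = {"dale": 0, "duro": 1, "mami": 2, "perrea": 3}
table = make_embedding_table(vocabulary_size=len(word_to_index), dimensions=8)

indices = encode("perrea mami perrea".split(), word_to_index)
vectors = embed(indices, table)  # 3 vectors of 8 floats each
```

The input per word shrinks from the vocabulary size to a single integer, and repeated words share the same vector.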
To generate text from a trained model:
The model and weights are saved in the ./checkpoints/ directory.
These files can be used to generate text from a given seed.
The script
is used for this. It uses argparse to manage the arguments.
$ python -h
Using TensorFlow backend.
usage: [-h] [-v VOCABULARY] [-n NETWORK] [-c CORPUS] [-s SEED]
Generate lyrics using the weights of a trained network.
optional arguments:
-h, --help show this help message and exit
The path of vocabulary used by the network.
-n NETWORK, --network NETWORK
The path of the trained network.
-s SEED, --seed SEED The seed used to generate the text. All the words
should be part of the vocabulary. Only the last
SEQUENCE_LENGTH words are considered
The length of the sequence used for the training.
Default is 10
The value of diversity. Usually a number between 0.1
and 2 Default is 0.5
-q QUANTITY, --quantity QUANTITY
Quantity of words to generate. Default is 50
At least -v and -n are needed to generate text.
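The diversity value in the help text is commonly implemented as a sampling temperature applied to the probabilities predicted by the network. The sketch below assumes that approach; the source does not show how the script does it, and the names `apply_diversity` and `sample_index` are hypothetical.

```python
import math
import random


def apply_diversity(probabilities, diversity):
    """Rescale a probability distribution with a temperature ("diversity")."""
    # Clamp to avoid log(0), then divide the log-probabilities by the diversity.
    logits = [math.log(max(p, 1e-12)) / diversity for p in probabilities]
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probabilities, diversity=0.5, rng=random):
    """Draw the index of the next word from the rescaled distribution."""
    rescaled = apply_diversity(probabilities, diversity)
    return rng.choices(range(len(rescaled)), weights=rescaled, k=1)[0]


# Hypothetical network output over a 3-word vocabulary.
predicted = [0.7, 0.2, 0.1]
conservative = apply_diversity(predicted, 0.5)  # sharper: favours the top word
adventurous = apply_diversity(predicted, 2.0)   # flatter: more surprising words
next_word_index = sample_index(predicted, diversity=0.7, rng=random.Random(0))
```

Values below 1 make the output more repetitive and predictable, while values above 1 give rarer words a better chance of being chosen.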
For instance:
$ python -v reggeaton_vocabulary.txt -n checkpoints/LSTM_LYRICS-epoch009-words12952-sequence10-minfreq10-loss2.1511-acc0.6280-val_loss2.9192-val_acc0.5508 -s "perrea mami perrea dale duro" -q 60 -d 0.7
Using TensorFlow backend.
Summary of the Network:
Layer (type) Output Shape Param #
bidirectional_1 (Bidirection (None, 256) 13394944
dropout_1 (Dropout) (None, 256) 0
dense_1 (Dense) (None, 12952) 3328664
activation_1 (Activation) (None, 12952) 0
Total params: 16,723,608
Trainable params: 16,723,608
Non-trainable params: 0
Validating that all the words in the seed are part of the vocabulary:
perrea ✓ in vocabulary
mami ✓ in vocabulary
perrea ✓ in vocabulary
dale ✓ in vocabulary
duro ✓ in vocabulary
Seed is correct.
----- Generating text
----- Diversity:0.7
----- Generating with seed:
"perrea mami perrea dale duro perrea mami perrea dale duro
perrea mami perrea dale duro perrea mami perrea dale duro
que no se siente bien
que tengo la pista en mi cuerpo
y no se porque tú eres la misma hecho
que no te tengo a mi a mi me encanta tu culpa
si tu me dices tu me llamas si te sientes solita
y es que vuelvo a ver
For now, the generated text is only printed to the console.
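The seed handling visible in the help text and in the sample output (every word must be in the vocabulary, and only the last SEQUENCE_LENGTH words are used) can be sketched as follows. The function name `prepare_seed` is hypothetical. Repeating a short seed until it fills the window is an assumption inferred from the sample output, where the five-word seed appears twice; the script may handle short seeds differently.

```python
def prepare_seed(seed, vocabulary, sequence_length=10):
    """Validate a seed against the vocabulary and fit it to the window size."""
    words = seed.split()
    if not words:
        raise ValueError("The seed is empty")

    # Every word of the seed must be known to the network.
    unknown = [w for w in words if w not in vocabulary]
    if unknown:
        raise ValueError("Words not in vocabulary: " + ", ".join(unknown))

    # Assumption: a short seed is repeated until it fills the window.
    while len(words) < sequence_length:
        words = words + words

    # Only the last `sequence_length` words are considered.
    return words[-sequence_length:]


# Hypothetical vocabulary, only for illustration.
vocabulary = {"perrea", "mami", "dale", "duro"}
window = prepare_seed("perrea mami perrea dale duro", vocabulary)
```

Failing early on unknown words avoids a less readable lookup error later, when the seed is converted to indices.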
Text classifier
The objective is to train a neural network that distinguishes real text taken from a corpus from randomly generated text. The idea is to improve the quality of the generated lyrics by filtering out lines that look like randomly chosen words.
To create the training set:
python3 utils/ corpora/corpus_banda.txt banda_subset.txt random_banda.txt
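A training set of this kind pairs real corpus lines with lines made of randomly chosen words. The sketch below shows one way to build it in memory. It is an illustration under stated assumptions, not the logic of the utility script: the names `make_random_lines` and `build_training_set`, the labels 1 (real) and 0 (random), and the choice to match each random line to the word count of a real line are all hypothetical.

```python
import random


def make_random_lines(real_lines, rng):
    """For each real line, build a line of random vocabulary words of equal length."""
    vocabulary = sorted({word for line in real_lines for word in line.split()})
    return [
        " ".join(rng.choice(vocabulary) for _ in line.split())
        for line in real_lines
    ]


def build_training_set(real_lines, seed=0):
    """Return shuffled (line, label) pairs: 1 for real text, 0 for random text."""
    rng = random.Random(seed)
    random_lines = make_random_lines(real_lines, rng)
    examples = [(line, 1) for line in real_lines]
    examples += [(line, 0) for line in random_lines]
    rng.shuffle(examples)
    return examples


# Hypothetical corpus subset, only for illustration.
real_lines = [
    "que no se siente bien",
    "si te sientes solita",
    "y es que vuelvo a ver",
]
training_set = build_training_set(real_lines)
```

Keeping the word counts equal prevents the classifier from separating the two classes by line length alone.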
How to contribute
Make sure that your changes do not introduce any flake8 errors:
$ flake8