# Dummy Dataset for ASR purposes
For complicated `custom_construction_network` or MultiStage training, such as in returnn-experiments #60, it would be nice to have some dummy dataset for fast testing. By testing, I mean checking whether the network works as intended for different (sub)epochs. One could of course use the dataset at hand, such as LibriSpeech, Switchboard or Timit, but their (sub)epochs are long and can take several minutes.

I have two ideas right now.
### 1. Add an option on the `Dataset` class to use only one seq per (sub)epoch

One way would be to figure out how to use only one sequence per (sub)epoch from the dataset at hand. That could be set as an option on the `Dataset` class and would hopefully not require too many changes. I tried looking into it but couldn't find out how to do that; maybe I need to spend some more time with the `Dataset` class. The sketch below illustrates the idea.
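As a toy illustration of the concept (plain Python, not actual RETURNN API; the wrapper name and methods are made up):

```python
# Toy sketch, NOT actual RETURNN API: a wrapper that caps how many
# sequences an underlying dataset exposes per (sub)epoch.
class LimitSeqsPerEpoch:
  def __init__(self, seqs, max_seqs_per_epoch=1):
    self.seqs = seqs  # stand-in for a real dataset
    self.max_seqs_per_epoch = max_seqs_per_epoch

  def seqs_for_epoch(self, epoch):
    # Rotate through the data so each (sub)epoch sees a different tiny slice.
    start = (epoch * self.max_seqs_per_epoch) % len(self.seqs)
    return [self.seqs[(start + i) % len(self.seqs)]
            for i in range(self.max_seqs_per_epoch)]


limited = LimitSeqsPerEpoch(["seq-%i" % i for i in range(5)], max_seqs_per_epoch=1)
for ep in range(3):
  print(ep, limited.seqs_for_epoch(ep))  # one different seq per epoch
```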
### 2. Create a `DummyDataset` over which we have full control

The other way would be to have a `DummyDataset` class where we have full control and can decide `num_seqs`, `input_seq_len`, `output_seq_len`, `input_dim`, `output_dim` and the `Vocab`. Here I made some progress, but I am not able to properly incorporate the `Vocab`.
The dataset looks like this:
```python
import numpy
# Imports assumed; adjust to where GeneratingDataset/DatasetSeq live in your RETURNN version.
from returnn.datasets.basic import DatasetSeq
from returnn.datasets.generating import GeneratingDataset


class DummyDataset2(GeneratingDataset):
  """
  Some dummy data, which does not have any meaning.
  If you want to have artificial data with some meaning, look at other datasets here.
  The inputs are dense data, the outputs are sparse.
  """

  def __init__(self, input_dim, output_dim, num_seqs, seq_len=20, output_seq_len=8,
               input_max_value=10.0, input_shift=None, input_scale=None, **kwargs):
    """
    :param int|None input_dim:
    :param int|dict[str,int|(int,int)|dict] output_dim:
    :param int|float num_seqs:
    :param int|dict[str,int] seq_len:
    :param int output_seq_len:
    :param float input_max_value:
    :param float|None input_shift:
    :param float|None input_scale:
    """
    super(DummyDataset2, self).__init__(input_dim=input_dim, output_dim=output_dim, num_seqs=num_seqs, **kwargs)
    self.seq_len = seq_len
    self.output_seq_len = output_seq_len
    self.input_max_value = input_max_value
    if input_shift is None:
      input_shift = -input_max_value / 2.0
    self.input_shift = input_shift
    if input_scale is None:
      input_scale = 1.0 / self.input_max_value
    self.input_scale = input_scale
    self.vocab = self.create_vocab(output_dim)

  def create_vocab(self, output_dim):
    """
    Builds a list of ``output_dim`` artificial labels ('AA', 'AB', ...).
    """
    import itertools
    chlist = 'ABCDEFG'
    base = len(chlist)
    # power = number of base-`base` digits of output_dim,
    # which guarantees base**power >= output_dim label combinations.
    power = 0
    tmp = output_dim
    while tmp != 0:
      tmp = tmp // base
      power += 1
    assert base > power, "The chosen output_dim is too big"
    return [val for val in itertools.product(chlist, repeat=power)][:output_dim]

  def generate_seq(self, seq_idx):
    """
    :param int seq_idx:
    :rtype: DatasetSeq
    """
    seq_len = self.seq_len
    output_seq_len = self.output_seq_len
    i1 = seq_idx
    i2 = i1 + seq_len * self.num_inputs
    features = numpy.array([((i % self.input_max_value) + self.input_shift) * self.input_scale
                            for i in range(i1, i2)]).reshape((seq_len, self.num_inputs))
    i1, i2 = i2, i2 + output_seq_len
    targets = {"classes": numpy.array([i % self.num_outputs["classes"][0]
                                       for i in range(i1, i2)], dtype="int32")}
    targets["raw"] = numpy.array(["".join(self.vocab[t]) for t in targets["classes"]],
                                 dtype="object")
    # print("printing", targets)
    return DatasetSeq(
      targets=targets,
      features=features,
      seq_idx=seq_idx,
    )
```
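As a quick sanity check of the `create_vocab` logic, here is a standalone re-implementation (same algorithm as above, runnable without RETURNN):

```python
import itertools


def create_vocab(output_dim, chlist="ABCDEFG"):
  # Same logic as DummyDataset2.create_vocab above: compute how many
  # characters (power) are needed so that base**power >= output_dim.
  base = len(chlist)
  power, tmp = 0, output_dim
  while tmp != 0:
    tmp //= base
    power += 1
  assert base > power, "The chosen output_dim is too big"
  return list(itertools.product(chlist, repeat=power))[:output_dim]


print(["".join(v) for v in create_vocab(30)[:5]])  # -> ['AA', 'AB', 'AC', 'AD', 'AE']
```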
Now, when I use it in this way:
"""
DummyDataset in RETURNN automatically downloads the data via `nltk`,
so no preparation is needed.
This is useful for demos/tests.
"""
from __future__ import annotations
from typing import Dict, Any
from returnn.config import get_global_config
# from .librispeech.vocabs import bpe1k, bpe10k
from ..interface import DatasetConfig, VocabConfig
config = get_global_config()
# num_outputs = {'data': (40*2, 2), 'classes': (61, 1)}
# num_inputs = num_outputs["data"][0]
# _num_seqs = {'train': 144, 'dev': 16}
class DummyDataset(DatasetConfig):
def __init__(self, audio_dim=50, output_dim=30, seq_len=30, output_seq_len=12, num_seqs=6, debug_mode=None):
super(DummyDataset, self).__init__()
if debug_mode is None:
debug_mode = config.typed_dict.get("debug_mode", False)
self.audio_dim = audio_dim
self.output_dim = output_dim
self.seq_len = seq_len
self.output_seq_len = output_seq_len
self.num_seqs = num_seqs
self.debug_mode = debug_mode
def get_extern_data(self) -> Dict[str, Dict[str, Any]]:
return {
"data": {"sparse": True, "dim": self.audio_dim},
"classes": {"sparse": True, "dim": self.output_dim},
}
def get_train_dataset(self) -> Dict[str, Any]:
return self.get_dataset("train")
def get_eval_datasets(self) -> Dict[str, Dict[str, Any]]:
return {
"dev": self.get_dataset("dev"),
"devtrain": self.get_dataset("train")}
def get_dataset(self, key, subset=None):
assert key in {"train", "dev"}
print(f"Using {key} dataset!")
return {
"class": "DummyDataset2",
"input_dim": self.audio_dim,
"output_dim": self.output_dim,
"seq_len": self.seq_len,
"output_seq_len": self.output_seq_len,
"num_seqs": self.num_seqs,
# "input_base": 10
}
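For context, this is roughly how such a `DatasetConfig` would be consumed in a config (hypothetical usage, only wiring up the methods defined above):

```python
# Hypothetical usage in a RETURNN config, just calling the methods above:
dummy = DummyDataset()
extern_data = dummy.get_extern_data()
train = dummy.get_train_dataset()
dev = dummy.get_eval_datasets()["dev"]
```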
the `dump-dataset.py` tool doesn't output the `raw`/text part of the targets:

```
Train data:
input: 50 x 1
output: {'classes': (30, 1), 'data': (50, 2)}
DummyDataset2, sequences: 6, frames: unknown
Epoch: 1
Dataset keys: ['data', 'classes']
Dataset target keys: ['classes']
Dump to stdout
seq 0/6 (16.67%) (0:00:00) tag: seq-0
seq 0/6 (16.67%) (0:00:00) data: array([[-0.5, -0.4, -0.3, ..., 0.2, 0.3, 0.4],
[-0.5, -0.4, -0.3, ..., 0.2, 0.3, 0.4],
[-0.5, -0.4, -0.3, ..., 0.2, 0.3, 0.4],
...,
[-0.5, -0.4, -0.3, ..., 0.2, 0.3, 0.4],
[-0.5, -0.4, -0.3, ..., 0.2, 0.3, 0.4],
[-0.5, -0.4, -0.3, ..., 0.2..., shape=(30, 50)
seq 0 target 'classes': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=int32), shape=(12,)
seq 1/6 (33.33%) (0:00:00) tag: seq-1
seq 1/6 (33.33%) (0:00:00) data: array([[-0.4, -0.3, -0.2, ..., 0.3, 0.4, -0.5],
[-0.4, -0.3, -0.2, ..., 0.3, 0.4, -0.5],
[-0.4, -0.3, -0.2, ..., 0.3, 0.4, -0.5],
...,
[-0.4, -0.3, -0.2, ..., 0.3, 0.4, -0.5],
[-0.4, -0.3, -0.2, ..., 0.3, 0.4, -0.5],
[-0.4, -0.3, -0.2, ..., 0.3..., shape=(30, 50)
seq 1 target 'classes': array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=int32), shape=(12,)
seq 2/6 (50.00%) (0:00:00) tag: seq-2
seq 2/6 (50.00%) (0:00:00) data: array([[-0.3, -0.2, -0.1, ..., 0.4, -0.5, -0.4],
[-0.3, -0.2, -0.1, ..., 0.4, -0.5, -0.4],
[-0.3, -0.2, -0.1, ..., 0.4, -0.5, -0.4],
...,
[-0.3, -0.2, -0.1, ..., 0.4, -0.5, -0.4],
[-0.3, -0.2, -0.1, ..., 0.4, -0.5, -0.4],
[-0.3, -0.2, -0.1, ..., 0.4..., shape=(30, 50)
seq 2 target 'classes': array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], dtype=int32), shape=(12,)
seq 3/6 (66.67%) (0:00:00) tag: seq-3
seq 3/6 (66.67%) (0:00:00) data: array([[-0.2, -0.1, 0. , ..., -0.5, -0.4, -0.3],
[-0.2, -0.1, 0. , ..., -0.5, -0.4, -0.3],
[-0.2, -0.1, 0. , ..., -0.5, -0.4, -0.3],
...,
[-0.2, -0.1, 0. , ..., -0.5, -0.4, -0.3],
[-0.2, -0.1, 0. , ..., -0.5, -0.4, -0.3],
[-0.2, -0.1, 0. , ..., -0.5..., shape=(30, 50)
seq 3 target 'classes': array([ 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype=int32), shape=(12,)
seq 4/6 (83.33%) (0:00:00) tag: seq-4
seq 4/6 (83.33%) (0:00:00) data: array([[-0.1, 0. , 0.1, ..., -0.4, -0.3, -0.2],
[-0.1, 0. , 0.1, ..., -0.4, -0.3, -0.2],
[-0.1, 0. , 0.1, ..., -0.4, -0.3, -0.2],
...,
[-0.1, 0. , 0.1, ..., -0.4, -0.3, -0.2],
[-0.1, 0. , 0.1, ..., -0.4, -0.3, -0.2],
[-0.1, 0. , 0.1, ..., -0.4..., shape=(30, 50)
seq 4 target 'classes': array([ 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=int32), shape=(12,)
seq 5/6 (100.00%) (0:00:00) tag: seq-5
seq 5/6 (100.00%) (0:00:00) data: array([[ 0. , 0.1, 0.2, ..., -0.3, -0.2, -0.1],
[ 0. , 0.1, 0.2, ..., -0.3, -0.2, -0.1],
[ 0. , 0.1, 0.2, ..., -0.3, -0.2, -0.1],
...,
[ 0. , 0.1, 0.2, ..., -0.3, -0.2, -0.1],
[ 0. , 0.1, 0.2, ..., -0.3, -0.2, -0.1],
[ 0. , 0.1, 0.2, ..., -0.3..., shape=(30, 50)
seq 5 target 'classes': array([ 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], dtype=int32), shape=(12,)
Done. Total time 0:00:00.0250. More seqs which we did not dumped: False
Seq-length 'data' Stats:
6 seqs
Mean: 30.0
Std dev: 0.0
Min/max: 30 / 30
Seq-length 'classes' Stats:
6 seqs
Mean: 12.0
Std dev: 0.0
Min/max: 12 / 12
```
EDIT: The `LibriSpeech` dataset outputs the following instead:

```
Dataset keys: ['classes', 'data', 'orth', 'raw']
Dataset target keys: ['classes', 'orth', 'raw']
Dump to stdout
seq 0/105 (0.95%) (0:00:00) tag: train-clean-360-5448-19208-0055
seq 0/105 (0.95%) (0:00:00) data: array([[-0.8777502 , -0.3416127 , 0.18146478, ..., -0.37615722,
1.0572844 , 1.8435454 ],
[-0.8748609 , -0.16077241, 0.16136283, ..., -0.5366785 ,
0.07934943, 0.6256051 ],
[-0.8743404 , -0.19774717, 0.3236745 , ..., -0.57161534,
-0.26601306, 0.04986066..., shape=(200, 50)
seq 0 target 'classes': array([ 2, 885, 249, 232, 43, 49, 463], dtype=int32), shape=(7,) ('THE POR@@ TER DID NOT ST@@ IR')
seq 0 target 'orth': array([84, 72, 69, 32, 80, 79, 82, 84, 69, 82, 32, 68, 73, 68, 32, 78, 79,
84, 32, 83, 84, 73, 82], dtype=uint8), shape=(23,) ('THE PORTER DID NOT STIR')
seq 0 target 'raw': array('THE PORTER DID NOT STIR', dtype=object), shape=()
seq 1/105 (1.90%) (0:00:00) tag: train-other-500-1374-133833-0046
seq 1/105 (1.90%) (0:00:00) data: array([[-1.0181153 , -0.3433207 , -0.5385984 , ..., 0.26934746,
-0.24952461, -0.19383669],
[-1.031907 , -0.39967808, -0.79773605, ..., -0.19201769,
-3.026414 , -1.4356858 ],
[-1.0224752 , -0.38677597, -1.0231327 , ..., -0.01516061,
-1.7981621 , 0.44004616..., shape=(192, 50)
seq 1 target 'classes': array([221, 43, 68, 197, 762, 44, 335, 71, 47, 474], dtype=int32), shape=(10,) ('ITS NOT MY FA@@ UL@@ T JO@@ SI@@ A@@ H')
seq 1 target 'orth': array([73, 84, 83, 32, 78, 79, 84, 32, 77, 89, 32, 70, 65, 85, 76, 84, 32,
74, 79, 83, 73, 65, 72], dtype=uint8), shape=(23,) ('ITS NOT MY FAULT JOSIAH')
seq 1 target 'raw': array('ITS NOT MY FAULT JOSIAH', dtype=object), shape=()
seq 2/105 (2.86%) (0:00:00) tag: train-clean-100-250-142286-0025
seq 2/105 (2.86%) (0:00:00) data: array([[-1.0639772 , -0.43535757, 0.9979144 , ..., 0.44950628,
-0.10635232, -1.2175717 ],
[-0.99800795, -0.9328595 , 1.0114769 , ..., 0.43517488,
0.40714452, -1.3986644 ],
[-0.9970233 , -2.2936897 , -0.0412274 , ..., -0.7998198 ,
-1.4338555 , -1.8155696 ..., shape=(209, 50)
seq 2 target 'classes': array([406, 591, 52, 169, 11, 30, 2, 356, 97], dtype=int32), shape=(9,) ('TOO CRO@@ W@@ DED THAT IS THE WOR@@ ST')
seq 2 target 'orth': array([84, 79, 79, 32, 67, 82, 79, 87, 68, 69, 68, 32, 84, 72, 65, 84, 32,
73, 83, 32, 84, 72, 69, 32, 87, 79, 82, 83, 84], dtype=uint8), shape=(29,) ('TOO CROWDED THAT IS THE WORST')
seq 2 target 'raw': array('TOO CROWDED THAT IS THE WORST', dtype=object), shape=()
seq 3/105 (3.81%) (0:00:00) tag: train-clean-360-6119-48032-0040
seq 3/105 (3.81%) (0:00:00) data: array([[-0.9161725 , -0.3515298 , -0.92236227, ..., 0.4479296 ,
1.3222326 , -0.60767865],
[-0.94417214, 0.02018955, -1.0759346 , ..., 1.0356908 ,
0.14922525, -0.18327384],
[-0.9349811 , -0.1103144 , -0.9856503 , ..., -0.6657466 ,
-1.6779268 , -1.789029 ..., shape=(213, 50)
seq 3 target 'classes': array([ 84, 528, 192, 100, 214, 724, 229], dtype=int32), shape=(7,) ('UN@@ NAT@@ UR@@ AL NO@@ TIONS ABOUT')
seq 3 target 'orth': array([85, 78, 78, 65, 84, 85, 82, 65, 76, 32, 78, 79, 84, 73, 79, 78, 83,
32, 65, 66, 79, 85, 84], dtype=uint8), shape=(23,) ('UNNATURAL NOTIONS ABOUT')
seq 3 target 'raw': array('UNNATURAL NOTIONS ABOUT', dtype=object), shape=()
seq 4/105 (4.76%) (0:00:00) tag: train-other-500-5840-54188-0025
seq 4/105 (4.76%) (0:00:00) data: array([[-0.6875868 , 0.55698854, -1.3179873 , ..., 0.49193764,
1.1466433 , -0.1786345 ],
[-0.69063985, 0.13344799, -1.1449655 , ..., 0.11755662,
-0.45420724, 0.569198 ],
[-0.68622375, 0.31605458, -1.1161832 , ..., -0.22542922,
1.346604 , -0.05708468..., shape=(231, 50)
seq 4 target 'classes': array([ 93, 82, 67, 770, 858, 90, 418, 127], dtype=int32), shape=(8,) ('FROM SE@@ V@@ ENTY FOUR CON@@ CER@@ TS')
seq 4 target 'orth': array([70, 82, 79, 77, 32, 83, 69, 86, 69, 78, 84, 89, 32, 70, 79, 85, 82,
32, 67, 79, 78, 67, 69, 82, 84, 83], dtype=uint8), shape=(26,) ('FROM SEVENTY FOUR CONCERTS')
seq 4 target 'raw': array('FROM SEVENTY FOUR CONCERTS', dtype=object), shape=()
seq 5/105 (5.71%) (0:00:00) tag: train-other-500-3319-171003-0063
seq 5/105 (5.71%) (0:00:00) data: array([[-0.8110962 , -0.14148454, -0.4258629 , ..., 0.23668024,
-0.15315071, 0.09615742],
[-0.8111036 , -0.15401922, -0.8482575 , ..., -0.20874035,
-0.6203824 , -0.24077122],
[-0.8120254 , -0.12016892, -0.6180197 , ..., 0.05419957,
0.6076831 , 1.8362247 ..., shape=(223, 50)
seq 5 target 'classes': array([ 2, 518, 108, 196, 669, 315, 79, 340, 33, 18], dtype=int32), shape=(10,) ('THE SAME EX@@ PRE@@ SSION COMP@@ LE@@ X@@ I@@ ON')
seq 5 target 'orth': array([84, 72, 69, 32, 83, 65, 77, 69, 32, 69, 88, 80, 82, 69, 83, 83, 73,
79, 78, 32, 67, 79, 77, 80, 76, 69, 88, 73, 79, 78], dtype=uint8), shape=(30,) ('THE SAME EXPRESSION COMPLEXION')
seq 5 target 'raw': array('THE SAME EXPRESSION COMPLEXION', dtype=object), shape=()
seq 6/105 (6.67%) (0:00:00) tag: train-other-500-1767-142932-0000
seq 6/105 (6.67%) (0:00:00) data: array([[-0.86693347, -5.3405786 , -0.50791055, ..., -2.4631803 ,
-2.9376633 , 0.13745485],
[-0.8714278 , -3.3832295 , 0.8302062 , ..., 0.01946406,
1.518743 , 1.2339076 ],
[-0.8331261 , -1.1863018 , -0.06199729, ..., -2.3934097 ,
-0.24568689, 0.8423811 ..., shape=(229, 50)
seq 6 target 'classes': array([ 6, 609, 952, 3, 56, 289, 61], dtype=int32), shape=(7,) ('A NEW FRIEND AND AN OLD ONE')
seq 6 target 'orth': array([65, 32, 78, 69, 87, 32, 70, 82, 73, 69, 78, 68, 32, 65, 78, 68, 32,
65, 78, 32, 79, 76, 68, 32, 79, 78, 69], dtype=uint8), shape=(27,) ('A NEW FRIEND AND AN OLD ONE')
seq 6 target 'raw': array('A NEW FRIEND AND AN OLD ONE', dtype=object), shape=()
Done. Total time 0:00:00.3895. More seqs which we did not dumped: True
Seq-length 'classes' Stats:
6 seqs
Mean: 7.636363636363637
Std dev: 2.0123585110162416
Min/max: 4 / 10
Seq-length 'data' Stats:
6 seqs
Mean: 206.0909090909091
Std dev: 19.33651536462247
Min/max: 165 / 231
Seq-length 'orth' Stats:
6 seqs
Mean: 25.18181818181818
Std dev: 6.685373203945541
Min/max: 13 / 40
Seq-length 'raw' Stats:
6 seqs
Mean: 1.0
Std dev: 0.0
Min/max: 1 / 1
Quitting
```
I think you also want to have a separate tool like `tools/test-network.py` or so? The dummy dataset would just be created such that it matches `extern_data` (or `num_inputs`/`num_outputs` for old configs).
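As a rough sketch of what "matching `extern_data`" could mean (hypothetical helper, not an existing RETURNN function):

```python
# Hypothetical helper: derive the dummy-dataset dict from extern_data.
def dummy_dataset_from_extern_data(extern_data, num_seqs=6, seq_len=30, output_seq_len=12):
  return {
    "class": "DummyDataset2",
    "input_dim": extern_data["data"]["dim"],
    "output_dim": extern_data["classes"]["dim"],
    "num_seqs": num_seqs,
    "seq_len": seq_len,
    "output_seq_len": output_seq_len,
  }


print(dummy_dataset_from_extern_data({"data": {"dim": 50}, "classes": {"sparse": True, "dim": 30}}))
```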
The dummy dataset above is the same as `DummyDatasetMultipleSequenceLength`, only that I added a `Vocab` for the target labels.
> I think you also want to have a separate tool like `tools/test-network.py` or so?

That would be the cleanest solution actually. `tools/test-network.py` should look as similar as possible to normal training, only that the dataset is much smaller (one sequence per subepoch). And we can print some more information on the way too.
> The dummy dataset above is the same as `DummyDatasetMultipleSequenceLength`, only that I added a `Vocab` for the target labels.

Why do you need the labels (vocab) in the dataset for testing? And if it is the same otherwise, why not extend `DummyDatasetMultipleSequenceLength` instead?
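For illustration, extending it could look roughly like this. This is only a sketch: it assumes `DummyDatasetMultipleSequenceLength.generate_seq()` returns a `DatasetSeq` whose `targets` dict can be extended, which should be checked against the actual implementation:

```python
import numpy
from returnn.datasets.generating import DummyDatasetMultipleSequenceLength


class DummyDatasetWithVocab(DummyDatasetMultipleSequenceLength):
  """Sketch: reuse the existing dummy data, only attach string labels."""

  def __init__(self, *args, **kwargs):
    super(DummyDatasetWithVocab, self).__init__(*args, **kwargs)
    # Artificial vocab, one string label per class index.
    self.vocab = ["label%i" % i for i in range(self.num_outputs["classes"][0])]

  def generate_seq(self, seq_idx):
    seq = super(DummyDatasetWithVocab, self).generate_seq(seq_idx)
    # Assumption: the targets dict of the DatasetSeq is accessible like this.
    classes = seq.targets["classes"]
    seq.targets["raw"] = numpy.array(" ".join(self.vocab[t] for t in classes), dtype="object")
    return seq
```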
> > I think you also want to have a separate tool like `tools/test-network.py` or so?
>
> That would be the cleanest solution actually.

Yes, I thought that was the intention. But in any case, such a tool would then internally make use of such a dataset. (But I think `DummyDatasetMultipleSequenceLength` or maybe also `StaticDataset` are already fine.)
> `tools/test-network.py` should look as similar as possible to normal training

Yes, you would probably just call `test-network.py <config> <other-args>` instead of `rnn.py <config> <other-args>`.
> only that the dataset is much smaller (one sequence per subepoch).

It should be a couple of seqs, such that the batch size is >1 (with batch size 1, you can sometimes hide bugs which only occur with batch size >1). And you also want it to run for more than one step per (sub)epoch, because some bugs might only be triggered in step 2 or later (e.g. the keep-over-epoch logic of the hidden state or so).
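For example (numbers purely illustrative), a handful of seqs already gives both multiple seqs per batch and multiple steps per subepoch:

```python
# Illustrative numbers only: 6 seqs with up to 2 seqs per batch
# (e.g. via the max_seqs setting) give 3 steps per (sub)epoch,
# each step with batch size > 1.
num_seqs = 6
max_seqs_per_batch = 2
steps_per_subepoch = -(-num_seqs // max_seqs_per_batch)  # ceil division
print(steps_per_subepoch)  # -> 3
```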
> And we can print some more information on the way too.

Like what?