
Dummy Dataset for ASR purposes

Open jotix16 opened this issue 3 years ago • 3 comments

For complicated custom network construction or MultiStage training, such as in returnn-experiments #60, it would be nice to have some dummy dataset for fast testing. By testing, I mean checking whether the network works as intended for different (sub)epochs.

One could of course use the dataset at hand, such as LibriSpeech, Switchboard or Timit, but (sub)epochs are long and can take several minutes.

I have two ideas right now.

1. Add an option on the Dataset class to use only one seq per (sub)epoch
One would need to find a way to use only one sequence per (sub)epoch from the dataset at hand. That could be set as an option on the dataset class and would hopefully not require too many changes. I tried looking into it but couldn't figure out how to do that. Maybe I need to spend some more time with the Dataset code.
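
For illustration, one existing knob that comes close is the base Dataset option partition_epoch, which splits the corpus into sub-epochs; setting it to roughly the number of sequences leaves about one sequence per sub-epoch. A sketch (the path and sequence count are placeholders, and the remaining required options are omitted):

train = {
  "class": "LibriSpeechCorpus",
  "path": "/path/to/librispeech",  # placeholder
  "prefix": "train",
  # ... audio/target options omitted ...
  "partition_epoch": 281241,  # roughly the number of train seqs -> ~1 seq per sub-epoch
}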

2. Create a DummyDataset over which we have full control
The other way would be to have a DummyDataset class where we have full control and can decide num_seqs, input_seq_len, output_seq_len, input_dim, output_dim and the vocab.

Here I made some progress but am not able to properly incorporate the vocab.

The dataset looks like this:

import numpy

from returnn.datasets.basic import DatasetSeq
from returnn.datasets.generating import GeneratingDataset


class DummyDataset2(GeneratingDataset):
  """
  Some dummy data, which does not have any meaning.
  If you want to have artificial data with some meaning, look at other datasets here.
  The inputs are dense data; the outputs are sparse.
  """

  def __init__(self, input_dim, output_dim, num_seqs, seq_len=20, output_seq_len=8,
               input_max_value=10.0, input_shift=None, input_scale=None, **kwargs):
    """
    :param int|None input_dim:
    :param int|dict[str,int|(int,int)|dict] output_dim:
    :param int|float num_seqs:
    :param int|dict[str,int] seq_len:
    :param int output_seq_len:
    :param float input_max_value:
    :param float|None input_shift:
    :param float|None input_scale:
    """
    super(DummyDataset2, self).__init__(input_dim=input_dim, output_dim=output_dim, num_seqs=num_seqs, **kwargs)
    self.seq_len = seq_len
    self.output_seq_len = output_seq_len
    self.input_max_value = input_max_value
    if input_shift is None:
      input_shift = -input_max_value / 2.0
    self.input_shift = input_shift
    if input_scale is None:
      input_scale = 1.0 / self.input_max_value
    self.input_scale = input_scale
    self.vocab = self.create_vocab(output_dim)

  def create_vocab(self, output_dim):
    """
    :param int output_dim:
    :rtype: list[tuple[str]]
    """
    import itertools
    chlist = 'ABCDEFG'
    base = len(chlist)
    # power = number of base-`base` digits of output_dim;
    # this guarantees base ** power >= output_dim.
    power = 0
    tmp = output_dim
    while tmp != 0:
      tmp //= base
      power += 1
    assert base > power, "The chosen output_dim is too big"
    return list(itertools.product(chlist, repeat=power))[:output_dim]

  def generate_seq(self, seq_idx):
    """
    :param int seq_idx:
    :rtype: DatasetSeq
    """
    seq_len = self.seq_len
    output_seq_len = self.output_seq_len
    i1 = seq_idx
    i2 = i1 + seq_len * self.num_inputs
    features = numpy.array([((i % self.input_max_value) + self.input_shift) * self.input_scale
                            for i in range(i1, i2)]).reshape((seq_len, self.num_inputs))
    i1, i2 = i2, i2 + output_seq_len
    targets = {"classes": numpy.array([i % self.num_outputs["classes"][0]
                                      for i in range(i1, i2)], dtype="int32")}
    targets["raw"] = numpy.array(["".join(self.vocab[t]) for t in targets["classes"]],
                                 dtype="object")
    return DatasetSeq(
      targets=targets,
      features=features,
      seq_idx=seq_idx,
    )
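
For reference, a standalone illustration of what create_vocab yields, here for output_dim=30 (this snippet is independent of RETURNN):

import itertools

chlist = "ABCDEFG"
vocab = list(itertools.product(chlist, repeat=2))[:30]
print(vocab[:3])           # [('A', 'A'), ('A', 'B'), ('A', 'C')]
print("".join(vocab[29]))  # 'EB', the label string for class index 29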

Now when I use it in this way:

"""
DummyDataset in RETURNN generates its data on the fly,
so no preparation is needed.
This is useful for demos/tests.
"""

from __future__ import annotations
from typing import Dict, Any
from returnn.config import get_global_config
# from .librispeech.vocabs import bpe1k, bpe10k

from ..interface import DatasetConfig, VocabConfig


config = get_global_config()

# num_outputs = {'data': (40*2, 2), 'classes': (61, 1)}
# num_inputs = num_outputs["data"][0]
# _num_seqs = {'train': 144, 'dev': 16}


class DummyDataset(DatasetConfig):
  def __init__(self, audio_dim=50, output_dim=30, seq_len=30, output_seq_len=12, num_seqs=6, debug_mode=None):
    super(DummyDataset, self).__init__()
    if debug_mode is None:
      debug_mode = config.typed_dict.get("debug_mode", False)
    self.audio_dim = audio_dim
    self.output_dim = output_dim
    self.seq_len = seq_len
    self.output_seq_len = output_seq_len
    self.num_seqs = num_seqs
    self.debug_mode = debug_mode

  def get_extern_data(self) -> Dict[str, Dict[str, Any]]:
    return {
      # "data" holds dense float features (see the dump below), so it is not sparse.
      "data": {"dim": self.audio_dim},
      "classes": {"sparse": True, "dim": self.output_dim},
    }

  def get_train_dataset(self) -> Dict[str, Any]:
    return self.get_dataset("train")

  def get_eval_datasets(self) -> Dict[str, Dict[str, Any]]:
    return {
      "dev": self.get_dataset("dev"),
      "devtrain": self.get_dataset("train")}

  def get_dataset(self, key, subset=None):
    assert key in {"train", "dev"}
    print(f"Using {key} dataset!")
    return {
      "class": "DummyDataset2",
      "input_dim": self.audio_dim,
      "output_dim": self.output_dim,
      "seq_len": self.seq_len,
      "output_seq_len": self.output_seq_len,
      "num_seqs": self.num_seqs,
      # "input_base": 10
    }
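
For context, this DatasetConfig is then consumed roughly like this (a sketch; the surrounding setup code is assumed):

dummy = DummyDataset(audio_dim=50, output_dim=30, seq_len=30, output_seq_len=12, num_seqs=6)
extern_data = dummy.get_extern_data()
train = dummy.get_train_dataset()
eval_datasets = dummy.get_eval_datasets()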

With this setup, the tools/dump-dataset.py script doesn't output the raw/text part of the targets (presumably because "raw" is not registered in output_dim and thus not in the dataset's target list; only "classes" is):

Train data:
  input: 50 x 1
  output: {'classes': (30, 1), 'data': (50, 2)}
  DummyDataset2, sequences: 6, frames: unknown
Epoch: 1
Dataset keys: ['data', 'classes']
Dataset target keys: ['classes']
Dump to stdout
seq 0/6 (16.67%) (0:00:00) tag: seq-0
seq 0/6 (16.67%) (0:00:00) data: array([[-0.5, -0.4, -0.3, ...,  0.2,  0.3,  0.4],
       [-0.5, -0.4, -0.3, ...,  0.2,  0.3,  0.4],
       [-0.5, -0.4, -0.3, ...,  0.2,  0.3,  0.4],
       ...,
       [-0.5, -0.4, -0.3, ...,  0.2,  0.3,  0.4],
       [-0.5, -0.4, -0.3, ...,  0.2,  0.3,  0.4],
       [-0.5, -0.4, -0.3, ...,  0.2..., shape=(30, 50)
seq 0 target 'classes': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11], dtype=int32), shape=(12,)
seq 1/6 (33.33%) (0:00:00) tag: seq-1
seq 1/6 (33.33%) (0:00:00) data: array([[-0.4, -0.3, -0.2, ...,  0.3,  0.4, -0.5],
       [-0.4, -0.3, -0.2, ...,  0.3,  0.4, -0.5],
       [-0.4, -0.3, -0.2, ...,  0.3,  0.4, -0.5],
       ...,
       [-0.4, -0.3, -0.2, ...,  0.3,  0.4, -0.5],
       [-0.4, -0.3, -0.2, ...,  0.3,  0.4, -0.5],
       [-0.4, -0.3, -0.2, ...,  0.3..., shape=(30, 50)
seq 1 target 'classes': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12], dtype=int32), shape=(12,)
seq 2/6 (50.00%) (0:00:00) tag: seq-2
seq 2/6 (50.00%) (0:00:00) data: array([[-0.3, -0.2, -0.1, ...,  0.4, -0.5, -0.4],
       [-0.3, -0.2, -0.1, ...,  0.4, -0.5, -0.4],
       [-0.3, -0.2, -0.1, ...,  0.4, -0.5, -0.4],
       ...,
       [-0.3, -0.2, -0.1, ...,  0.4, -0.5, -0.4],
       [-0.3, -0.2, -0.1, ...,  0.4, -0.5, -0.4],
       [-0.3, -0.2, -0.1, ...,  0.4..., shape=(30, 50)
seq 2 target 'classes': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13], dtype=int32), shape=(12,)
seq 3/6 (66.67%) (0:00:00) tag: seq-3
seq 3/6 (66.67%) (0:00:00) data: array([[-0.2, -0.1,  0. , ..., -0.5, -0.4, -0.3],
       [-0.2, -0.1,  0. , ..., -0.5, -0.4, -0.3],
       [-0.2, -0.1,  0. , ..., -0.5, -0.4, -0.3],
       ...,
       [-0.2, -0.1,  0. , ..., -0.5, -0.4, -0.3],
       [-0.2, -0.1,  0. , ..., -0.5, -0.4, -0.3],
       [-0.2, -0.1,  0. , ..., -0.5..., shape=(30, 50)
seq 3 target 'classes': array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14], dtype=int32), shape=(12,)
seq 4/6 (83.33%) (0:00:00) tag: seq-4
seq 4/6 (83.33%) (0:00:00) data: array([[-0.1,  0. ,  0.1, ..., -0.4, -0.3, -0.2],
       [-0.1,  0. ,  0.1, ..., -0.4, -0.3, -0.2],
       [-0.1,  0. ,  0.1, ..., -0.4, -0.3, -0.2],
       ...,
       [-0.1,  0. ,  0.1, ..., -0.4, -0.3, -0.2],
       [-0.1,  0. ,  0.1, ..., -0.4, -0.3, -0.2],
       [-0.1,  0. ,  0.1, ..., -0.4..., shape=(30, 50)
seq 4 target 'classes': array([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15], dtype=int32), shape=(12,)
seq 5/6 (100.00%) (0:00:00) tag: seq-5
seq 5/6 (100.00%) (0:00:00) data: array([[ 0. ,  0.1,  0.2, ..., -0.3, -0.2, -0.1],
       [ 0. ,  0.1,  0.2, ..., -0.3, -0.2, -0.1],
       [ 0. ,  0.1,  0.2, ..., -0.3, -0.2, -0.1],
       ...,
       [ 0. ,  0.1,  0.2, ..., -0.3, -0.2, -0.1],
       [ 0. ,  0.1,  0.2, ..., -0.3, -0.2, -0.1],
       [ 0. ,  0.1,  0.2, ..., -0.3..., shape=(30, 50)
seq 5 target 'classes': array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16], dtype=int32), shape=(12,)
Done. Total time 0:00:00.0250. More seqs which we did not dumped: False
Seq-length 'data' Stats:
  6 seqs
  Mean: 30.0
  Std dev: 0.0
  Min/max: 30 / 30
Seq-length 'classes' Stats:
  6 seqs
  Mean: 12.0
  Std dev: 0.0
  Min/max: 12 / 12

EDIT
For comparison, the LibriSpeech dataset outputs the following instead:

Dataset keys: ['classes', 'data', 'orth', 'raw']
Dataset target keys: ['classes', 'orth', 'raw']
Dump to stdout
seq 0/105 (0.95%) (0:00:00) tag: train-clean-360-5448-19208-0055
seq 0/105 (0.95%) (0:00:00) data: array([[-0.8777502 , -0.3416127 ,  0.18146478, ..., -0.37615722,
         1.0572844 ,  1.8435454 ],
       [-0.8748609 , -0.16077241,  0.16136283, ..., -0.5366785 ,
         0.07934943,  0.6256051 ],
       [-0.8743404 , -0.19774717,  0.3236745 , ..., -0.57161534,
        -0.26601306,  0.04986066..., shape=(200, 50)
seq 0 target 'classes': array([  2, 885, 249, 232,  43,  49, 463], dtype=int32), shape=(7,) ('THE POR@@ TER DID NOT ST@@ IR')
seq 0 target 'orth': array([84, 72, 69, 32, 80, 79, 82, 84, 69, 82, 32, 68, 73, 68, 32, 78, 79,
       84, 32, 83, 84, 73, 82], dtype=uint8), shape=(23,) ('THE PORTER DID NOT STIR')
seq 0 target 'raw': array('THE PORTER DID NOT STIR', dtype=object), shape=()
seq 1/105 (1.90%) (0:00:00) tag: train-other-500-1374-133833-0046
seq 1/105 (1.90%) (0:00:00) data: array([[-1.0181153 , -0.3433207 , -0.5385984 , ...,  0.26934746,
        -0.24952461, -0.19383669],
       [-1.031907  , -0.39967808, -0.79773605, ..., -0.19201769,
        -3.026414  , -1.4356858 ],
       [-1.0224752 , -0.38677597, -1.0231327 , ..., -0.01516061,
        -1.7981621 ,  0.44004616..., shape=(192, 50)
seq 1 target 'classes': array([221,  43,  68, 197, 762,  44, 335,  71,  47, 474], dtype=int32), shape=(10,) ('ITS NOT MY FA@@ UL@@ T JO@@ SI@@ A@@ H')
seq 1 target 'orth': array([73, 84, 83, 32, 78, 79, 84, 32, 77, 89, 32, 70, 65, 85, 76, 84, 32,
       74, 79, 83, 73, 65, 72], dtype=uint8), shape=(23,) ('ITS NOT MY FAULT JOSIAH')
seq 1 target 'raw': array('ITS NOT MY FAULT JOSIAH', dtype=object), shape=()
seq 2/105 (2.86%) (0:00:00) tag: train-clean-100-250-142286-0025
seq 2/105 (2.86%) (0:00:00) data: array([[-1.0639772 , -0.43535757,  0.9979144 , ...,  0.44950628,
        -0.10635232, -1.2175717 ],
       [-0.99800795, -0.9328595 ,  1.0114769 , ...,  0.43517488,
         0.40714452, -1.3986644 ],
       [-0.9970233 , -2.2936897 , -0.0412274 , ..., -0.7998198 ,
        -1.4338555 , -1.8155696 ..., shape=(209, 50)
seq 2 target 'classes': array([406, 591,  52, 169,  11,  30,   2, 356,  97], dtype=int32), shape=(9,) ('TOO CRO@@ W@@ DED THAT IS THE WOR@@ ST')
seq 2 target 'orth': array([84, 79, 79, 32, 67, 82, 79, 87, 68, 69, 68, 32, 84, 72, 65, 84, 32,
       73, 83, 32, 84, 72, 69, 32, 87, 79, 82, 83, 84], dtype=uint8), shape=(29,) ('TOO CROWDED THAT IS THE WORST')
seq 2 target 'raw': array('TOO CROWDED THAT IS THE WORST', dtype=object), shape=()
seq 3/105 (3.81%) (0:00:00) tag: train-clean-360-6119-48032-0040
seq 3/105 (3.81%) (0:00:00) data: array([[-0.9161725 , -0.3515298 , -0.92236227, ...,  0.4479296 ,
         1.3222326 , -0.60767865],
       [-0.94417214,  0.02018955, -1.0759346 , ...,  1.0356908 ,
         0.14922525, -0.18327384],
       [-0.9349811 , -0.1103144 , -0.9856503 , ..., -0.6657466 ,
        -1.6779268 , -1.789029  ..., shape=(213, 50)
seq 3 target 'classes': array([ 84, 528, 192, 100, 214, 724, 229], dtype=int32), shape=(7,) ('UN@@ NAT@@ UR@@ AL NO@@ TIONS ABOUT')
seq 3 target 'orth': array([85, 78, 78, 65, 84, 85, 82, 65, 76, 32, 78, 79, 84, 73, 79, 78, 83,
       32, 65, 66, 79, 85, 84], dtype=uint8), shape=(23,) ('UNNATURAL NOTIONS ABOUT')
seq 3 target 'raw': array('UNNATURAL NOTIONS ABOUT', dtype=object), shape=()
seq 4/105 (4.76%) (0:00:00) tag: train-other-500-5840-54188-0025
seq 4/105 (4.76%) (0:00:00) data: array([[-0.6875868 ,  0.55698854, -1.3179873 , ...,  0.49193764,
         1.1466433 , -0.1786345 ],
       [-0.69063985,  0.13344799, -1.1449655 , ...,  0.11755662,
        -0.45420724,  0.569198  ],
       [-0.68622375,  0.31605458, -1.1161832 , ..., -0.22542922,
         1.346604  , -0.05708468..., shape=(231, 50)
seq 4 target 'classes': array([ 93,  82,  67, 770, 858,  90, 418, 127], dtype=int32), shape=(8,) ('FROM SE@@ V@@ ENTY FOUR CON@@ CER@@ TS')
seq 4 target 'orth': array([70, 82, 79, 77, 32, 83, 69, 86, 69, 78, 84, 89, 32, 70, 79, 85, 82,
       32, 67, 79, 78, 67, 69, 82, 84, 83], dtype=uint8), shape=(26,) ('FROM SEVENTY FOUR CONCERTS')
seq 4 target 'raw': array('FROM SEVENTY FOUR CONCERTS', dtype=object), shape=()
seq 5/105 (5.71%) (0:00:00) tag: train-other-500-3319-171003-0063
seq 5/105 (5.71%) (0:00:00) data: array([[-0.8110962 , -0.14148454, -0.4258629 , ...,  0.23668024,
        -0.15315071,  0.09615742],
       [-0.8111036 , -0.15401922, -0.8482575 , ..., -0.20874035,
        -0.6203824 , -0.24077122],
       [-0.8120254 , -0.12016892, -0.6180197 , ...,  0.05419957,
         0.6076831 ,  1.8362247 ..., shape=(223, 50)
seq 5 target 'classes': array([  2, 518, 108, 196, 669, 315,  79, 340,  33,  18], dtype=int32), shape=(10,) ('THE SAME EX@@ PRE@@ SSION COMP@@ LE@@ X@@ I@@ ON')
seq 5 target 'orth': array([84, 72, 69, 32, 83, 65, 77, 69, 32, 69, 88, 80, 82, 69, 83, 83, 73,
       79, 78, 32, 67, 79, 77, 80, 76, 69, 88, 73, 79, 78], dtype=uint8), shape=(30,) ('THE SAME EXPRESSION COMPLEXION')
seq 5 target 'raw': array('THE SAME EXPRESSION COMPLEXION', dtype=object), shape=()
seq 6/105 (6.67%) (0:00:00) tag: train-other-500-1767-142932-0000
seq 6/105 (6.67%) (0:00:00) data: array([[-0.86693347, -5.3405786 , -0.50791055, ..., -2.4631803 ,
        -2.9376633 ,  0.13745485],
       [-0.8714278 , -3.3832295 ,  0.8302062 , ...,  0.01946406,
         1.518743  ,  1.2339076 ],
       [-0.8331261 , -1.1863018 , -0.06199729, ..., -2.3934097 ,
        -0.24568689,  0.8423811 ..., shape=(229, 50)
seq 6 target 'classes': array([  6, 609, 952,   3,  56, 289,  61], dtype=int32), shape=(7,) ('A NEW FRIEND AND AN OLD ONE')
seq 6 target 'orth': array([65, 32, 78, 69, 87, 32, 70, 82, 73, 69, 78, 68, 32, 65, 78, 68, 32,
       65, 78, 32, 79, 76, 68, 32, 79, 78, 69], dtype=uint8), shape=(27,) ('A NEW FRIEND AND AN OLD ONE')
seq 6 target 'raw': array('A NEW FRIEND AND AN OLD ONE', dtype=object), shape=()
Done. Total time 0:00:00.3895. More seqs which we did not dumped: True
Seq-length 'classes' Stats:
  6 seqs
  Mean: 7.636363636363637
  Std dev: 2.0123585110162416
  Min/max: 4 / 10
Seq-length 'data' Stats:
  6 seqs
  Mean: 206.0909090909091
  Std dev: 19.33651536462247
  Min/max: 165 / 231
Seq-length 'orth' Stats:
  6 seqs
  Mean: 25.18181818181818
  Std dev: 6.685373203945541
  Min/max: 13 / 40
Seq-length 'raw' Stats:
  6 seqs
  Mean: 1.0
  Std dev: 0.0
  Min/max: 1 / 1
Quitting

jotix16, Apr 09 '21

I think you also want to have a separate tool like tools/test-network.py or so?

The dummy dataset would just be created such that it matches extern_data (or num_inputs/num_outputs for old configs).
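
For illustration, such a dataset could be derived from extern_data roughly like this (a sketch; the StaticDataset usage and the shapes are assumptions to verify):

import numpy
from returnn.datasets.generating import StaticDataset

extern_data = {"data": {"dim": 50}, "classes": {"dim": 30, "sparse": True}}
# Random data matching extern_data, with a few different seq lengths.
data = [
  {"data": numpy.random.uniform(-1., 1., size=(seq_len, 50)).astype("float32"),
   "classes": numpy.random.randint(0, 30, size=(seq_len // 3,)).astype("int32")}
  for seq_len in (18, 24, 30)]
dataset = StaticDataset(data, output_dim={"data": (50, 2), "classes": (30, 1)})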

albertz, Apr 09 '21

The dummy dataset above is the same as DummyDatasetMultipleSequenceLength, except that I added a vocab for the target labels.

I think you also want to have a separate tool like tools/test-network.py or so?

That would actually be the cleanest solution. tools/test-network.py should look as similar as possible to normal training, only that the dataset is much smaller (one sequence per subepoch). And we can print some more information along the way too.

jotix16, Apr 09 '21

The dummy dataset above is the same as DummyDatasetMultipleSequenceLength, except that I added a vocab for the target labels.

Why do you need the labels (vocab) in the dataset for testing?

And if it is the same otherwise, why not extend DummyDatasetMultipleSequenceLength instead?
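
For illustration, such an extension might look roughly like this (a sketch: the base-class signature and the DatasetSeq handling would need checking, and create_vocab is assumed to be a module-level version of the helper above):

import numpy
from returnn.datasets.generating import DummyDatasetMultipleSequenceLength

class DummyDatasetWithVocab(DummyDatasetMultipleSequenceLength):
  """DummyDatasetMultipleSequenceLength plus a dummy vocab / raw string targets."""

  def __init__(self, output_dim, **kwargs):
    super(DummyDatasetWithVocab, self).__init__(output_dim=output_dim, **kwargs)
    self.vocab = create_vocab(output_dim)  # hypothetical module-level create_vocab

  def generate_seq(self, seq_idx):
    seq = super(DummyDatasetWithVocab, self).generate_seq(seq_idx)
    classes = seq.get_data("classes")
    # Attach the raw label string as an additional data key.
    seq.features["raw"] = numpy.array(
      " ".join("".join(self.vocab[t]) for t in classes), dtype="object")
    return seq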

I think you also want to have a separate tool like tools/test-network.py or so?

That would actually be the cleanest solution.

Yes, I thought that was the intention. But in any case, such a tool would then internally make use of such a dataset. (But I think DummyDatasetMultipleSequenceLength or maybe also StaticDataset are already fine.)

tools/test-network.py should look as similar as possible to normal training

Yes, you would probably just call test-network.py <config> <other-args> instead of rnn.py <config> <other-args>.

only that the dataset is much smaller (one sequence per subepoch).

It should be a couple of seqs, such that the batch size is >1 (with batch size 1, you can sometimes hide bugs which occur only with batch size >1). And you also want it to run for more than one step per (sub)epoch, because some bugs might only be triggered in step 2 or later (e.g. the keep-over-epoch logic of the hidden state or so).
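
For example, with the DummyDataset2 from above, a test config could satisfy both constraints like this (the numbers are only illustrative):

train = {"class": "DummyDataset2", "input_dim": 50, "output_dim": 30,
         "seq_len": 30, "output_seq_len": 12, "num_seqs": 6}
batch_size = 60  # 60 input frames = 2 seqs of length 30 -> batch dim 2
max_seqs = 2     # 6 seqs, 2 per batch -> 3 steps per (sub)epoch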

And we can print some more information along the way too.

Like what?

albertz, Apr 09 '21