
recipes/Common voice issue

Open monaabdelazim opened this issue 4 years ago • 10 comments

I have been trying to apply the Common Voice recipe to a new language. I managed to construct the data files, but when I reached the fit step I got this error:

speechbrain.tokenizers.SentencePiece - Tokenizer is already trained.
speechbrain.tokenizers.SentencePiece - ==== Loading Tokenizer ===
speechbrain.tokenizers.SentencePiece - Tokenizer path: results/CRDNN_it/1234/save\500_unigram.model
speechbrain.tokenizers.SentencePiece - Tokenizer vocab_size: 500
speechbrain.tokenizers.SentencePiece - Tokenizer type: unigram
speechbrain.core - 148.2M trainable parameters in ASR
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
speechbrain.utils.epoch_loop - Going into epoch 1
0%| | 0/2140 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Mona\anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\Mona\anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
speechbrain.core - Exception:
Traceback (most recent call last):
  File "train.py", line 329, in <module>
    asr_brain.fit(
  File "C:\Users\Mona\anaconda3\lib\site-packages\speechbrain\core.py", line 1031, in fit
    for batch in t:
  File "C:\Users\Mona\anaconda3\lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "C:\Users\Mona\anaconda3\lib\site-packages\speechbrain\dataio\dataloader.py", line 206, in __iter__
    iterator = super().__iter__()
  File "C:\Users\Mona\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "C:\Users\Mona\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\Mona\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 918, in __init__
    w.start()
  File "C:\Users\Mona\anaconda3\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\Mona\anaconda3\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\Mona\anaconda3\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\Mona\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\Mona\anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'dataio_prepare.<locals>.audio_pipeline'

monaabdelazim avatar Oct 01 '21 23:10 monaabdelazim

@Gastron Do you think this could be due to a problem in the definition of the audio pipeline? @monaabdelazim, could you please post your .py files as well?

TParcollet avatar Oct 04 '21 10:10 TParcollet

Yeah there is some issue with how the data loading pipeline functions get defined. If you can share the Python training script it might be easy to spot what you need to change.

Gastron avatar Oct 04 '21 11:10 Gastron

Please find the training script below: train.txt. Thanks in advance.

monaabdelazim avatar Oct 04 '21 20:10 monaabdelazim

Huh, similar code actually works fine elsewhere, but I think I know what the issue is. You're using the spawn start method (the Windows default) to create the background data-loading processes, instead of fork (the Unix default); since you're working on Windows, fork is simply not available. The short explanation, as far as I understand it, is that fork copies everything in memory, while spawn pickles objects so they can be recreated in the new process. The dynamic items created inside dataio_prepare are local functions, which cannot be pickled, so spawn cannot transfer them to the background process. The fix is to move the dynamic items to module level (outside dataio_prepare).
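To illustrate, here is a minimal stand-alone sketch (not SpeechBrain code; make_pipeline and the pipeline names are just placeholders) of why spawn chokes on a function defined inside another function, while a module-level function is fine:

import pickle

def make_pipeline():
    # A function defined inside another function, like the audio_pipeline
    # created inside dataio_prepare, is a "local object":
    def audio_pipeline(wav):
        return wav
    return audio_pipeline

def module_level_pipeline(wav):
    return wav

# Module-level functions are pickled by reference, so this works:
pickle.dumps(module_level_pipeline)

# Local functions cannot be pickled; this raises the same error that spawn
# hits when it tries to send the dataset to a new worker process:
try:
    pickle.dumps(make_pipeline())
except AttributeError as err:
    print(err)  # Can't pickle local object 'make_pipeline.<locals>.audio_pipeline'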

I've sketched a very hacky and ugly solution below, but I've not tested it. Both pipelines use values from the dynamically loaded hparams file; the easiest workaround is to hard-code those values in the training script. For the audio pipeline this just means the sample_rate. The text pipeline is trickier because it uses the dynamically loaded / created tokenizer; I think the easiest solution here is to hard-code the tokenizer arguments and construct it at module level as well. (See the code; you need to fill in the arguments from your hparams file.)

# Define audio pipeline (assumes the training script's existing imports:
# torch, torchaudio, speechbrain as sb, and SentencePiece from
# speechbrain.tokenizers.SentencePiece):
sample_rate = ...  # TODO: hard-code the sample rate from your hparams file
@sb.utils.data_pipeline.takes("wav")
@sb.utils.data_pipeline.provides("sig")
def audio_pipeline(wav):
    info = torchaudio.info(wav)
    sig = sb.dataio.dataio.read_audio(wav)
    if info.num_channels > 1:
        sig = torch.mean(sig, dim=1)
    resampled = torchaudio.transforms.Resample(
        info.sample_rate, sample_rate,
    )(sig)
    return resampled

# Load tokenizer in global scope (ugly hack, sorry!)
# TODO: Fill in the values from hparams manually.
tokenizer = SentencePiece(
    model_dir=...,
    vocab_size=...,
    annotation_train=...,
    annotation_read="wrd",
    model_type=...,
    character_coverage=...,
)

# Define text pipeline:
@sb.utils.data_pipeline.takes("wrd")
@sb.utils.data_pipeline.provides(
    "tokens_list", "tokens_bos", "tokens_eos", "tokens"
)
def text_pipeline(wrd):
    tokens_list = tokenizer.sp.encode_as_ids(wrd)
    yield tokens_list
    # Note: hparams is loaded inside the __main__ block, so a spawned worker
    # may not see it; if that bites, hard-code bos_index and eos_index at
    # module level as well.
    tokens_bos = torch.LongTensor([hparams["bos_index"]] + tokens_list)
    yield tokens_bos
    tokens_eos = torch.LongTensor(tokens_list + [hparams["eos_index"]])
    yield tokens_eos
    tokens = torch.LongTensor(tokens_list)
    yield tokens


# Define custom data procedure
def dataio_prepare(hparams, tokenizer):
    """This function prepares the datasets to be used in the brain class.
    It also defines the data processing pipeline through user-defined functions."""

    # 1. Define datasets
    data_folder = hparams["data_folder"]

    train_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["train_csv"], replacements={"data_root": data_folder},
    )

    if hparams["sorting"] == "ascending":
        # we sort training data to speed up training and get better results.
        train_data = train_data.filtered_sorted(
            sort_key="duration",
            key_max_value={"duration": hparams["avoid_if_longer_than"]},
        )
        # when sorting, do not shuffle in the dataloader; otherwise sorting is pointless
        hparams["dataloader_options"]["shuffle"] = False

    elif hparams["sorting"] == "descending":
        train_data = train_data.filtered_sorted(
            sort_key="duration",
            reverse=True,
            key_max_value={"duration": hparams["avoid_if_longer_than"]},
        )
        # when sorting, do not shuffle in the dataloader; otherwise sorting is pointless
        hparams["dataloader_options"]["shuffle"] = False

    elif hparams["sorting"] == "random":
        pass

    else:
        raise NotImplementedError(
            "sorting must be random, ascending or descending"
        )

    valid_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["valid_csv"], replacements={"data_root": data_folder},
    )
    # We also sort the validation data so it is faster to validate
    valid_data = valid_data.filtered_sorted(sort_key="duration")

    test_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["test_csv"], replacements={"data_root": data_folder},
    )

    # We also sort the test data so it is faster to decode
    test_data = test_data.filtered_sorted(sort_key="duration")

    datasets = [train_data, valid_data, test_data]

    # 2. Add the module-level audio pipeline:
    sb.dataio.dataset.add_dynamic_item(datasets, audio_pipeline)

    # 3. Add the module-level text pipeline:
    sb.dataio.dataset.add_dynamic_item(datasets, text_pipeline)

    # 4. Set output:
    sb.dataio.dataset.set_output_keys(
        datasets, ["id", "sig", "tokens_bos", "tokens_eos", "tokens"],
    )
    return train_data, valid_data, test_data

It could well be that this does not work at all, sorry for putting the burden of testing on you.

Gastron avatar Oct 05 '21 09:10 Gastron

Is the issue solved ?

TParcollet avatar Oct 07 '21 15:10 TParcollet

@TParcollet I guess this error arose because I run the tool on Windows, so I set num_workers to zero as mentioned in https://discuss.pytorch.org/t/cant-pickle-local-object-dataloader-init-locals-lambda/31857/26. Training now starts successfully, but it takes around 24 hours to complete a single epoch. Is there any feasible way to use multiprocessing data loading on Windows?
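In case it helps anyone else on Windows, the change itself is just the following sketch (the "dataloader_options" key matches the one used in dataio_prepare above; your hparams file may use different names for the valid/test loader options, and num_workers: 0 can also be set directly in the YAML):

# Force single-process data loading: in train.py, after hparams are loaded
# and before calling asr_brain.fit(...).
hparams["dataloader_options"]["num_workers"] = 0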

monaabdelazim avatar Oct 08 '21 00:10 monaabdelazim

@Gastron thanks for your solution. I appreciate your efforts.

monaabdelazim avatar Oct 08 '21 00:10 monaabdelazim

@monaabdelazim I actually have no idea about that :-( The thing is, num_workers only helps if you have a slow reading device (HDD / SSD); it won't make any difference if reading is quick. 24 hours for Italian, however, is quite long. What is your GPU?

TParcollet avatar Oct 12 '21 09:10 TParcollet

My GPU is an NVIDIA GeForce GTX 1060.

monaabdelazim avatar Oct 13 '21 01:10 monaabdelazim

Hello,

Any updates on this issue? Is it still open?

Adel-Moumen avatar Sep 06 '22 14:09 Adel-Moumen

Hello,

There has been no activity for a very long time. Therefore, I am closing this issue.

Feel free to reopen if needed. Thanks! :)

Adel-Moumen avatar Sep 26 '22 18:09 Adel-Moumen