                        recipes/Common voice issue
I have been trying to apply the Common Voice recipe to a new language. I managed to construct the data files, but when I reached the fit step I got this error:
speechbrain.tokenizers.SentencePiece - Tokenizer is already trained.
speechbrain.tokenizers.SentencePiece - ==== Loading Tokenizer ===
speechbrain.tokenizers.SentencePiece - Tokenizer path: results/CRDNN_it/1234/save\500_unigram.model
speechbrain.tokenizers.SentencePiece - Tokenizer vocab_size: 500
speechbrain.tokenizers.SentencePiece - Tokenizer type: unigram
speechbrain.core - 148.2M trainable parameters in ASR
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
speechbrain.utils.epoch_loop - Going into epoch 1
0%|                                                                                         | 0/2140 [00:01<?, ?it/s]
Traceback (most recent call last):
File "
@Gastron Do you think this could be due to a problem in the definition of the audio pipeline? @monaabdelazim, could you please post your .py files as well?
Yeah, there is some issue with how the data loading pipeline functions get defined. If you can share the Python training script, it might be easy to spot what you need to change.
Please find the training script attached below (train.txt). Thanks in advance.
Huh, similar code actually works fine. But I think I know what the issue is. You're using the spawn method (the Windows default) for creating the background data-loading processes instead of fork (the Unix default), because fork is simply not available on Windows. The short explanation, as far as I understand it, is that fork copies everything in memory, while spawn pickles objects so they can be recreated in the new process, so things like the dynamic items created inside dataio_prepare cannot be transferred to the background process by spawn. The dynamic items should therefore be moved to module level (outside dataio_prepare).
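A tiny, self-contained illustration of the underlying limitation (not from the recipe, just a sketch): a function defined inside another function cannot be pickled, which is exactly what spawn would need to do to hand the pipelines over to a worker process.
# Demonstration: nested functions (like the pipelines inside dataio_prepare)
# cannot be pickled, so spawn-created workers cannot receive them.
import pickle

def make_pipeline():
    def local_pipeline(x):  # defined locally, as in dataio_prepare
        return x
    return local_pipeline

fn = make_pipeline()
try:
    pickle.dumps(fn)
except (pickle.PicklingError, AttributeError) as err:
    print("Cannot pickle a locally defined function:", err)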
I've sketched a very hacky and ugly solution below, but I have not tested it. Both pipelines use some things from the dynamically loaded hparams file; the easiest workaround is to just hard-code them in the training script. For the audio pipeline this only means the sample_rate. The text pipeline is trickier because it uses the dynamically loaded / created tokenizer; I think the easiest solution there is to hard-code the tokenizer arguments and put the tokenizer at module level as well. (See the code: you need to fill in the arguments from your hparams file.)
# Imports needed at module level for the hard-coded pipelines:
import torch
import torchaudio
import speechbrain as sb
from speechbrain.tokenizers.SentencePiece import SentencePiece

# Define audio pipeline:
sample_rate = ...  # TODO: fill in from hparams
@sb.utils.data_pipeline.takes("wav")
@sb.utils.data_pipeline.provides("sig")
def audio_pipeline(wav):
    info = torchaudio.info(wav)
    sig = sb.dataio.dataio.read_audio(wav)
    if info.num_channels > 1:
        # Downmix multi-channel audio to mono.
        sig = torch.mean(sig, dim=1)
    # Resample to the rate the model expects.
    resampled = torchaudio.transforms.Resample(
        info.sample_rate, sample_rate,
    )(sig)
    return resampled
# Load tokenizer in global scope (ugly hack, sorry!)
# TODO: Fill in the values from hparams manually.
tokenizer = SentencePiece(
    model_dir=...,
    vocab_size=...,
    annotation_train=...,
    annotation_read="wrd",
    model_type=...,
    character_coverage=...,
)
# Hard-code these too, since hparams is not available in spawned workers:
bos_index = ...  # TODO: fill in hparams["bos_index"]
eos_index = ...  # TODO: fill in hparams["eos_index"]
# Define text pipeline:
@sb.utils.data_pipeline.takes("wrd")
@sb.utils.data_pipeline.provides(
    "tokens_list", "tokens_bos", "tokens_eos", "tokens"
)
def text_pipeline(wrd):
    tokens_list = tokenizer.sp.encode_as_ids(wrd)
    yield tokens_list
    tokens_bos = torch.LongTensor([bos_index] + tokens_list)
    yield tokens_bos
    tokens_eos = torch.LongTensor(tokens_list + [eos_index])
    yield tokens_eos
    tokens = torch.LongTensor(tokens_list)
    yield tokens
# Define custom data procedure
def dataio_prepare(hparams, tokenizer):
    """This function prepares the datasets to be used in the brain class.
    It also defines the data processing pipeline through user-defined functions."""
    # 1. Define datasets
    data_folder = hparams["data_folder"]
    train_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["train_csv"], replacements={"data_root": data_folder},
    )
    if hparams["sorting"] == "ascending":
        # we sort training data to speed up training and get better results.
        train_data = train_data.filtered_sorted(
            sort_key="duration",
            key_max_value={"duration": hparams["avoid_if_longer_than"]},
        )
        # when sorting, do not shuffle in the dataloader; otherwise sorting is pointless
        hparams["dataloader_options"]["shuffle"] = False
    elif hparams["sorting"] == "descending":
        train_data = train_data.filtered_sorted(
            sort_key="duration",
            reverse=True,
            key_max_value={"duration": hparams["avoid_if_longer_than"]},
        )
        # when sorting, do not shuffle in the dataloader; otherwise sorting is pointless
        hparams["dataloader_options"]["shuffle"] = False
    elif hparams["sorting"] == "random":
        pass
    else:
        raise NotImplementedError(
            "sorting must be random, ascending or descending"
        )
    valid_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["valid_csv"], replacements={"data_root": data_folder},
    )
    # We also sort the validation data so it is faster to validate
    valid_data = valid_data.filtered_sorted(sort_key="duration")
    test_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["test_csv"], replacements={"data_root": data_folder},
    )
    # We also sort the test data so it is faster to test
    test_data = test_data.filtered_sorted(sort_key="duration")
    datasets = [train_data, valid_data, test_data]
    sb.dataio.dataset.add_dynamic_item(datasets, audio_pipeline)
    sb.dataio.dataset.add_dynamic_item(datasets, text_pipeline)
    # 4. Set output:
    sb.dataio.dataset.set_output_keys(
        datasets, ["id", "sig", "tokens_bos", "tokens_eos", "tokens"],
    )
    return train_data, valid_data, test_data
It could well be that this does not work at all, sorry for putting the burden of testing on you.
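If you want to quickly check whether the data loading part works at all before committing to a full run, something like this might help (again just a sketch; it assumes hparams and tokenizer are already set up as above):
# Sanity check (sketch): fetch one item directly, then pull one batch through
# a multi-worker DataLoader to confirm the pipelines survive the spawn method.
import torch
from speechbrain.dataio.batch import PaddedBatch

train_data, valid_data, test_data = dataio_prepare(hparams, tokenizer)
print(train_data[0]["sig"].shape)  # one item goes through audio_pipeline

loader = torch.utils.data.DataLoader(
    train_data, batch_size=4, num_workers=2, collate_fn=PaddedBatch
)
batch = next(iter(loader))
print(batch.sig)  # padded signals plus relative lengths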
Is the issue solved?
@TParcollet I guess this error arose because I run the tool on Windows, so I set num_workers to zero as mentioned in https://discuss.pytorch.org/t/cant-pickle-local-object-dataloader-init-locals-lambda/31857/26. The training started successfully, but it takes around 24 hours to complete a single epoch. So is there any feasible solution to run multiprocessing on Windows?
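For reference, this is roughly what I changed (option names taken from the recipe hparams, so they may differ elsewhere):
# Force single-process data loading so nothing has to be pickled for workers
# (slow, but it runs on Windows).
hparams["dataloader_options"]["num_workers"] = 0

asr_brain.fit(
    asr_brain.hparams.epoch_counter,
    train_data,
    valid_data,
    train_loader_kwargs=hparams["dataloader_options"],
    # Reusing the same options here; the recipe may define separate valid options.
    valid_loader_kwargs=hparams["dataloader_options"],
)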
@Gastron thanks for your solution. I appreciate your efforts.
@monaabdelazim I actually have no idea about that :-( The thing is, num_workers only helps if you have a slow storage device (HDD / SSD); it won't make any difference if reading is already fast. 24 hours for Italian, however, is quite long. What is your GPU?
My GPU is NVIDIA GeForce GTX 1060
Hello,
any updates on this issue? Is it still open?
Hello,
There has been no activity for a very long time. Therefore, I am closing this issue.
Feel free to reopen if needed. Thanks! :)