Can I use the fbank features already extracted by Kaldi to train with Icefall script?
For most of our speech datasets, we have already extracted fbank features with Kaldi's compute-fbank-feats. Is it possible to generate (dataset_name)_cuts_train.jsonl.gz directly from Kaldi's data-directory files (wav.scp, utt2spk, spk2utt, etc.) and the fbank features in ark format? The training scripts in icefall are tightly coupled to lhotse's feature processing, so otherwise we would end up extracting fbank features for the same dataset a second time.
Thanks!
Of course, you can. Please see https://lhotse.readthedocs.io/en/latest/kaldi.html
https://github.com/lhotse-speech/lhotse/blob/master/lhotse/kaldi.py
From https://lhotse.readthedocs.io/en/latest/kaldi.html
# Convert data/train to train_manifests/{recordings,supervisions}.json
lhotse kaldi import \
data/train \
16000 \
train_manifests
# Convert train_manifests/{recordings,supervisions}.json to data/train
lhotse kaldi export \
train_manifests/recordings.json \
train_manifests/supervisions.json \
data/train
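If you prefer to do this from Python instead of the CLI, the same conversion lives in lhotse/kaldi.py (linked above). A minimal sketch, assuming that load_kaldi_data_dir returns (recordings, supervisions, features) and that feats.scp is present in the Kaldi data dir; the paths and the 10 ms frame shift are placeholders:

from lhotse import CutSet
from lhotse.kaldi import load_kaldi_data_dir

# Import a Kaldi data dir. frame_shift is needed so that the pre-computed
# fbank matrices in feats.scp can be turned into a lhotse FeatureSet.
recordings, supervisions, features = load_kaldi_data_dir(
    "data/train",  # Kaldi data dir (placeholder)
    16000,         # sampling rate
    frame_shift=0.01,
)

# Combine the three manifests into cuts that the icefall dataloaders consume.
cuts = CutSet.from_manifests(
    recordings=recordings,
    supervisions=supervisions,
    features=features,
)
cuts.to_file("train_manifests/cuts_train.jsonl.gz")  # placeholder output path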
Thanks a lot~ I'll try it!
I have trained a transducer model with the hand-crafted cuts.jsonl.gz following the scripts in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless2. But when I decode with the command
./pruned_transducer_stateless2/decode.py \
  --simulate-streaming 1 \
  --bpe-model ./data/lang_bpe_5000/bpe.model \
  --decode-chunk-size 16 \
  --causal-convolution 1 \
  --epoch 22 \
  --avg 10 \
  --exp-dir ./pruned_transducer_stateless2/exp \
  --max-sym-per-frame 1 \
  --max-duration 100 \
  --decoding-method greedy_search
the hypothesis is the same for all utterances. I checked that each utterance has different features as input, but they all produce the same probabilities over the vocabulary. What might cause that?
Here are the steps I use to generate the final cuts.jsonl.gz.
Step 1: import the Kaldi data dir to generate features.jsonl.gz, recordings.jsonl.gz and supervisions.jsonl.gz:
lhotse kaldi import \
  -f 0.01 \
  ${librispeech_data}/test-clean \
  16000 \
  data/manifests/${name}
Step 2: run this function to combine the imported manifests into cuts:
import logging
import os
from pathlib import Path

from lhotse import CutSet
from lhotse.recipes.utils import read_manifests_if_cached

from icefall.utils import get_executor


def compute_fbank_librispeech():
    src_dir = Path("data/manifests")
    output_dir = Path("data/fbank")
    num_jobs = min(15, os.cpu_count())  # unused here; kept from the original recipe
    num_mel_bins = 80  # unused here; kept from the original recipe

    dataset_parts = ("test-clean",)
    prefix = "librispeech"
    suffix = "jsonl.gz"
    manifests = read_manifests_if_cached(
        dataset_parts=dataset_parts,
        output_dir=src_dir,
        prefix=prefix,
        suffix=suffix,
        types=("recordings", "supervisions", "features"),
    )
    assert manifests is not None

    with get_executor() as ex:  # Initialize the executor only once.
        # No feature extraction happens here: the features manifest was
        # already created by `lhotse kaldi import`, so the executor is unused.
        for partition, m in manifests.items():
            cuts_filename = f"{prefix}_cuts_{partition}.{suffix}"
            if (output_dir / cuts_filename).is_file():
                logging.info(f"{partition} already exists - skipping.")
                continue
            logging.info(f"Processing {partition}")
            cut_set = CutSet.from_manifests(
                recordings=m["recordings"],
                features=m["features"],
                supervisions=m["supervisions"],
            )
            cut_set.to_file(output_dir / cuts_filename)
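Once the cuts file exists, a quick sanity check is to load one cut and confirm that the Kaldi-imported feature matrix can be read back. This is only a small sketch; the path below assumes the file names produced by the function above:

from lhotse import CutSet

# Load the cuts generated above and inspect the first one.
cuts = CutSet.from_file("data/fbank/librispeech_cuts_test-clean.jsonl.gz")
cut = next(iter(cuts))
feats = cut.load_features()  # numpy array of shape (num_frames, num_mel_bins)
print(cut.id, feats.shape, cut.duration)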
But it also seems wrong when I use the pretrained checkpoint https://huggingface.co/pkufool/icefall_librispeech_streaming_pruned_transducer_stateless2_20220625/blob/main/exp/pretrained-epoch-24-avg-10.pt
that you provide to decode the test cuts generated as described above.
@Aurora-6
Are you using features generated by kaldi to train the model but using features generated by lhotse to test the trained model?
If that is the case, you won't get the expected recognition results.
Kaldi uses samples in the range [-32768, 32767) to extract the features, while lhotse uses samples in the range [-1, 1).
I suggest that you also use features from Kaldi to test decode.py.
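For reference, here is a rough sketch of how that sample-range difference shows up when extracting test features with kaldifeat (the scaling-to-int16 trick also mentioned later in this thread). It assumes torchaudio for reading the wav and the kaldifeat Python API; the wav path is a placeholder:

import kaldifeat
import torchaudio

# torchaudio.load() returns samples normalized to [-1, 1), whereas Kaldi's
# compute-fbank-feats reads raw int16 samples in [-32768, 32767).
wave, sample_rate = torchaudio.load("test.wav")  # placeholder path

opts = kaldifeat.FbankOptions()
opts.frame_opts.samp_freq = sample_rate
opts.frame_opts.dither = 0
opts.mel_opts.num_bins = 80
fbank = kaldifeat.Fbank(opts)

# To match a model trained on Kaldi features, scale to the int16 range first.
features_kaldi_like = fbank(wave[0] * 32768)

# To match a model trained on lhotse features, keep the [-1, 1) samples.
features_lhotse_like = fbank(wave[0])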
@csukuangfj I have a similar issue: my training corpus has both the wav files and Kaldi-extracted features (fbank80). I want the fbank80 features generated by lhotse to match Kaldi's. Apart from the different sample range, is there anything else I should consider?
Using unnormalized samples for Kaldi is the only thing that I can think of.
I can also confirm that I have had a successful experience using Kaldi features for the training dataset and kaldifeat with preliminary scaling to [-32768, 32767) for the test datasets (there was a very small ~1% relative degradation in WER compared to Kaldi features for the test datasets).
Thanks for the information.
Dither is enabled by default in kaldi. Do you also use dither with kaldifeat?
I didn't change the default behavior, so it's supposed to be enabled by default? https://github.com/csukuangfj/kaldifeat/blob/72aa5eab2b60ba1c3dc4b60be476eaf1d7816f71/kaldifeat/python/tests/test_fbank_options.py#L18
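In case it helps anyone reading later, the dither setting can also be checked and changed explicitly on the options object before building the extractor. A small sketch, assuming the kaldifeat Python API from the test file linked above:

import kaldifeat

# Print the default dither value; kaldifeat follows Kaldi's defaults,
# where dither is a non-zero value (i.e. enabled).
opts = kaldifeat.FbankOptions()
print("default dither:", opts.frame_opts.dither)

# Disable dither explicitly, e.g. for reproducible features:
opts.frame_opts.dither = 0
opts.mel_opts.num_bins = 80
fbank = kaldifeat.Fbank(opts)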