
pre-trained gigaspeech model decoding error

Open shaynemei opened this issue 3 years ago • 13 comments

Unable to decode the AMI SDM full-corpus-asr dev set.

  1. Prepared the AMI test and dev sets using this command:
lhotse prepare ami \
  --mic sdm \
  --partition full-corpus-asr \
  <path-to-downloaded-ami-corpus> \
  "data/manifests/${data_name}"
  2. Preprocessed the data with an almost identical script to this
  3. Computed features with an almost identical script to this (a minimal sketch of this step follows the list)
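
For reference, a minimal sketch of the feature-computation step using lhotse's CutSet API; the manifest paths here are hypothetical and the actual icefall script sets more options:

from lhotse import CutSet, Fbank, FbankConfig, load_manifest

# Hypothetical path; substitute the cuts produced by the preprocessing step.
cuts: CutSet = load_manifest("data/manifests/ami-sdm_cuts_dev.jsonl.gz")
cuts = cuts.compute_and_store_features(
    extractor=Fbank(FbankConfig(num_mel_bins=80)),  # 80-dim fbank, 25 ms window / 10 ms shift
    storage_path="data/fbank/ami-sdm_feats_dev",
    num_jobs=4,
)
cuts.to_file("data/fbank/ami-sdm_cuts_dev.jsonl.gz")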

After these steps, I was able to successfully decode the test set (WER: 36.62) using the pretrained model here, but I got the following error when decoding the dev set:

[screenshot: error traceback]

The encoder_out and encoder_out_lens parameters passed into modified_beam_search(), which are used as the input and lengths arguments to torch.nn.utils.rnn.pack_padded_sequence() and triggered the above error, look like this:

[screenshots: encoder_out tensor and encoder_out_lens values]
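
For context, torch.nn.utils.rnn.pack_padded_sequence raises this error whenever any entry of the lengths tensor is zero; a minimal example:

import torch
from torch.nn.utils.rnn import pack_padded_sequence

x = torch.zeros(3, 10, 80)          # (batch, time, feature)
lengths = torch.tensor([10, 5, 0])  # last utterance has zero frames
# RuntimeError: Length of all samples has to be greater than 0 ...
packed = pack_padded_sequence(x, lengths, batch_first=True)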

OS: Ubuntu 20.04.3
k2: 1.17.dev20220725+cuda10.2.torch1.7.1
lhotse: 1.5.0.dev0+git.dc5aa88.clean
python: 3.8.10

shaynemei avatar Jul 27 '22 09:07 shaynemei

Note: in the second picture, you can see that the last element is 0.

Please check your data and ensure that, after subsampling, there are still frames left. That is, you have to filter out dev utterances that are too short.
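
A minimal sketch of such a filter with lhotse (the path and threshold are hypothetical; see the frame arithmetic below for how to pick the threshold):

from lhotse import load_manifest

cuts = load_manifest("data/fbank/ami-sdm_cuts_dev.jsonl.gz")  # hypothetical path

# Keep only cuts long enough to survive the model's 4x subsampling;
# the right threshold depends on the feature frame shift (see below).
min_duration = 0.1  # seconds (hypothetical threshold)
cuts = cuts.filter(lambda c: c.duration >= min_duration)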

csukuangfj avatar Jul 27 '22 09:07 csukuangfj

I see, but how do I determine how short is too short? Or do you mean filtering out the utterances with no frames after the feature-computation step? (I assume this is where the subsampling happens.)

shaynemei avatar Jul 27 '22 09:07 shaynemei

The subsampling formula for computing the number of output frames is

num_out_frames = ((num_in_frames - 1)//2 - 1) // 2

In order to make num_out_frames > 0, you have to ensure that num_in_frames >= 7.
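
A quick check of the formula in plain Python:

def num_out_frames(num_in_frames: int) -> int:
    # 4x subsampling used by the encoder
    return ((num_in_frames - 1) // 2 - 1) // 2

for n in range(5, 9):
    print(n, "->", num_out_frames(n))
# 5 -> 0, 6 -> 0, 7 -> 1, 8 -> 1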

csukuangfj avatar Jul 27 '22 09:07 csukuangfj

By frames, are you referring to samples of a mono-channel audio signal?

shaynemei avatar Jul 27 '22 18:07 shaynemei

I did not find any audio segments with fewer than 7 samples, but I was able to complete decoding without errors after removing segments with fewer than 1000 samples. I assume I'm not understanding your previous comment correctly; can you elaborate on the definition of num_in_frames here?

shaynemei avatar Jul 27 '22 18:07 shaynemei

Some additional info: after playing around with the threshold and looking at the filtered-out segments, it looks like the longest segments causing this error have 960 samples (sample rate 16 kHz).

shaynemei avatar Jul 27 '22 18:07 shaynemei

> By frames, are you referring to samples of a mono-channel audio signal?

"Frames" here means the number of feature frames.

A feature frame is computed from 400 audio samples when the window size is 25 ms and the sample rate is 16 kHz.
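
This also explains the 960-sample observation above. With a 10 ms frame shift (160 samples at 16 kHz) and lhotse's default snip_edges=False behavior (an assumption here), the frame count is roughly round(num_samples / 160), so a 960-sample segment yields 6 feature frames, which subsample to 0:

def num_feature_frames(num_samples: int, frame_shift_samples: int = 160) -> int:
    # Kaldi-style frame count with snip_edges=False (assumed default):
    # round(num_samples / frame_shift)
    return (num_samples + frame_shift_samples // 2) // frame_shift_samples

def num_out_frames(num_in_frames: int) -> int:
    return ((num_in_frames - 1) // 2 - 1) // 2

frames = num_feature_frames(960)       # -> 6
print(frames, num_out_frames(frames))  # -> 6 0: no frames left after subsampling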

csukuangfj avatar Jul 27 '22 23:07 csukuangfj

I see, thanks. Is window size defined somewhere in the feature computation stage?

shaynemei avatar Jul 27 '22 23:07 shaynemei

~~For others' reference, the decoding window size is inferred from chunk_size and other hard-coded parameters here: https://github.com/k2-fsa/icefall/blob/d792bdc9bc5e9451ee9119954ddaa4df11167b4e/egs/librispeech/ASR/streaming_conformer_ctc/conformer.py#L274~~

Please correct me if I'm wrong.

shaynemei avatar Jul 27 '22 23:07 shaynemei

> I see, thanks. Is window size defined somewhere in the feature computation stage?

Please see https://github.com/k2-fsa/icefall/blob/67e3607863807b19037c9b371624f111e76f9c63/egs/fisher_swbd/ASR/local/compute_fbank_fisher_swbd_eval2000.py#L61

You can find the definition of FbankConfig in https://github.com/k2-fsa/icefall/blob/67e3607863807b19037c9b371624f111e76f9c63/egs/fisher_swbd/ASR/local/compute_fbank_fisher_swbd_eval2000.py#L31

https://github.com/lhotse-speech/lhotse/blob/dc5aa88b53adb73c20e3bf7210742ec3a25c809c/lhotse/features/__init__.py#L30

from .kaldi.extractors import Fbank, FbankConfig, Mfcc, MfccConfig

https://github.com/lhotse-speech/lhotse/blob/master/lhotse/features/kaldi/extractors.py#L13

from dataclasses import dataclass

from lhotse.utils import Seconds  # Seconds = float

@dataclass
class FbankConfig:
    sampling_rate: int = 16000
    frame_length: Seconds = 0.025
    frame_shift: Seconds = 0.01
    # (additional fields elided)
csukuangfj avatar Jul 27 '22 23:07 csukuangfj

> For others' reference, the decoding window size is inferred from chunk_size and other hard-coded parameters here:

What is the decoding window size? I'm not sure what you mean.

csukuangfj avatar Jul 27 '22 23:07 csukuangfj

I understand now, thanks for the clarification. I was referring to that line of code I posted.

shaynemei avatar Jul 27 '22 23:07 shaynemei

@csukuangfj In general, if a user is decoding their data with some pretrained model, perhaps the model should have a validation stage that checks whether the feature lengths are okay, instead of expecting the user to verify this (since they may not be familiar with the exact subsampling in each model type)?
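
A minimal sketch of such a check, assuming the 4x subsampling discussed above; validate_feature_lengths is a hypothetical helper, not an existing icefall function:

import torch

def validate_feature_lengths(feature_lens: torch.Tensor, min_in_frames: int = 7) -> None:
    # Hypothetical guard: fail early with an actionable message instead of
    # letting pack_padded_sequence raise deep inside decoding.
    bad = (feature_lens < min_in_frames).nonzero(as_tuple=True)[0]
    if bad.numel() > 0:
        raise ValueError(
            f"{bad.numel()} utterance(s) have fewer than {min_in_frames} feature "
            f"frames and would be empty after subsampling; indices: {bad.tolist()}. "
            f"Filter out these short utterances before decoding."
        )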

desh2608 avatar Oct 14 '22 18:10 desh2608