
fix the CTC zipformer2 training

Open KarelVesely84 opened this issue 1 year ago • 10 comments

  • too many supervision tokens
  • change the filtering rule to `if (T - 2) < len(tokens): return False` (see the sketch after this list)
  • this prevents `inf` from appearing in the CTC loss value (verified empirically)
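A rough sketch of how the rule might look in the cut filter (the function name, the `sp` tokenizer argument, and the subsampling formula follow the usual icefall `train.py` layout and are assumptions here, not the exact diff):

```python
import sentencepiece as spm
from lhotse.cut import Cut


def remove_short_and_long_utt(c: Cut, sp: spm.SentencePieceProcessor) -> bool:
    """Return False for cuts whose token sequence is too long for CTC."""
    # Number of encoder frames after the frontend's subsampling (illustrative).
    T = ((c.num_frames - 7) // 2 + 1) // 2
    tokens = sp.encode(c.supervisions[0].text, out_type=str)
    # Old rule:      if T < len(tokens): return False
    # Proposed rule: keep a 2-frame margin so the CTC loss cannot become inf.
    if (T - 2) < len(tokens):
        return False
    return True
```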

KarelVesely84 avatar Aug 12 '24 08:08 KarelVesely84

workflow with error: https://github.com/k2-fsa/icefall/actions/runs/10348851808/job/28642009312?pr=1713

fatal: unable to access 'https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15/': Recv failure: Connection reset by peer

but the file location https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15/ exists...

Maybe too many tests at the same time? (overloaded HuggingFace?)

KarelVesely84 avatar Aug 12 '24 12:08 KarelVesely84

Hi @csukuangfj, how about this one? Is @yaozengwei currently testing it?

It solves issue https://github.com/k2-fsa/icefall/issues/1352

My theory is that CTC uses 2 extra symbols at the beginning/end of the label sequence, so the label-length limit needs to be lowered by 2 symbols to accommodate that.
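For context, a generic sketch (not taken from the icefall scripts) of the minimum input length plain CTC needs for a given label sequence:

```python
def min_ctc_input_length(tokens):
    """One frame per label, plus one frame per adjacent repeated label
    (a blank must separate the repeats)."""
    repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return len(tokens) + repeats
```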

Best regards Karel

KarelVesely84 avatar Aug 14 '24 12:08 KarelVesely84

Sorry for the late reply.

Could you analyze the wave that causes inf loss? Is it too short?

Does it contain only a single word or does it contain nothing at all?

csukuangfj avatar Aug 19 '24 02:08 csukuangfj

Hi, the problematic utterance contained many words: (num_embeddings, supervision_length, difference a-b) = (34, 33, 1)

text: ['▁O', 'f', '▁all', '▁.', '▁P', 'ar', 'li', 'a', 'ment', '▁,', '▁Co', 'un', 'c', 'il', '▁and', '▁Co', 'm', 'm', 'i', 's', 's', 'ion', '▁are', '▁work', 'ing', '▁to', 'ge', 'ther', '▁to', '▁de', 'li', 'ver', '▁.']

It seems a better BPE model could reduce the number of supervision tokens. Nevertheless, that would only hide the `inf` problem for CTC.

I believe the two extra tokens for the CTC loss are the `<bos>`/`<eos>` that get prepended/appended to the supervision sequence, hence the `(T - 2)`.

Best regards Karel

KarelVesely84 avatar Aug 26 '24 10:08 KarelVesely84

> the problematic utterance contained many words:

Thanks for sharing! Could you also post the duration of the corresponding wave file?

csukuangfj avatar Aug 26 '24 10:08 csukuangfj

This is the corresponding Cut:

MonoCut(id='20180612-0900-PLENARY-3-59', start=557.34, duration=1.44, channel=0, supervisions=[SupervisionSegment(id='20180612-0900-PLENARY-3-59', recording_id='20180612-0900-PLENARY-3', start=0.0, duration=1.44, channel=0, text='Of all . Parliament , Council and Commission are working together to deliver .', language='en', speaker='None', gender='male', custom={'orig_text': 'of all. Parliament, Council and Commission are working together to deliver.'}, alignment=None)], features=Features(type='kaldi-fbank', num_frames=144, num_features=80, frame_shift=0.01, sampling_rate=16000, start=557.34, duration=1.44, storage_type='lilcom_chunky', storage_path='data/fbank/voxpopuli-asr-en-train_feats/feats-59.lca', storage_key='395124474,12987', recording_id='None', channels=0), recording=Recording(id='20180612-0900-PLENARY-3', sources=[AudioSource(type='file', channels=[0], source='/mnt/matylda6/szoke/EU-ASR/DATA/voxpopuli/raw_audios/en/2018/20180612-0900-PLENARY-3_en.ogg')], sampling_rate=16000, num_samples=139896326, duration=8743.520375, channel_ids=[0], transforms=None), custom={'dataloading_info': {'rank': 3, 'world_size': 4, 'worker_id': None}})

It is a 1.44 s cut inside a very long recording (2.42 hrs), and 1.44 s is far too little time to pronounce all the words in the reference text: "Of all . Parliament , Council and Commission are working together to deliver ."
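Rough arithmetic for this cut, assuming the usual icefall subsampling formula (the exact factor may differ):

```python
num_frames = 144                       # 1.44 s of features at a 10 ms frame shift
T = ((num_frames - 7) // 2 + 1) // 2   # -> 34 encoder frames (the num_embeddings above)
num_tokens = 33                        # BPE tokens in the reference text
print(T, num_tokens, T - num_tokens)   # 34 33 1: passes `T < len(tokens)`,
                                       # but is rejected by `(T - 2) < len(tokens)`
```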

Definitely a data issue. And if the Cut is filtered out and, as a consequence, the CTC loss stops breaking, that should be seen as a good thing...

K.

KarelVesely84 avatar Aug 30 '24 11:08 KarelVesely84

Yes, I think it is good to filter out this kind of data.

csukuangfj avatar Aug 30 '24 12:08 csukuangfj

Hello, is anything needed from my side for this to be merged? K.

KarelVesely84 avatar Sep 17 '24 08:09 KarelVesely84

The root cause is bad data. Would it be more appropriate to fix it when preparing the data?

The -2 thing is not a constraint for computing the CTC or the transducer loss.

csukuangfj avatar Sep 17 '24 13:09 csukuangfj

Well, without that `(T - 2)` change I was getting an `inf` value from the CTC loss. There should be no `inf` even if the data are prepared badly.

I also did not find any trace of the extra CTC symbols (`<bos>`/`<eos>` or similar) in the scripts. `torch.nn.functional.ctc_loss(...)` receives the same set of symbols as the transducer loss.

Could you try to reproduce the issue by adding a training example with a very lengthy transcript? (Or I can create a branch to demonstrate it, say by repeating the LibriSpeech transcript 100x, just to make the error appear.)
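Not the icefall training code, just a hypothetical standalone demonstration of the effect: when the target is longer than the number of input frames, CTC has no valid alignment and `torch.nn.functional.ctc_loss` returns `inf` (unless `zero_infinity=True`):

```python
import torch
import torch.nn.functional as F

T, N, C = 10, 1, 30   # 10 encoder frames, 1 utterance, 30 output classes
U = 20                # 20 target tokens > 10 frames -> no valid CTC alignment

log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, U), dtype=torch.long)

loss = F.ctc_loss(
    log_probs,
    targets,
    input_lengths=torch.full((N,), T, dtype=torch.long),
    target_lengths=torch.full((N,), U, dtype=torch.long),
    blank=0,
    reduction="sum",
    zero_infinity=False,
)
print(loss)  # tensor(inf)
```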

Best regards, Karel

KarelVesely84 avatar Sep 24 '24 15:09 KarelVesely84