icefall
fix the CTC zipformer2 training
- too many supervision tokens
- change the filtering rule to `if (T - 2) < len(tokens): return False`; this prevents `inf` from appearing in the CTC loss value (empirically tested)
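The rule above can be sketched as a small predicate (the function name and arguments are illustrative; the real check lives in icefall's training script):

```python
# Sketch of the PR's filtering rule (illustrative, not the exact icefall code).
def keep_cut(T, tokens):
    # T: number of encoder frames after subsampling.
    # Reserve 2 frames for the extra symbols around the label sequence;
    # otherwise the CTC loss can become inf.
    if (T - 2) < len(tokens):
        return False
    return True

# The problematic cut discussed below: 34 embeddings vs. 33 supervision tokens.
print(keep_cut(34, range(33)))  # False: 34 - 2 = 32 < 33
```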
workflow with error: https://github.com/k2-fsa/icefall/actions/runs/10348851808/job/28642009312?pr=1713
fatal: unable to access 'https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15/': Recv failure: Connection reset by peer
but the file location https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15/ exists...
maybe too many tests at the same time? (overloaded Hugging Face?)
Hi @csukuangfj , how about this one ? Is @yaozengwei testing it currently ?
It is solving the issue https://github.com/k2-fsa/icefall/issues/1352
My theory is that CTC uses 2 extra symbols at the beginning/end of the label sequence, so the label-length limit needs to be lowered by 2 symbols to accommodate them.
Best regards Karel
Sorry for the late reply.
Could you analyze the wave that causes inf loss? Is it too short?
Does it contain only a single word or does it contain nothing at all?
Hi,
the problematic utterance contained many words:
(num_embeddings, supervision_length, difference a-b) = (34, 33, 1)
text: ['▁O', 'f', '▁all', '▁.', '▁P', 'ar', 'li', 'a', 'ment', '▁,', '▁Co', 'un', 'c', 'il', '▁and', '▁Co', 'm', 'm', 'i', 's', 's', 'ion', '▁are', '▁work', 'ing', '▁to', 'ge', 'ther', '▁to', '▁de', 'li', 'ver', '▁.']
It seems like a better set of BPEs could reduce the number of supervision tokens. Nevertheless, this would only hide the `inf` problem for CTC.
I believe the two extra tokens for the CTC loss are the `<bos/eos>` that get prepended/appended to the supervision sequence, hence the (T - 2).
Best regards Karel
> the problematic utterance contained many words:
Thanks for sharing! Could you also post the duration of the corresponding wave file?
This is the corresponding Cut:
MonoCut(id='20180612-0900-PLENARY-3-59', start=557.34, duration=1.44, channel=0, supervisions=[SupervisionSegment(id='20180612-0900-PLENARY-3-59', recording_id='20180612-0900-PLENARY-3', start=0.0, duration=1.44, channel=0, text='Of all . Parliament , Council and Commission are working together to deliver .', language='en', speaker='None', gender='male', custom={'orig_text': 'of all. Parliament, Council and Commission are working together to deliver.'}, alignment=None)], features=Features(type='kaldi-fbank', num_frames=144, num_features=80, frame_shift=0.01, sampling_rate=16000, start=557.34, duration=1.44, storage_type='lilcom_chunky', storage_path='data/fbank/voxpopuli-asr-en-train_feats/feats-59.lca', storage_key='395124474,12987', recording_id='None', channels=0), recording=Recording(id='20180612-0900-PLENARY-3', sources=[AudioSource(type='file', channels=[0], source='/mnt/matylda6/szoke/EU-ASR/DATA/voxpopuli/raw_audios/en/2018/20180612-0900-PLENARY-3_en.ogg')], sampling_rate=16000, num_samples=139896326, duration=8743.520375, channel_ids=[0], transforms=None), custom={'dataloading_info': {'rank': 3, 'world_size': 4, 'worker_id': None}})
It is a 1.44 sec long cut inside a very long recording (2.42 hrs). And 1.44 sec is very little time to pronounce all the words in the reference text: "Of all . Parliament , Council and Commission are working together to deliver ."
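A back-of-the-envelope check makes the mismatch concrete. The frame shift comes from the Features field of the Cut above; the ~4x subsampling factor is an approximation of the zipformer front end, not the exact icefall code:

```python
# Rough arithmetic for this cut (subsampling factor is an assumption).
duration = 1.44                             # seconds, from the Cut
frame_shift = 0.01                          # 10 ms fbank frames
num_frames = round(duration / frame_shift)  # 144, matching num_frames above
T = num_frames // 4                         # ~4x subsampling -> ~36 frames
num_tokens = 33                             # BPE tokens in the transcript
# 33 tokens barely fit into the ~34 reported encoder frames; once CTC needs
# 2 extra symbols, there is no valid alignment left.
print(num_frames, T, num_tokens)
```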
Definitely a data issue.
And if the Cut is filtered out, and consequently the CTC stops breaking, it should be seen as a good thing...
K.
Yes, I think it is good to filter out such data.
Hello, is anything needed from my side for this to be merged? K.
The root cause is due to bad data. Would it be more appropriate to fix it when preparing the data?
The -2 thing is not a constraint for computing the CTC or the transducer loss.
Well, without that (T - 2) change I was getting an `inf` value from the CTC loss.
There should be no `inf` even if the data are prepared badly.
I also did not find any trace of the extra CTC symbols; torch.nn.functional.ctc_loss(.) is getting the same set of symbols as the transducer loss.
Could you try to reproduce the issue by adding a training example with a very lengthy transcript? (Or I can create a branch to demonstrate it, say by repeating a LibriSpeech transcript 100x, just to make the error appear.)
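A minimal standalone repro of the failure mode (illustrative, not icefall code): PyTorch's ctc_loss returns `inf` when the target is longer than the input, i.e. when no valid alignment exists, which is exactly what an over-long transcript causes:

```python
# Sketch: ctc_loss goes to inf once target_length > input_length.
import torch
import torch.nn.functional as F

T, C = 10, 5  # 10 encoder frames, 5 classes (index 0 = blank)
log_probs = torch.randn(T, 1, C).log_softmax(-1)
targets = torch.randint(1, C, (1, T + 2))  # 12 non-blank tokens > 10 frames

loss = F.ctc_loss(
    log_probs,
    targets,
    input_lengths=torch.tensor([T]),
    target_lengths=torch.tensor([T + 2]),
)
print(loss)  # -> inf
```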
Best regards, Karel