icefall
AssertionError: CutSet has cuts with duplicated IDs.
I am trying to run the Zipformer model on my custom dataset. The steps that I have followed are given below:

1. I have prepared the data by running the command
   lhotse kaldi import {train, dev, test}/ 16000 manifests/{train, dev, test}_manifest
2. I have completed the fbank extraction stage (stage 3) of the prepare.sh script, which generated the files and folders shown in the figure below.
   [screenshot of the generated files and folders]
3. After this I have prepared the BPE-based lang, which generated the folder lang_bpe_500 containing the files bpe.model, tokens.txt, transcript_word.txt, unigram_500.model, and unigram_500.vocab.
4. Finally, I have run the following command:
   ./pruned_transducer_stateless7_streaming/train.py --world-size 2 --num-epochs 30 --start-epoch 1 --use-fp16 1 --exp-dir pruned_transducer_stateless7_streaming/exp --max-duration 200 --enable-musan False

I am getting the following error:
Traceback (most recent call last):
File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/./pruned_transducer_stateless7_streaming/train.py", line 1273, in <module>
main()
File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/./pruned_transducer_stateless7_streaming/train.py", line 1264, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
while not context.join():
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
fn(i, *args)
File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/pruned_transducer_stateless7_streaming/train.py", line 1144, in run
train_one_epoch(
File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/pruned_transducer_stateless7_streaming/train.py", line 915, in train_one_epoch
valid_info = compute_validation_loss(
File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/pruned_transducer_stateless7_streaming/train.py", line 737, in compute_validation_loss
for batch_idx, batch in enumerate(valid_dl):
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
data = self._next_data()
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
return self._process_data(data)
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
data.reraise()
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
raise exception
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 53, in fetch
data = self.dataset[possibly_batched_index]
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/dataset/speech_recognition.py", line 99, in __getitem__
validate_for_asr(cuts)
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/dataset/speech_recognition.py", line 205, in validate_for_asr
validate(cuts)
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/qa.py", line 39, in validate
validator(obj, read_data=read_data)
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/qa.py", line 512, in validate_cut_set
assert ids.most_common(1)[0][1] <= 1, "CutSet has cuts with duplicated IDs."
AssertionError: CutSet has cuts with duplicated IDs.
I have also tried this with another dataset. After running the CLI mentioned in point 4, it gave me the following error:
File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/./pruned_transducer_stateless7_streaming/train.py", line 1273, in <module>
main()
File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/./pruned_transducer_stateless7_streaming/train.py", line 1264, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
while not context.join():
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
fn(i, *args)
File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/pruned_transducer_stateless7_streaming/train.py", line 1144, in run
train_one_epoch(
File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/pruned_transducer_stateless7_streaming/train.py", line 814, in train_one_epoch
loss, loss_info = compute_loss(
File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/pruned_transducer_stateless7_streaming/train.py", line 685, in compute_loss
simple_loss, pruned_loss = model(
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/pruned_transducer_stateless7_streaming/model.py", line 121, in forward
assert torch.all(x_lens > 0)
AssertionError
Hi @mukherjeesougata, I think you should delete the files with duplicated IDs before creating the manifests.
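For illustration, a minimal sketch of such a check, assuming the usual Kaldi data-dir layout with wav.scp, text, and utt2spk (the directory name below is a placeholder), could look like this:

# Sketch: list duplicated utterance IDs in a Kaldi data directory
# before the lhotse manifests are created. The directory name is a placeholder.
from collections import Counter
from pathlib import Path

data_dir = Path("train")  # repeat for "dev" and "test"

for name in ["wav.scp", "text", "utt2spk"]:
    path = data_dir / name
    if not path.exists():
        continue
    # The first whitespace-separated field of each line is the utterance ID.
    utt_ids = [line.split(maxsplit=1)[0] for line in path.read_text().splitlines() if line.strip()]
    dups = [utt for utt, cnt in Counter(utt_ids).items() if cnt > 1]
    print(f"{name}: {len(dups)} duplicated IDs", dups[:10])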
Hi,
since you are directly importing a Kaldi-format data dir, I would suggest running utils/fix_data_dir.sh (or a similarly named script; I cannot recall the exact name at the moment) to remove entries with duplicated keys to begin with.
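If the duplicates only appear after importing, another option is to drop repeated IDs from the already-imported manifests directly with lhotse. This is only a rough sketch; the manifest paths below are assumptions, not taken from your setup:

# Sketch: keep only the first cut for each ID in an imported cuts manifest.
# Input/output paths are placeholders.
from lhotse import CutSet, load_manifest_lazy

seen = set()
unique_cuts = []
for cut in load_manifest_lazy("data/fbank/Kui_cuts_train.jsonl.gz"):
    if cut.id not in seen:
        seen.add(cut.id)
        unique_cuts.append(cut)

CutSet.from_cuts(unique_cuts).to_file("data/fbank/Kui_cuts_train_dedup.jsonl.gz")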
Best, Jin
I have already used the utils/fix_data_dir.sh script to sort the train, dev, and test folders (which contain the text, wav.scp, and utt2spk files) and to remove duplicates. In addition, I have used the following code to find duplicate IDs in Kui_cuts_train.jsonl, Kui_cuts_dev.jsonl, and Kui_cuts_test.jsonl:
import json
from collections import Counter

# Path to one of the cuts manifests; the same check is repeated for train, dev, and test.
file_path = '/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/data/unzipped_files/Kui_cuts_test.jsonl'

# Read the JSONL file and extract the cut IDs.
ids = []
with open(file_path, 'r') as file:
    for line in file:
        data = json.loads(line)
        if 'id' in data:
            ids.append(data['id'])

# Identify duplicated IDs.
id_counts = Counter(ids)
print(id_counts)
duplicates = [id_ for id_, count in id_counts.items() if count > 1]
print(duplicates, len(duplicates))
The above code gives the output "[] 0", which means that Kui_cuts_train.jsonl, Kui_cuts_dev.jsonl, and Kui_cuts_test.jsonl do not contain any duplicate IDs.
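For reference, the equivalent check can also be run on the gzipped cuts manifests that train.py actually loads, in case they differ from the unzipped copies. The paths below are assumptions based on the usual icefall data/fbank layout:

# Sketch: count duplicated cut IDs in the gzipped manifests used for training.
# Paths are assumptions, not taken from the actual setup.
from collections import Counter
from lhotse import load_manifest_lazy

for part in ["train", "dev", "test"]:
    path = f"data/fbank/Kui_cuts_{part}.jsonl.gz"
    counts = Counter(cut.id for cut in load_manifest_lazy(path))
    dups = [cut_id for cut_id, cnt in counts.items() if cnt > 1]
    print(path, len(dups), dups[:10])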