
AssertionError: CutSet has cuts with duplicated IDs.

Open mukherjeesougata-eros opened this issue 11 months ago • 4 comments

I am trying to run the Zipformer model on my custom dataset. The steps I followed are given below:

  1. I prepared the data by running the command lhotse kaldi import {train, dev, test}/ 16000 manifests/{train, dev, test}_manifest.

  2. I completed the fbank extraction stage (stage 3) of the prepare.sh script, which generated the files and folders shown in the figure below (screenshot: Zipformer_fbank_Kui).

  3. After this I prepared a BPE-based lang, which generated the folder lang_bpe_500 containing the files bpe.model, tokens.txt, transcript_word.txt, unigram_500.model, and unigram_500.vocab.

  4. Finally I ran the following CLI: ./pruned_transducer_stateless7_streaming/train.py --world-size 2 --num-epochs 30 --start-epoch 1 --use-fp16 1 --exp-dir pruned_transducer_stateless7_streaming/exp --max-duration 200 --enable-musan False
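The usual source of the duplicated-ID error reported below is duplicated keys in the Kaldi tables that were imported in step 1. As a sanity check before the import, a minimal stdlib sketch like the following could flag them (the directory name "train" here is hypothetical; substitute the actual data dir):

```python
from collections import Counter
from pathlib import Path

def duplicate_keys(scp_path):
    """Return the keys that appear more than once in a Kaldi-style
    key/value table such as wav.scp, text, or utt2spk."""
    keys = [line.split(maxsplit=1)[0]
            for line in Path(scp_path).read_text().splitlines()
            if line.strip()]
    counts = Counter(keys)
    return sorted(k for k, n in counts.items() if n > 1)

# Check every table in a data dir before `lhotse kaldi import`.
# The path "train" is a placeholder for the real Kaldi data dir.
for table in ("wav.scp", "text", "utt2spk"):
    path = Path("train") / table
    if path.exists():
        dups = duplicate_keys(path)
        if dups:
            print(f"{table}: {len(dups)} duplicated keys, e.g. {dups[:5]}")
```

This only checks one data dir at a time; run it for train, dev, and test separately.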

I am getting the following error:

Traceback (most recent call last):
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/./pruned_transducer_stateless7_streaming/train.py", line 1273, in <module>
    main()
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/./pruned_transducer_stateless7_streaming/train.py", line 1264, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/pruned_transducer_stateless7_streaming/train.py", line 1144, in run
    train_one_epoch(
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/pruned_transducer_stateless7_streaming/train.py", line 915, in train_one_epoch
    valid_info = compute_validation_loss(
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/pruned_transducer_stateless7_streaming/train.py", line 737, in compute_validation_loss
    for batch_idx, batch in enumerate(valid_dl):
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 53, in fetch
    data = self.dataset[possibly_batched_index]
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/dataset/speech_recognition.py", line 99, in __getitem__
    validate_for_asr(cuts)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/dataset/speech_recognition.py", line 205, in validate_for_asr
    validate(cuts)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/qa.py", line 39, in validate
    validator(obj, read_data=read_data)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/qa.py", line 512, in validate_cut_set
    assert ids.most_common(1)[0][1] <= 1, "CutSet has cuts with duplicated IDs."
AssertionError: CutSet has cuts with duplicated IDs.

mukherjeesougata-eros avatar Dec 28 '24 19:12 mukherjeesougata-eros

I have also tried another dataset. Running the same CLI as in point 4 gives me the following error:

  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/./pruned_transducer_stateless7_streaming/train.py", line 1273, in <module>
    main()
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/./pruned_transducer_stateless7_streaming/train.py", line 1264, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/pruned_transducer_stateless7_streaming/train.py", line 1144, in run
    train_one_epoch(
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/pruned_transducer_stateless7_streaming/train.py", line 814, in train_one_epoch
    loss, loss_info = compute_loss(
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/pruned_transducer_stateless7_streaming/train.py", line 685, in compute_loss
    simple_loss, pruned_loss = model(
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/pruned_transducer_stateless7_streaming/model.py", line 121, in forward
    assert torch.all(x_lens > 0)
AssertionError
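For this second error: `assert torch.all(x_lens > 0)` typically fires when at least one utterance is so short that the encoder's convolutional subsampling frontend leaves zero output frames. The exact formula depends on the model version, but as an assumption the stateless7-style frontend reduces T feature frames to roughly (T - 7) // 2, which the following stdlib sketch illustrates:

```python
def frames_after_subsampling(num_feature_frames: int) -> int:
    # Rough, assumed model of the Conv2dSubsampling frontend in the
    # pruned_transducer_stateless7 recipes: approximately (T - 7) // 2.
    # The precise formula differs between model versions.
    return (num_feature_frames - 7) // 2

# With fbank features at ~100 frames/s, a 0.07 s utterance yields only
# 7 feature frames, leaving 0 encoder frames, which would trip the
# `assert torch.all(x_lens > 0)` check above. Longer utterances are fine.
for dur_s in (0.07, 0.2, 1.0):
    t = int(dur_s * 100)  # assumed fbank frame rate of 100 frames/s
    print(f"{dur_s}s -> {t} feature frames -> {frames_after_subsampling(t)} encoder frames")
```

If this is the cause, filtering out extremely short (or empty) cuts from the manifests before training should make the assertion pass.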

mukherjeesougata-eros avatar Dec 28 '24 20:12 mukherjeesougata-eros

Hi @mukherjeesougata, I think you should delete the files with duplicate IDs before creating the manifest.

hosythach-jelly avatar Jan 03 '25 02:01 hosythach-jelly

Hi,

Since you are directly importing a Kaldi-format data dir, I suggest using utils/fix_data_dir.sh (or a similarly named script; I cannot recall the exact name at the moment) to remove entries with duplicated keys to begin with.

Best,
Jin
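In the same spirit as the suggestion above, the duplicate-removal part of what utils/fix_data_dir.sh does can be sketched in a few lines of stdlib Python: rewrite each Kaldi key/value table keeping only the first line seen for each key. This is a rough approximation, not a substitute for the real script (which also sorts and cross-filters the tables):

```python
from pathlib import Path

def drop_duplicate_keys(path):
    """Rewrite a Kaldi key/value table (wav.scp, text, utt2spk) in place,
    keeping only the first line seen for each key. A rough stand-in for
    the duplicate handling in Kaldi's utils/fix_data_dir.sh."""
    seen, kept = set(), []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        key = line.split(maxsplit=1)[0]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    Path(path).write_text("\n".join(kept) + "\n")
    return len(kept)  # number of lines remaining after deduplication
```

Running this over wav.scp, text, and utt2spk for each split before `lhotse kaldi import` should prevent duplicated keys from ever reaching the CutSet.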

JinZr avatar Jan 07 '25 07:01 JinZr

I have already used the utils/fix_data_dir.sh script to sort the train, dev, and test folders (each containing the text, wav.scp, and utt2spk files) and remove duplicates. In addition, I used the following code to find duplicate IDs in Kui_cuts_train.jsonl, Kui_cuts_dev.jsonl, and Kui_cuts_test.jsonl:

import json
from collections import Counter

file_path = '/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/data/unzipped_files/Kui_cuts_test.jsonl'

# Read the JSONL manifest and collect the cut IDs.
ids = []
with open(file_path, 'r') as file:
    for line in file:
        data = json.loads(line)
        if 'id' in data:
            ids.append(data['id'])

# Report any IDs that occur more than once.
id_counts = Counter(ids)
duplicates = [id_ for id_, count in id_counts.items() if count > 1]
print(duplicates, len(duplicates))

The above code prints [] 0, which means that, checked file by file, Kui_cuts_train.jsonl, Kui_cuts_dev.jsonl, and Kui_cuts_test.jsonl do not contain any duplicate IDs.
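One case the per-file check cannot catch: an ID that is unique within each manifest can still collide when several manifests are combined into one CutSet at load time. It may also be worth counting supervision IDs, since lhotse validates those too. A sketch extending the script above to check across all three files at once (the paths in the commented example are placeholders):

```python
import json
from collections import Counter
from pathlib import Path

def collect_ids(jsonl_paths):
    """Count cut IDs and supervision IDs across several lhotse cut
    manifests combined, so cross-file collisions are also detected."""
    cut_ids, sup_ids = Counter(), Counter()
    for p in jsonl_paths:
        for line in Path(p).read_text().splitlines():
            if not line.strip():
                continue
            cut = json.loads(line)
            cut_ids[cut["id"]] += 1
            for sup in cut.get("supervisions", []):
                sup_ids[sup["id"]] += 1
    return cut_ids, sup_ids

# Hypothetical usage; substitute the real Kui_cuts_*.jsonl locations:
# cut_ids, sup_ids = collect_ids(["Kui_cuts_train.jsonl",
#                                 "Kui_cuts_dev.jsonl",
#                                 "Kui_cuts_test.jsonl"])
# print("duplicate cuts:", [i for i, n in cut_ids.items() if n > 1])
# print("duplicate sups:", [i for i, n in sup_ids.items() if n > 1])
```

If this also reports no duplicates, the duplication likely happens after manifest creation, e.g. when the dataloader concatenates or repeats cuts.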

mukherjeesougata-eros avatar Jan 11 '25 09:01 mukherjeesougata-eros