
Issues when I try to run the yesno recipe for testing purposes

Open umbertocappellazzo opened this issue 1 year ago • 12 comments

Hi! After installing all the dependencies for icefall, I tried to run the yesno recipe to check whether the installation was successful. During the data preparation stage, I get this error:

stek@99a34efcadb3:/cappellazzo/icefall_repo/icefall/egs/yesno/ASR$ ./prepare.sh
2023-06-02 09:47:46 (prepare.sh:27:main) dl_dir: /cappellazzo/icefall_repo/icefall/egs/yesno/ASR/download
2023-06-02 09:47:46 (prepare.sh:30:main) Stage 0: Download data
/cappellazzo/icefall_repo/icefall/egs/yesno/ASR/download/waves_yesno.tar.gz: 100%|███| 4.70M/4.70M [00:00<00:00, 30.5MB/s]
2023-06-02 09:47:48 (prepare.sh:39:main) Stage 1: Prepare yesno manifest
2023-06-02 09:47:50 (prepare.sh:45:main) Stage 2: Compute fbank for yesno
2023-06-02 09:47:52,082 INFO [compute_fbank_yesno.py:65] Processing train
Extracting and storing features: 100%|███████████████████████████████████████████████████| 90/90 [00:00<00:00, 256.03it/s]
2023-06-02 09:47:52,457 INFO [compute_fbank_yesno.py:65] Processing test
Extracting and storing features: 100%|███████████████████████████████████████████████████| 30/30 [00:00<00:00, 354.99it/s]
2023-06-02 09:47:52 (prepare.sh:51:main) Stage 3: Prepare lang
2023-06-02 09:47:55 (prepare.sh:63:main) Stage 4: Prepare G
/project/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):79
[I] Reading \data\ section.
/project/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \1-grams: section.
2023-06-02 09:47:55 (prepare.sh:89:main) Stage 5: Compile HLG
2023-06-02 09:47:56,632 INFO [compile_hlg.py:124] Processing data/lang_phone
2023-06-02 09:47:56,633 INFO [lexicon.py:171] Converting L.pt to Linv.pt
2023-06-02 09:47:56,638 INFO [compile_hlg.py:48] Building ctc_topo. max_token_id: 3
2023-06-02 09:47:56,638 INFO [compile_hlg.py:52] Loading G.fst.txt
2023-06-02 09:47:56,639 INFO [compile_hlg.py:62] Intersecting L and G
2023-06-02 09:47:56,639 INFO [compile_hlg.py:64] LG shape: (4, None)
2023-06-02 09:47:56,639 INFO [compile_hlg.py:66] Connecting LG
2023-06-02 09:47:56,639 INFO [compile_hlg.py:68] LG shape after k2.connect: (4, None)
2023-06-02 09:47:56,640 INFO [compile_hlg.py:70] <class 'torch.Tensor'>
2023-06-02 09:47:56,640 INFO [compile_hlg.py:71] Determinizing LG
2023-06-02 09:47:56,640 INFO [compile_hlg.py:74] <class '_k2.ragged.RaggedTensor'>
2023-06-02 09:47:56,640 INFO [compile_hlg.py:76] Connecting LG after k2.determinize
2023-06-02 09:47:56,640 INFO [compile_hlg.py:79] Removing disambiguation symbols on LG
2023-06-02 09:47:56,641 INFO [compile_hlg.py:91] LG shape after k2.remove_epsilon: (6, None)
Traceback (most recent call last):
  File "/cappellazzo/icefall_repo/icefall/egs/yesno/ASR/./local/compile_hlg.py", line 136, in <module>
    main()
  File "/cappellazzo/icefall_repo/icefall/egs/yesno/ASR/./local/compile_hlg.py", line 126, in main
    HLG = compile_HLG(lang_dir)
  File "/cappellazzo/icefall_repo/icefall/egs/yesno/ASR/./local/compile_hlg.py", line 93, in compile_HLG
    LG = k2.connect(LG)
  File "/opt/conda/lib/python3.10/site-packages/k2/fsa_algo.py", line 522, in connect
    if fsa.properties & fsa_properties.ACCESSIBLE != 0 and \
  File "/opt/conda/lib/python3.10/site-packages/k2/fsa.py", line 446, in properties
    raise RuntimeError(
RuntimeError: The fsa attribute (labels) has been inappropriately modified like:
    fsa.labels[xxx] = yyy
The correct way should be like:
    labels = fsa.labels
    labels[xxx] = yyy
    fsa.labels = labels
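
For reference, the pattern the error message is describing can be reproduced on a toy FSA like this (a minimal sketch using k2.linear_fsa; the actual fix in this thread turned out to be upgrading k2, see below):

    import k2

    # Build a small FSA whose labels we want to edit.
    fsa = k2.linear_fsa([1, 2, 3])

    # An in-place write such as `fsa.labels[0] = 5` is what the error above
    # complains about. The supported pattern is to copy, edit, and assign back:
    labels = fsa.labels.clone()
    labels[0] = 5
    fsa.labels = labels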

Any idea on how to fix it? I don't know whether it depends on my installation or it is an issue with the recipe itself. Thank you.

umbertocappellazzo avatar Jun 02 '23 14:06 umbertocappellazzo

Please install the latest k2.

csukuangfj avatar Jun 02 '23 14:06 csukuangfj

I used the following command to install k2, but it does not install the latest version.

$ conda install -c k2-fsa -c pytorch -c nvidia k2 pytorch=1.13.0 pytorch-cuda=11.7 python=3.8

It installs version 1.23.4, not 1.24.1.

umbertocappellazzo avatar Jun 02 '23 16:06 umbertocappellazzo

We have prebuilt wheels. You can find them in the installation doc. Please use pip install or compile k2 from source.
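
After reinstalling with pip, a quick sanity check (plain Python introspection, nothing k2-specific) is to verify which installation actually gets imported, so that a stale conda copy does not shadow the new wheel:

    import k2

    # The printed path should point at the pip-installed package,
    # not at a leftover conda installation.
    print(k2.__file__)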

csukuangfj avatar Jun 02 '23 20:06 csukuangfj

I opted for the conda installation because the official documentation recommends installing k2 with conda.

umbertocappellazzo avatar Jun 02 '23 21:06 umbertocappellazzo

Sorry, the conda packages have not been updated.

csukuangfj avatar Jun 03 '23 00:06 csukuangfj

In the end I managed to install it via pip. However, I would recommend updating the installation instructions, or at least marking pip rather than conda as the recommended method; it would save time for future users.

umbertocappellazzo avatar Jun 03 '23 16:06 umbertocappellazzo

After carrying out the data preparation for LibriSpeech (which succeeded), I'm trying to run the training stage for conformer_ctc (./conformer_ctc/train.py --full-libri False --num-epochs 30), but I get this error:

2023-06-06 13:24:41,697 INFO [train.py:770] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
[W] /stek/matasso/k2/build/temp.linux-x86_64-cpython-310/k2/csrc/pytorch_context.cc:81:void k2::InitHasCuda() k2 was not compiled with CUDA. Return a CPU context.
Segmentation fault (core dumped)

Any solution to this error? Can it be related to the installation phase?

umbertocappellazzo avatar Jun 06 '23 13:06 umbertocappellazzo

The log says you have installed a CPU version of k2. Please switch to a CUDA version.
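
To double-check this from Python, a minimal snippet (assuming k2 exposes the with_cuda flag that icefall reports as 'k2-with-cuda' in its environment info) would be:

    import torch
    import k2

    # Both should be True for GPU training.
    print("torch CUDA available:", torch.cuda.is_available())
    print("k2 built with CUDA:", k2.with_cuda)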

csukuangfj avatar Jun 06 '23 13:06 csukuangfj

Please show the output of

python3 -m k2.version

and please describe how you installed k2.

csukuangfj avatar Jun 06 '23 14:06 csukuangfj

My colleague and I started over from a fresh PyTorch Docker image, installed all the dependencies again, and now it works. Most likely our multiple attempts to install k2 a few days ago left some conflicts and the CUDA libraries were not visible to k2, or something similar. By the way, I'm now running a LibriSpeech recipe and it's working properly. I'll close the issue in a few days if everything goes smoothly. Thank you!

umbertocappellazzo avatar Jun 07 '23 08:06 umbertocappellazzo

Hi, I ran the conformer_ctc recipe with the following command:

./conformer_ctc/train.py --full-libri False --num-epochs 30 

and both the training and the decoding steps were fine. With this configuration the architecture includes a transformer decoder, but now I'd like to re-run the same recipe with CTC only, i.e. without the decoder. I tried this:

./conformer_ctc/train.py --num-decoder-layers 0 --full-libri False --num-epochs 30

because the --help text for --num-decoder-layers says "Setting this to 0 will not create the decoder at all (pure CTC model)". However, I get this error:

root@4677c86aefb3:/stek/cappellazzo/JSALT2023/icefall/egs/librispeech/ASR# ./conformer_ctc/train.py --num-decoder-layers 0 --full-libri False --num-epochs 30
fatal: detected dubious ownership in repository at '/stek/cappellazzo/JSALT2023/icefall'
To add an exception for this directory, call:

	git config --global --add safe.directory /stek/cappellazzo/JSALT2023/icefall
fatal: detected dubious ownership in repository at '/stek/cappellazzo/JSALT2023/icefall'
To add an exception for this directory, call:

	git config --global --add safe.directory /stek/cappellazzo/JSALT2023/icefall
fatal: detected dubious ownership in repository at '/stek/cappellazzo/JSALT2023/icefall'
To add an exception for this directory, call:

	git config --global --add safe.directory /stek/cappellazzo/JSALT2023/icefall
2023-06-08 12:01:38,958 INFO [train.py:610] Training started
2023-06-08 12:01:38,959 INFO [train.py:611] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'use_feat_batchnorm': True, 'attention_dim': 512, 'nhead': 8, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'weight_decay': 1e-06, 'warm_step': 80000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '', 'k2-git-date': '', 'lhotse-version': '1.16.0.dev+git.cf4446d.clean', 'torch-version': '2.0.1', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.1', 'icefall-git-branch': None, 'icefall-git-sha1': None, 'icefall-git-date': None, 'icefall-path': '/stek/cappellazzo/JSALT2023/icefall', 'k2-path': '/opt/conda/lib/python3.10/site-packages/k2-1.24.3.dev20230606+cuda11.7.torch2.0.1-py3.10-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/opt/conda/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': '4677c86aefb3', 'IP address': '172.17.0.16'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 0, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'att_rate': 0.8, 'num_decoder_layers': 0, 'lr_factor': 5.0, 'seed': 42, 'full_libri': False, 'mini_libri': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures'}
2023-06-08 12:01:39,281 INFO [lexicon.py:168] Loading pre-compiled data/lang_bpe_500/Linv.pt
2023-06-08 12:01:39,675 INFO [train.py:659] About to create model
2023-06-08 12:01:41,703 INFO [asr_datamodule.py:413] About to get train-clean-100 cuts
2023-06-08 12:01:41,704 INFO [asr_datamodule.py:232] Enable MUSAN
2023-06-08 12:01:41,704 INFO [asr_datamodule.py:233] About to get Musan cuts
2023-06-08 12:01:43,705 INFO [asr_datamodule.py:257] Enable SpecAugment
2023-06-08 12:01:43,705 INFO [asr_datamodule.py:258] Time warp factor: 80
2023-06-08 12:01:43,705 INFO [asr_datamodule.py:268] Num frame mask: 10
2023-06-08 12:01:43,706 INFO [asr_datamodule.py:281] About to create train dataset
2023-06-08 12:01:43,706 INFO [asr_datamodule.py:308] Using DynamicBucketingSampler.
2023-06-08 12:01:46,161 INFO [asr_datamodule.py:323] About to create train dataloader
2023-06-08 12:01:46,162 INFO [asr_datamodule.py:451] About to get dev-clean cuts
2023-06-08 12:01:46,163 INFO [asr_datamodule.py:458] About to get dev-other cuts
2023-06-08 12:01:46,163 INFO [asr_datamodule.py:354] About to create dev dataset
2023-06-08 12:01:46,355 INFO [asr_datamodule.py:371] About to create dev dataloader
2023-06-08 12:01:46,355 INFO [train.py:770] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
Traceback (most recent call last):
  File "/stek/cappellazzo/JSALT2023/icefall/egs/librispeech/ASR/./conformer_ctc/train.py", line 819, in <module>
    main()
  File "/stek/cappellazzo/JSALT2023/icefall/egs/librispeech/ASR/./conformer_ctc/train.py", line 812, in main
    run(rank=0, world_size=1, args=args)
  File "/stek/cappellazzo/JSALT2023/icefall/egs/librispeech/ASR/./conformer_ctc/train.py", line 714, in run
    scan_pessimistic_batches_for_oom(
  File "/stek/cappellazzo/JSALT2023/icefall/egs/librispeech/ASR/./conformer_ctc/train.py", line 778, in scan_pessimistic_batches_for_oom
    loss, _ = compute_loss(
  File "/stek/cappellazzo/JSALT2023/icefall/egs/librispeech/ASR/./conformer_ctc/train.py", line 424, in compute_loss
    att_loss = mmodel.decoder_forward(
  File "/stek/cappellazzo/JSALT2023/icefall/egs/librispeech/ASR/conformer_ctc/transformer.py", line 287, in decoder_forward
    tgt = self.decoder_embed(ys_in_pad)  # (N, T) -> (N, T, C)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Conformer' object has no attribute 'decoder_embed'. Did you mean: 'encoder_embed'?

Any clue? I don't know whether setting --num-decoder-layers 0 is enough to run pure CTC training.

umbertocappellazzo avatar Jun 08 '23 12:06 umbertocappellazzo

To run pure-CTC training, both --num-decoder-layers and --att-rate MUST be set to 0 and 0.0, respectively. Setting only --num-decoder-layers 0 results in the error I attached above.
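
The reason --att-rate matters too can be seen from a simplified sketch of how the recipe mixes the two losses (hypothetical names paraphrasing conformer_ctc/train.py, not the actual code):

    def combine_losses(ctc_loss, att_loss_fn, att_rate):
        # Sketch only: when att_rate != 0 the attention (decoder) loss is
        # also needed, so the transformer decoder must exist. With
        # --num-decoder-layers 0 the decoder is never built, and computing
        # the attention loss fails with the AttributeError shown above.
        if att_rate != 0.0:
            att_loss = att_loss_fn()  # would call model.decoder_forward(...)
            return (1.0 - att_rate) * ctc_loss + att_rate * att_loss
        # Pure CTC: pair --num-decoder-layers 0 with --att-rate 0
        return ctc_loss

So the working command is along the lines of ./conformer_ctc/train.py --num-decoder-layers 0 --att-rate 0 --full-libri False --num-epochs 30.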

umbertocappellazzo avatar Jun 09 '23 08:06 umbertocappellazzo