
Need help with instructions to reproduce experiments

Open jiwidi opened this issue 3 years ago • 10 comments

Hi Hiro!

First, thank you for the repo. I've been following it for a while and have seen you implement a large number of DL architectures.

So far I had only been watching the repo from time to time, but now I'd like to see if I can reproduce some results and eventually use it with custom datasets. I tried to reproduce the LibriSpeech experiment without success and need some help with it.

I went ahead and followed the installation instructions:

# Set path to CUDA, NCCL
CUDAROOT=/usr/local/cuda
NCCL_ROOT=/usr/local/nccl

export CPATH=$NCCL_ROOT/include:$CPATH
export LD_LIBRARY_PATH=$NCCL_ROOT/lib/:$CUDAROOT/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=$NCCL_ROOT/lib/:$LIBRARY_PATH
export CUDA_HOME=$CUDAROOT
export CUDA_PATH=$CUDAROOT
export CPATH=$CUDA_PATH/include:$CPATH  # for warp-rnnt

# Install miniconda, python libraries, and other tools
cd tools
make 
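For anyone following along, a quick sanity check before running `make` can save a failed rebuild. This is only a sketch, not part of the official instructions; the tool list and the NCCL header location are assumptions based on the exports above:

```shell
# Hypothetical pre-flight check: confirm the toolchain is on PATH and the
# NCCL headers exist where CPATH points (paths come from the exports above).
for tool in nvcc cmake make gcc; do
  command -v "$tool" >/dev/null 2>&1 && echo "found: $tool" || echo "MISSING: $tool"
done
[ -f "$NCCL_ROOT/include/nccl.h" ] && echo "NCCL headers OK" \
  || echo "NCCL headers not found under $NCCL_ROOT"
```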

Kaldi complained about a few missing libraries, but after installing them manually the make command ran successfully. After this, a conda environment was created under my path: /mnt/kingston/github/neural_sp/tools/miniconda. I activated it with conda activate /mnt/kingston/github/neural_sp/tools/miniconda and proceeded to run

cd examples/librispeech/s5/
sh run.sh

But got the following output:

============================================================================
                                LibriSpeech                               
============================================================================
run.sh: 14: ./path.sh: source: not found
run.sh: 34: utils/parse_options.sh: Syntax error: Bad for loop variable

Have I missed an important part of the installation process? Do you have a more detailed list of steps I should follow to reproduce the results? Any help would be very much appreciated, thanks.

jiwidi avatar Dec 25 '20 18:12 jiwidi

@jiwidi I'll fix Makefile. Please retry it after the next PR.

hirofumi0810 avatar Dec 28 '20 22:12 hirofumi0810

@hirofumi0810 Hi again,

So I tried to run the same steps as in the original post, and now I'm stuck at the warp-rnnt make step. My output is:

git clone https://github.com/HawkAaron/warp-transducer.git /mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer
Cloning into '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 905 (delta 1), reused 5 (delta 1), pack-reused 894
Receiving objects: 100% (905/905), 248.13 KiB | 622.00 KiB/s, done.
Resolving deltas: 100% (462/462), done.
# Note: Requires gcc>=5.0 to build extensions with pytorch>=1.0
if . /mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/bin/activate && python -c 'import torch as t;assert t.__version__[0] == "1"' &> /dev/null; then \
        . /mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/bin/activate && python -c "from distutils.version import LooseVersion as V;assert V('10.2.0') >= V('5.0'), 'Requires gcc>=5.0'"; \
fi
. /mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/bin/activate; cd /mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer && mkdir build && cd build && cmake .. && make; true
-- The C compiler identification is GNU 10.2.0
-- The CXX compiler identification is GNU 10.2.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "11.1") 
-- cuda found TRUE
-- Building shared library with GPU support
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build
make[1]: Entering directory '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build'
make[2]: Entering directory '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build'
make[3]: Entering directory '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build'
[  7%] Building NVCC (Device) object CMakeFiles/warprnnt.dir/src/warprnnt_generated_rnnt_entrypoint.cu.o
nvcc fatal   : Unsupported gpu architecture 'compute_30'
CMake Error at warprnnt_generated_rnnt_entrypoint.cu.o.cmake:220 (message):
  Error generating
  /mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build/CMakeFiles/warprnnt.dir/src/./warprnnt_generated_rnnt_entrypoint.cu.o


make[3]: *** [CMakeFiles/warprnnt.dir/build.make:65: CMakeFiles/warprnnt.dir/src/warprnnt_generated_rnnt_entrypoint.cu.o] Error 1
make[3]: Leaving directory '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build'
make[2]: *** [CMakeFiles/Makefile2:191: CMakeFiles/warprnnt.dir/all] Error 2
make[2]: Leaving directory '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build'
make[1]: *** [Makefile:130: all] Error 2
make[1]: Leaving directory '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build'
. /mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/bin/activate; cd /mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/pytorch_binding && python setup.py install
Could not find libwarprnnt.so in ../build.
Build warp-rnnt and set WARP_RNNT_PATH to the location of libwarprnnt.so (default is '../build')
make: *** [Makefile:93: warp-transducer.done] Error 1

Seems like the error is

nvcc fatal   : Unsupported gpu architecture 'compute_30'

I have an RTX 3090 from the latest NVIDIA generation; do you know if this repo has been updated to compile for it? Also, since I want to test the LAS and Transformer architectures on the LibriSpeech recipe, I think I won't need the transducer, right? Is there any way to skip this step?

Thanks

jiwidi avatar Dec 30 '20 12:12 jiwidi

I found a PR on that repo addressing the compute_30 issue, https://github.com/HawkAaron/warp-transducer/pull/76; I'll give it a try and come back.

EDIT: Managed to compile it with the branch at https://github.com/ncilfone/warp-transducer/tree/3691b3fa5483e911645738a7894c48fe1f116c9b.

I also discovered that I can't run the script with sh run.sh, since it hits the same error:

============================================================================
                                LibriSpeech                               
============================================================================
run.sh: 14: ./path.sh: source: not found
run.sh: 34: utils/parse_options.sh: Syntax error: Bad for loop variable

It has to be run with ./run.sh --gpu 1. This downloads all the data and does some preprocessing, but it stops during data prep; the script just exits with no error.
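For context, the earlier `sh run.sh` failure is a dash-vs-bash issue: on Debian/Ubuntu, `/bin/sh` is dash, a POSIX shell that lacks two bashisms these Kaldi-style scripts rely on, which is why running the script through its own bash shebang works. A minimal illustration (run with bash):

```shell
#!/usr/bin/env bash
# Both of these work in bash but fail under dash, matching the two errors above:
source /dev/null                # dash: "source: not found" (POSIX sh only has ".")
for ((i = 1; i <= 2; i++)); do  # dash: "Bad for loop variable" (C-style loop is bash-only)
  echo "pass $i"
done
```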

It fails on data_prep.sh:

    for part in dev-clean test-clean dev-other test-other train-clean-100 train-clean-360 train-other-500; do
        # use underscore-separated names in data directories.
        local/data_prep.sh ${data_download_path}/LibriSpeech/${part} ${data}/$(echo ${part} | sed s/-/_/g) || exit 1;
    done
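(For clarity, the `sed` in that loop just converts the hyphenated LibriSpeech part names into the underscore-separated data-directory names:)

```shell
# Maps a LibriSpeech part name to its data-directory name:
echo "train-clean-100" | sed s/-/_/g   # prints: train_clean_100
```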

Specifically on utils/validate_data_dir.sh --no-feats $dst || exit 1;

But it doesn't print any specific error or complaint. The full run.sh output:

============================================================================
                                LibriSpeech                               
============================================================================
============================================================================
                       Data Preparation (stage:0)                          
============================================================================
local/download_and_untar.sh: data part dev-clean was already successfully extracted, nothing to do.
local/download_and_untar.sh: data part test-clean was already successfully extracted, nothing to do.
local/download_and_untar.sh: data part dev-other was already successfully extracted, nothing to do.
local/download_and_untar.sh: data part test-other was already successfully extracted, nothing to do.
local/download_and_untar.sh: data part train-clean-100 was already successfully extracted, nothing to do.
local/download_and_untar.sh: data part train-clean-360 was already successfully extracted, nothing to do.
local/download_and_untar.sh: data part train-other-500 was already successfully extracted, nothing to do.
Downloading file '3-gram.arpa.gz' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'3-gram.arpa.gz' already exists and appears to be complete
Downloading file '3-gram.pruned.1e-7.arpa.gz' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'3-gram.pruned.1e-7.arpa.gz' already exists and appears to be complete
Downloading file '3-gram.pruned.3e-7.arpa.gz' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'3-gram.pruned.3e-7.arpa.gz' already exists and appears to be complete
Downloading file '4-gram.arpa.gz' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'4-gram.arpa.gz' already exists and appears to be complete
Downloading file 'g2p-model-5' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'g2p-model-5' already exists and appears to be complete
Downloading file 'librispeech-lm-corpus.tgz' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'librispeech-lm-corpus.tgz' already exists and appears to be complete
Downloading file 'librispeech-vocab.txt' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'librispeech-vocab.txt' already exists and appears to be complete
Downloading file 'librispeech-lexicon.txt' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'librispeech-lexicon.txt' already exists and appears to be complete
utils/data/get_utt2dur.sh: segments file does not exist so getting durations from wave files
utils/data/get_utt2dur.sh: could not get utterance lengths from sphere-file headers, using wav-to-duration
utils/data/get_utt2dur.sh: computed /mnt/kingston/asr-datasets/neural-sp//dev_clean/utt2dur
Usage: utils/validate_data_dir.sh [--no-feats] [--no-text] [--non-print] [--no-wav] [--no-spk-sort] <data-dir>
The --no-xxx options mean that the script does not require 
xxx.scp to be present, but it will check it if it is present.
--no-spk-sort means that the script does not require the utt2spk to be 
sorted by the speaker-id in addition to being sorted by utterance-id.
--non-print ignore the presence of non-printable characters.
By default, utt2spk is expected to be sorted by both, which can be 
achieved by making the speaker-id prefixes of the utterance-ids
e.g.: utils/validate_data_dir.sh data/train

jiwidi avatar Dec 30 '20 12:12 jiwidi

@hirofumi0810 I managed to get past the last problem by skipping the data validation step (assuming all the processing went right), and now I'm stuck at LM training, which fails with a cuDNN error. I think it's related to my CUDA installation / RTX 3090 rather than the code; this has already happened to me with other frameworks. I ran pytest at the neural_sp root and all 501 tests passed, so I don't know how to debug it.

Running:

../../../neural_sp/bin/lm/train.py \
    --corpus librispeech \
    --config conf/lm/rnnlm.yaml \
    --n_gpus 1 \
    --cudnn_benchmark true \
    --train_set /n/work2/inaguma/corpus/librispeech/dataset_lm/train_100_vocab100_wpbpe10000_external.tsv \
    --dev_set /n/work2/inaguma/corpus/librispeech/dataset_lm/dev_clean_100_vocab100_wpbpe10000.tsv \
    --eval_sets /n/work2/inaguma/corpus/librispeech/dataset_lm/dev_other_100_vocab100_wpbpe10000.tsv \
                /n/work2/inaguma/corpus/librispeech/dataset_lm/test_clean_100_vocab100_wpbpe10000.tsv \
                /n/work2/inaguma/corpus/librispeech/dataset_lm/test_other_100_vocab100_wpbpe10000.tsv \
    --unit wp \
    --dict /n/work2/inaguma/corpus/librispeech/dict/train_100_wpbpe10000.txt \
    --wp_model /n/work2/inaguma/corpus/librispeech/dict/train_100_bpe10000.model \
    --model_save_dir /n/work2/inaguma/results/librispeech/lm \
    --stdout true --resume

Generates this error:

2021-01-03 20:36:39,060 neural_sp.models.base line:108 INFO: torch.backends.cudnn.enabled: True
Traceback (most recent call last):
  File "../../../neural_sp/bin/lm/train.py", line 347, in <module>
    save_path = pr.runcall(main)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/cProfile.py", line 121, in runcall
    return func(*args, **kw)
  File "../../../neural_sp/bin/lm/train.py", line 178, in main
    model.cuda()
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 260, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 117, in _apply
    self.flatten_parameters()
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 113, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
[1]    172006 bus error (core dumped)  ../../../neural_sp/bin/lm/train.py --corpus librispeech --config  --n_gpus 1 

Have you encountered this error before? Any tips for solving or debugging it?

jiwidi avatar Jan 03 '21 20:01 jiwidi

@jiwidi Setting --benchmark false in run.sh will fix this.

hirofumi0810 avatar Jan 05 '21 17:01 hirofumi0810

@hirofumi0810 Hi! Thanks for the help.

I tried that, and now it fails at another step. It does start the first minibatch, though.

  0%|                                                                                                                           | 0/982390016 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/kingston/github/neural_sp/examples/librispeech/s5/../../../neural_sp/bin/lm/train.py", line 353, in <module>
    save_path = pr.runcall(main)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/cProfile.py", line 121, in runcall
    return func(*args, **kw)
  File "/mnt/kingston/github/neural_sp/examples/librispeech/s5/../../../neural_sp/bin/lm/train.py", line 227, in main
    loss, hidden, observation = model(ys_train, state=hidden)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/kingston/github/neural_sp/neural_sp/models/lm/lm_base.py", line 55, in forward
    loss, state, observation = self._forward(ys, state)
  File "/mnt/kingston/github/neural_sp/neural_sp/models/lm/lm_base.py", line 63, in _forward
    logits, out, new_state = self.decode(ys_in, state=state, mems=state)
  File "/mnt/kingston/github/neural_sp/neural_sp/models/lm/rnnlm.py", line 220, in decode
    ys_emb = self.glu(ys_emb)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/kingston/github/neural_sp/neural_sp/models/modules/glu.py", line 26, in forward
    return F.glu(self.fc(xs), dim=-1)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 67, in forward
    return F.linear(input, self.weight, self.bias)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 1354, in linear
    output = input.matmul(weight.t())
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCBlas.cu:258

Do you know of anyone who has successfully run this code on RTX 3000-series cards?

jiwidi avatar Jan 06 '21 20:01 jiwidi

@jiwidi Are you able to train ASR models in stage-4? (by skipping stage-3)

hirofumi0810 avatar Jan 11 '21 16:01 hirofumi0810

@hirofumi0810 Hi

Sorry, I've been out the last few weeks and this one is busy for me, but I'll try it over the weekend. Thanks.

jiwidi avatar Jan 27 '21 16:01 jiwidi

Facing the same error during installation: nvcc fatal : Unsupported gpu architecture 'compute_30'

agarwalchaitanya avatar Feb 03 '21 08:02 agarwalchaitanya


@jiwidi Hi, my colleague and I have run the model with the aishell2 recipe on an RTX 3090. We had the same compute_30 problem and resolved it by commenting out one or two lines in the relevant CMake file.
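For anyone else hitting this: CUDA 11 dropped support for sm_30, so the general fix is to remove the hard-coded `-gencode arch=compute_30` entry from warp-transducer's CMakeLists.txt before rebuilding. A rough sketch; the `sed` one-liner and the paths are illustrative, not the exact lines we changed:

```shell
# Sketch: delete any stale compute_30 gencode line from warp-transducer's
# CMakeLists.txt, then rebuild from a clean build directory.
# (An RTX 3090 would want arch=compute_86,code=sm_86 instead.)
cd tools/neural_sp/warp-transducer
sed -i '/compute_30/d' CMakeLists.txt
rm -rf build && mkdir build && cd build
cmake .. && make
```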

Neukiru avatar Jan 01 '22 07:01 Neukiru