audio Train Emformer RNN-T using provided recipe cannot converge

Open thanhtvt opened this issue 3 years ago • 0 comments

🐛 Describe the bug

Thank you for the amazing implementation of Emformer and its training recipe. However, when I train from scratch using this recipe, the model does not converge: it stops improving after only about 6-7 epochs (see the TensorBoard image below).

[Image: tensorboard]

I trained on the 100-hour training set of LibriSpeech. I also re-created the global_stats.json file so that it fits only the 100-hour data, using this script (with the other LibriSpeech sets commented out, of course).
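For reference, here is a minimal sketch of the kind of computation a global-stats script performs: accumulating a per-dimension mean and inverse standard deviation over every feature frame in the chosen subset. This is not the recipe's actual script. The function names, the `mean` / `invstddev` key names, and the toy two-dimensional features are assumptions for illustration only.

```python
import json
import math
from typing import Dict, Iterable, List, Sequence


def compute_global_stats(
    utterances: Iterable[Sequence[Sequence[float]]],
    eps: float = 1e-8,
) -> Dict[str, List[float]]:
    """Accumulate per-dimension mean and inverse stddev over all frames.

    `utterances` yields one sequence of feature frames per utterance.
    Hypothetical helper; the key names below are assumed, not confirmed.
    """
    total: List[float] = []
    total_sq: List[float] = []
    num_frames = 0
    for frames in utterances:
        for frame in frames:
            if not total:
                # Lazily size the accumulators from the first frame.
                total = [0.0] * len(frame)
                total_sq = [0.0] * len(frame)
            for i, value in enumerate(frame):
                total[i] += value
                total_sq[i] += value * value
            num_frames += 1
    if num_frames == 0:
        raise ValueError("no frames were provided")
    mean = [s / num_frames for s in total]
    # Population variance E[x^2] - E[x]^2, clamped at zero against rounding.
    var = [max(sq / num_frames - m * m, 0.0) for sq, m in zip(total_sq, mean)]
    invstddev = [1.0 / math.sqrt(v + eps) for v in var]
    return {"mean": mean, "invstddev": invstddev}


def normalize(frame: Sequence[float], stats: Dict[str, List[float]]) -> List[float]:
    """Apply (x - mean) * invstddev to a single frame."""
    return [(x - m) * s for x, m, s in zip(frame, stats["mean"], stats["invstddev"])]


if __name__ == "__main__":
    # Toy features standing in for the 100-hour subset (assumed data).
    toy = [[[1.0, 2.0], [3.0, 2.0]], [[5.0, 2.0]]]
    print(json.dumps(compute_global_stats(toy), indent=2))
```

If the statistics were computed over a different subset than the one used for training, or over features extracted with different parameters, the normalized inputs would be shifted or scaled, so this is one thing worth double-checking.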

Is there something I am missing? Could you please look into it? Thank you very much!

Versions

PyTorch version: 1.12.1
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Libc version: glibc-2.27

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-122-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version: 510.85.02
cuDNN version: Probably one of the following:
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] pytorch-lightning==1.7.0
[pip3] torch==1.12.1
[pip3] torch-tb-profiler==0.4.0
[pip3] torchaudio==0.12.1
[pip3] torchmetrics==0.9.3
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.6.0              hecad31d_10    conda-forge
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py38h7f8727e_0  
[conda] mkl_fft                   1.3.1            py38hd3c417c_0  
[conda] mkl_random                1.2.2            py38h51133e4_0  
[conda] numpy                     1.23.1           py38h6c91a56_0  
[conda] numpy-base                1.23.1           py38ha15fc14_0  
[conda] pytorch                   1.12.1          py3.8_cuda11.6_cudnn8.3.2_0    pytorch
[conda] pytorch-lightning         1.7.0                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch                     1.12.1                   pypi_0    pypi
[conda] torch-tb-profiler         0.4.0                    pypi_0    pypi
[conda] torchaudio                0.12.1                   pypi_0    pypi
[conda] torchmetrics              0.9.3                    pypi_0    pypi

Aug 12 '22 03:08 thanhtvt
