A bug in Kaldi multi_cn run.sh

Open liyuhui opened this issue 2 years ago • 3 comments

I used multi_cn/s5/run.sh to train a model. In the last step, local/chain/run_cnn_tdnn.sh, training failed with the following error:

run.pl: job failed, log is in exp/chain_cleaned/tdnn_cnn_1a_sp/log/train.1.3.log
2022-03-28 05:58:19,735 [/opt/kaldi/egs/multi_cn/s5/steps/libs/common.py:236 - background_command_waiter - ERROR ] Command exited with status 1: run.pl --mem 4G --gpu 1 exp/chain_cleaned/tdnn_cnn_1a_sp/log/train.1.3.log                     nnet3-chain-train --use-gpu=wait                      --apply-deriv-weights=False                     --l2-regularize=0.0 --leaky-hmm-coefficient=0.1                     --read-cache=exp/chain_cleaned/tdnn_cnn_1a_sp/cache.1  --xent-regularize=0.1                                          --print-interval=10 --momentum=0.0                     --max-param-change=2.0                     --backstitch-training-scale=0.0                     --backstitch-training-interval=1                     --l2-regularize-factor=0.3333333333333333 --optimization.memory-compression-level=2                     --srand=1                     "nnet3-am-copy --raw=true --learning-rate=0.00044762974304870596 --scale=1.0 exp/chain_cleaned/tdnn_cnn_1a_sp/1.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.0' - - |" exp/chain_cleaned/tdnn_cnn_1a_sp/den.fst                     "ark,bg:nnet3-chain-copy-egs                          --frame-shift=0                         ark:exp/chain_cleaned/tdnn_cnn_1a_sp/egs/cegs.6.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=1 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=128,64 ark:- ark:- |"                     exp/chain_cleaned/tdnn_cnn_1a_sp/2.3.raw

The log file exp/chain_cleaned/tdnn_cnn_1a_sp/log/train.1.3.log says:

ERROR (nnet3-chain-train[5.5]:ExecuteCommand():nnet-compute.cc:445) Error running command c247: tdnnf17.relu.Propagate(NULL, m126, &m126)

[ Stack-Trace: ]
/opt/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f9787f912aa]
nnet3-chain-train(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x4115d1]
/opt/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x13d1) [0x7f9789fef00f]
/opt/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::Run()+0x18a) [0x7f9789fef22e]
/opt/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0x5b) [0x7f978a04116d]
/opt/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0x19d) [0x7f978a0415c5]
nnet3-chain-train(main+0x84d) [0x4103f3]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f97870e4830]
nnet3-chain-train(_start+0x29) [0x40fad9]

Would someone be able to help debug this issue?

liyuhui avatar Mar 29 '22 03:03 liyuhui

Just guessing from my own experience. Do you have enough GPU memory and have you set the GPU to exclusive mode?

svenha avatar May 24 '22 11:05 svenha

Yeah, I would also guess GPU memory, system memory, or something like that. y.

jtrmal avatar May 24 '22 14:05 jtrmal
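
Both replies point at GPU memory and compute mode, which can be checked before rerunning the training stage. The following is a minimal sketch, not part of the Kaldi recipe: it assumes `nvidia-smi` is on the PATH and that its CSV query output has the usual `index, memory.used, memory.total, compute_mode` shape. The helper names (`parse_gpu_report`, `warnings_for`) and the 4096 MiB free-memory threshold are hypothetical choices for illustration, and the `Exclusive_Process` string is the value nvidia-smi is assumed to print for that mode.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi; memory values come back in MiB with "nounits".
QUERY_FIELDS = "index,memory.used,memory.total,compute_mode"


def parse_gpu_report(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or unexpected lines
        index, used, total, mode = (field.strip() for field in row)
        try:
            gpus.append({
                "index": int(index),
                "used_mib": int(used),
                "total_mib": int(total),
                "compute_mode": mode,
            })
        except ValueError:
            continue  # e.g. "[N/A]" reported for a memory field
    return gpus


def warnings_for(gpus, min_free_mib=4096):
    """Return human-readable notes; min_free_mib is an arbitrary example threshold."""
    notes = []
    for gpu in gpus:
        free = gpu["total_mib"] - gpu["used_mib"]
        if free < min_free_mib:
            notes.append(f"GPU {gpu['index']}: only {free} MiB free")
        if gpu["compute_mode"] != "Exclusive_Process":
            notes.append(f"GPU {gpu['index']}: compute mode is {gpu['compute_mode']}")
    return notes


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        for note in warnings_for(parse_gpu_report(result.stdout)):
            print(note)
```

If a GPU is reported in `Default` mode, it can usually be switched to exclusive process mode with `sudo nvidia-smi -c 3`. The failing command above runs `nnet3-chain-train` with `--use-gpu=wait`, which waits for a free GPU and so works best when the GPUs are in exclusive mode. Whether this resolves the error in this thread is not confirmed by the original poster.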

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale[bot] avatar Jul 30 '22 22:07 stale[bot]