keras-kaldi
Can this code work on a TensorFlow-trained model?
Hi Kumar
I see in steps_kt/decode_seq.sh, line 75:
export KERAS_BACKEND=theano
It seems that this code can only run with the Theano backend. But in my case the model was trained with TensorFlow; will that cause any problem during decoding?
It runs with TensorFlow as well. I set Theano for better parallel decoding. Each TensorFlow process occupies all the GPU memory by default, so if we have to share a single GPU for parallel decoding, we need to set the correct GPU memory fraction for each process based on the number of parallel jobs. Also, switching between GPU and CPU isn't straightforward in TensorFlow. Training and testing can be done with different backends, no issues.
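For reference, here is a minimal sketch of how the per-process GPU memory fraction can be capped when decoding with the TensorFlow backend; it assumes TensorFlow 1.x with standalone Keras, and the fraction 0.24 is only an illustrative value for nj=4:

```python
# Sketch (assumed setup: TensorFlow 1.x + standalone Keras): cap the GPU
# memory each decoding process may take so several parallel jobs can share
# one GPU. Pick the fraction from the number of parallel jobs you run.
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.24  # roughly 1/nj for nj=4
session = tf.Session(config=config)
K.set_session(session)  # make Keras use this constrained session
```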
Thanks!
The reason I ask is that I ran this code with a TensorFlow-trained LSTM model on the CHiME3 task and decoded with Theano. Training went well, but decoding behaves strangely.
The model itself seems fine, as you said.
However, when I run decode_seq.sh with nj=4, jobs 1, 3 and 4 get killed while processing roughly the 120th utterance, while job 2 runs normally. I reran the same task with different job settings, and it still gets killed, at different utterances each time.
Meanwhile, I have tested the original CHiME3 decoding script on the same data and it runs normally. The difference between your code and the original Kaldi decoding code is that you use nnet-forward-seq.py instead of nnet-forward.cc. Perhaps there is a memory-management issue in nnet-forward-seq.py that prevents it from processing too many files.
Have you encountered this issue before?
The error log for job 1 is:
LOG (latgen-faster-mapped[5.2.124~1-70748]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance F05_446C0204_CAF.CH5_REAL is 0.109935 over 456 frames.
Killed
F05_446C0204_STR.CH5_REAL THE CONSENSUS WAS THAT A NEW PIECE OF PAPER ISN'T REQUIRED SAID ONE U. S. TO <UNK>
LOG (latgen-faster-mapped[5.2.124~1-70748]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance F05_446C0204_STR.CH5_REAL is 0.178649 over 464 frames.
WARNING (latgen-faster-mapped[5.2.124~1-70748]:Close():kaldi-io.cc:501) Pipe apply-cmvn --norm-vars=true --utt2spk=ark:data/et05_real_noisy/split4/1/utt2spk scp:data/et05_real_noisy/split4/1/cmvn.scp scp:data/et05_real_noisy/split4/1/feats.scp ark:- | add-deltas ark:- ark:- | steps_kt/nnet-forward-seq.py exp/dnn_5b/dnn.nnet.h5 exp/dnn_5b/dnn.priors.csv 11 | had nonzero return status 35072
ERROR (latgen-faster-mapped[5.2.124~1-70748]:~SequentialTableReaderArchiveImpl():util/kaldi-table-inl.h:678) TableReader: error detected closing archive 'apply-cmvn --norm-vars=true --utt2spk=ark:data/et05_real_noisy/split4/1/utt2spk scp:data/et05_real_noisy/split4/1/cmvn.scp scp:data/et05_real_noisy/split4/1/feats.scp ark:- | add-deltas ark:- ark:- | steps_kt/nnet-forward-seq.py exp/dnn_5b/dnn.nnet.h5 exp/dnn_5b/dnn.priors.csv 11 |'
and for jobs 3 and 4:
LOG (latgen-faster-mapped[5.2.124~1-70748]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance M05_442C0204_CAF.CH5_REAL is 0.17528 over 962 frames.
Killed
M05_442C0205_BUS.CH5_REAL PROCEEDS HE SAID IT PLANS TO PAY FOR YEARS ON JULY THIRTY FIRST TO STOCK OF RECORD JULY SECOND
LOG (latgen-faster-mapped[5.2.124~1-70748]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance M05_442C0205_BUS.CH5_REAL is 0.148915 over 899 frames.
ERROR (latgen-faster-mapped[5.2.124~1-70748]:Read():kaldi-matrix.cc:1465) Failed to read matrix from stream. File position at start is -1, currently -1
It may be a memory issue, especially if you are running all the jobs on the same machine. Parallelising it on a Sun Grid Engine can help. Otherwise, an easy but time-consuming solution is to use nj=1.
If the problem persists, it might be due to a corrupt feature archive. One way to check is to see whether feat-to-len runs successfully to the end and gives non-zero lengths for all the utterances.
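As a rough illustration of that check, here is a small sketch that calls Kaldi's feat-to-len on one split's feats.scp (the path is taken from the logs above and is only an example; it assumes feat-to-len is on your PATH):

```python
# Sketch: run feat-to-len over a feats.scp and flag any utterance whose
# reported length is zero, which would point at a corrupt archive entry.
import subprocess

scp = "data/et05_real_noisy/split4/1/feats.scp"  # example path from the logs above
out = subprocess.check_output(["feat-to-len", "scp:" + scp, "ark,t:-"])
for line in out.decode().splitlines():
    utt, length = line.split()
    if int(length) == 0:
        print("zero-length utterance:", utt)
```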
Let me know if these help.
Thanks for your advice.
The features are OK; I checked each of them with feat-to-len.
I tried decoding on the GPU with the TensorFlow backend and the problem did not appear; memory is released normally.
However, with the Theano CPU backend, even with nj=1, the memory usage keeps growing until the process crashes. If the machine has enough memory, the issue can be avoided.
Maybe it is caused by Theano, or by some incompatibility between Theano, the code, and the TensorFlow-trained model.
It seems to be a Theano problem. My Theano version was 0.9.0, which always causes these memory issues. After I downgraded to 0.8.2, memory usage stays normal.
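A quick way to confirm which Theano version the decoding environment actually picks up after the downgrade:

```python
# Print the Theano version visible to the Python used for decoding.
import theano
print(theano.__version__)  # should report 0.8.2 after the downgrade
```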
Okay.