nan Obj value
Hi, guys:
I'm trying CTC on a big dataset (more than 2000 hours), using steps/train_ctc_parallel_h.sh --nj 3
job 1
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 280196 sequences (343.467Hr): Obj(log[Pzx]) = -34.12 TokenAcc = 65.5557%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 290214 sequences (356.577Hr): Obj(log[Pzx]) = -35.0829 TokenAcc = 66.0633%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 300228 sequences (369.041Hr): Obj(log[Pzx]) = -35.5174 TokenAcc = 64.8673%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 310238 sequences (382.234Hr): Obj(log[Pzx]) = -35.9657 TokenAcc = 65.8774%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 320254 sequences (394.763Hr): Obj(log[Pzx]) = -33.7356 TokenAcc = 66.6929%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 330265 sequences (407.316Hr): Obj(log[Pzx]) = -32.8957 TokenAcc = 67.0366%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 340269 sequences (420.524Hr): Obj(log[Pzx]) = -36.0733 TokenAcc = 66.5803%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 350272 sequences (433.553Hr): Obj(log[Pzx]) = -3.06908e+29 TokenAcc = 13.6926%
job 2
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 290240 sequences (361.23Hr): Obj(log[Pzx]) = -34.8453 TokenAcc = 65.5767%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 300251 sequences (374.192Hr): Obj(log[Pzx]) = -34.6744 TokenAcc = 65.9575%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 310255 sequences (386.489Hr): Obj(log[Pzx]) = -32.8376 TokenAcc = 66.4874%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 320258 sequences (399.27Hr): Obj(log[Pzx]) = -34.0779 TokenAcc = 66.6884%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 330274 sequences (411.385Hr): Obj(log[Pzx]) = -31.8291 TokenAcc = 67.3521%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 340285 sequences (423.992Hr): Obj(log[Pzx]) = -32.077 TokenAcc = 67.5502%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 350304 sequences (437.536Hr): Obj(log[Pzx]) = -3.17996e+29 TokenAcc = 18.6188%
job 3
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 290221 sequences (367.77Hr): Obj(log[Pzx]) = -33.8591 TokenAcc = 65.0273%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 300232 sequences (380.856Hr): Obj(log[Pzx]) = -36.8064 TokenAcc = 65.3367%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 310232 sequences (393.478Hr): Obj(log[Pzx]) = -33.5105 TokenAcc = 65.6533%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 320235 sequences (406.032Hr): Obj(log[Pzx]) = -9.997e+25 TokenAcc = 66.9717%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 330242 sequences (418.355Hr): Obj(log[Pzx]) = -32.2745 TokenAcc = 66.3963%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 340247 sequences (433.924Hr): Obj(log[Pzx]) = -41.3272 TokenAcc = 67.6148%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 350257 sequences (448.286Hr): Obj(log[Pzx]) = nan TokenAcc = 29.8061%
job 2 crash
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 350304 sequences (437.536Hr): Obj(log[Pzx]) = -3.17996e+29 TokenAcc = 18.6188%
LOG (train-ctc-parallel:comm_avg_weights():net/communicator.h:106) Waiting for averaged model at exp/train_phn_l3_c320/nnet/nnet.iter1.avg500
ERROR (train-ctc-parallel:Check():net.cc:397) 'nan' in network parameters
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe apply-cmvn --norm-vars=true --utt2spk=ark:data_fbank/train_nodup_tr/utt2spk scp:data_fbank/train_nodup_tr/cmvn.scp scp:exp/train_phn_l3_c320/feats_tr.2.scp ark:- | add-deltas ark:- ark:- | had nonzero return status 36096
ERROR (train-ctc-parallel:Check():net.cc:397) 'nan' in network parameters
[stack trace: ]
eesen::KaldiGetStackTrace()
eesen::KaldiErrorMessage::~KaldiErrorMessage()
eesen::Net::Check() const
eesen::Net::Write(std::ostream&, bool) const
eesen::Net::Write(std::string const&, bool) const
comm_avg_weights(eesen::Net&, int const&, int const&, int const&, std::string const&, std::string const&)
train-ctc-parallel(main+0x1223) [0x48b1e7]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f777206eec5]
train-ctc-parallel() [0x488aa9]
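The failure pattern above (one worker's objective blowing up, then every job dying with 'nan' in network parameters right after the averaged model is written) is what you would expect from plain model averaging: once any worker's parameters contain a NaN, the element-wise average contains it too, so the bad worker poisons all the others. A minimal sketch of the idea, using a simple list-of-floats parameter representation and a hypothetical guard, not eesen's actual comm_avg_weights code:

```python
import math

def average_models(models):
    """Element-wise average of several workers' parameter lists."""
    n = len(models)
    return [sum(vals) / n for vals in zip(*models)]

def has_nan(params):
    """True if any parameter is NaN."""
    return any(math.isnan(v) for v in params)

# Three workers; worker 2 has diverged and produced a NaN.
w1 = [0.10, -0.20, 0.30]
w2 = [float("nan"), -0.25, 0.28]
w3 = [0.12, -0.18, 0.31]

naive = average_models([w1, w2, w3])
print(has_nan(naive))   # True: one bad worker poisons the average

# A defensive variant: drop workers whose parameters contain NaN
# before averaging, so the healthy workers can continue.
healthy = [m for m in (w1, w2, w3) if not has_nan(m)]
guarded = average_models(healthy)
print(has_nan(guarded))  # False
```

This also explains why the jobs crash at slightly different sequence counts but all in the same averaging interval.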
I'll spend some time on this.
All jobs exploded. What's your model scale and learning rate?
Xingyu
Parallel training from the very beginning is risky. Training with one GPU for one or two iterations and then switching to parallel training might give you greater stability.
@gowayyed may have more insight on this.
@naxingyu
input_feat_dim=120 # dimension of the input features; we will use 40-dimensional fbanks with deltas and double deltas
lstm_layer_num=3 # number of LSTM layers
lstm_cell_dim=320 # number of memory cells in every LSTM layer
--num-sequence 20 --frame-num-limit 25000 --learn-rate 0.00004
I agree with Yajie that the training will be very unstable if you parallelize it from the beginning. Part of the reason is that, aside from natural gradient, nnet1 lacks the tricks for penalizing the parameters in parallel training that nnet2 and nnet3 apply, especially and extensively for training LSTMs. And the last time I checked (half a year ago...), the parallel training of nnet1 in Eesen was still far from properly tuned...
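For reference, one widely used stabilization trick of this kind is clipping the gradient's global norm before each update, which is standard practice for LSTM training. A generic sketch of global-norm clipping (an illustration only, not eesen's nnet1 code):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient values so their global L2 norm does
    not exceed max_norm; leave them untouched otherwise."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# A gradient spike like the ones that precede the blow-ups above:
spiky = [3000.0, -4000.0]                       # global norm = 5000
clipped = clip_by_global_norm(spiky, max_norm=5.0)
print(clipped)  # approximately [3.0, -4.0]; global norm capped at 5.0
```

A spike that would otherwise drive the objective toward -3e+29 is rescaled to a bounded step, while well-behaved gradients pass through unchanged.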
I am currently adding natural gradient to nnet1 in Eesen. Hopefully this will add stability to our parallel training.