
nan Obj value

Open · lifeiteng opened this issue Jun 07 '16 · 5 comments

Hi guys, I'm trying CTC on a big dataset (more than 2000 hours), using steps/train_ctc_parallel_h.sh --nj 3.

job 1

VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 280196 sequences (343.467Hr): Obj(log[Pzx]) = -34.12   TokenAcc = 65.5557%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 290214 sequences (356.577Hr): Obj(log[Pzx]) = -35.0829   TokenAcc = 66.0633%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 300228 sequences (369.041Hr): Obj(log[Pzx]) = -35.5174   TokenAcc = 64.8673%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 310238 sequences (382.234Hr): Obj(log[Pzx]) = -35.9657   TokenAcc = 65.8774%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 320254 sequences (394.763Hr): Obj(log[Pzx]) = -33.7356   TokenAcc = 66.6929%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 330265 sequences (407.316Hr): Obj(log[Pzx]) = -32.8957   TokenAcc = 67.0366%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 340269 sequences (420.524Hr): Obj(log[Pzx]) = -36.0733   TokenAcc = 66.5803%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 350272 sequences (433.553Hr): Obj(log[Pzx]) = -3.06908e+29   TokenAcc = 13.6926%

job 2

VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 290240 sequences (361.23Hr): Obj(log[Pzx]) = -34.8453   TokenAcc = 65.5767%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 300251 sequences (374.192Hr): Obj(log[Pzx]) = -34.6744   TokenAcc = 65.9575%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 310255 sequences (386.489Hr): Obj(log[Pzx]) = -32.8376   TokenAcc = 66.4874%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 320258 sequences (399.27Hr): Obj(log[Pzx]) = -34.0779   TokenAcc = 66.6884%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 330274 sequences (411.385Hr): Obj(log[Pzx]) = -31.8291   TokenAcc = 67.3521%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 340285 sequences (423.992Hr): Obj(log[Pzx]) = -32.077   TokenAcc = 67.5502%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 350304 sequences (437.536Hr): Obj(log[Pzx]) = -3.17996e+29   TokenAcc = 18.6188%

job 3

VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 290221 sequences (367.77Hr): Obj(log[Pzx]) = -33.8591   TokenAcc = 65.0273%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 300232 sequences (380.856Hr): Obj(log[Pzx]) = -36.8064   TokenAcc = 65.3367%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 310232 sequences (393.478Hr): Obj(log[Pzx]) = -33.5105   TokenAcc = 65.6533%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 320235 sequences (406.032Hr): Obj(log[Pzx]) = -9.997e+25   TokenAcc = 66.9717%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 330242 sequences (418.355Hr): Obj(log[Pzx]) = -32.2745   TokenAcc = 66.3963%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 340247 sequences (433.924Hr): Obj(log[Pzx]) = -41.3272   TokenAcc = 67.6148%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 350257 sequences (448.286Hr): Obj(log[Pzx]) = nan   TokenAcc = 29.8061%

job 2 crash

VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 350304 sequences (437.536Hr): Obj(log[Pzx]) = -3.17996e+29   TokenAcc = 18.6188%
LOG (train-ctc-parallel:comm_avg_weights():net/communicator.h:106) Waiting for averaged model at exp/train_phn_l3_c320/nnet/nnet.iter1.avg500
ERROR (train-ctc-parallel:Check():net.cc:397) 'nan' in network parameters
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe apply-cmvn --norm-vars=true --utt2spk=ark:data_fbank/train_nodup_tr/utt2spk scp:data_fbank/train_nodup_tr/cmvn.scp scp:exp/train_phn_l3_c320/feats_tr.2.scp ark:- | add-deltas ark:- ark:- | had nonzero return status 36096
ERROR (train-ctc-parallel:Check():net.cc:397) 'nan' in network parameters

[stack trace: ]
eesen::KaldiGetStackTrace()
eesen::KaldiErrorMessage::~KaldiErrorMessage()
eesen::Net::Check() const
eesen::Net::Write(std::ostream&, bool) const
eesen::Net::Write(std::string const&, bool) const
comm_avg_weights(eesen::Net&, int const&, int const&, int const&, std::string const&, std::string const&)
train-ctc-parallel(main+0x1223) [0x48b1e7]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f777206eec5]
train-ctc-parallel() [0x488aa9]

I'll spend some time on this.
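One way to contain this kind of blow-up is to check, before committing a minibatch update or writing the model out for averaging, that the objective and the parameters are still finite, and to skip the batch otherwise. A minimal sketch in plain C++ (not eesen's actual API; the function names and the -1e6 threshold are assumptions):

#include <cmath>
#include <vector>

// Returns true if every parameter is finite (no nan or inf).
bool ParamsFinite(const std::vector<float> &params) {
  for (float p : params)
    if (!std::isfinite(p)) return false;
  return true;
}

// Reject a minibatch whose CTC objective has exploded (values like
// -3.06908e+29 in the logs above would fail the threshold test) or
// whose update would introduce nan/inf into the parameters.
bool ShouldCommitUpdate(double batch_obj, const std::vector<float> &params) {
  return std::isfinite(batch_obj) && batch_obj > -1e6 && ParamsFinite(params);
}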

lifeiteng · Jun 07 '16

All jobs exploded. What's your model scale and learning rate?

Xingyu

naxingyu · Jun 07 '16

Parallel training from the very beginning is risky. Training with one GPU for one or two iterations and then switching to parallel training might give you greater stability.

@gowayyed may have more insight on this.

yajiemiao · Jun 07 '16

@naxingyu

input_feat_dim=120   # dimension of the input features; we will use 40-dimensional fbanks with deltas and double deltas
lstm_layer_num=3    # number of LSTM layers
lstm_cell_dim=320    # number of memory cells in every LSTM layer

 --num-sequence 20 --frame-num-limit 25000 --learn-rate 0.00004

lifeiteng · Jun 07 '16

I agree with Yajie that the training would be very unstable if you parallelize it from the beginning. Part of the reason is that, besides natural gradient, nnet1 does not have the tricks for penalizing the parameters in parallel training that are applied in nnet2 and nnet3, especially and extensively for training LSTMs. And the last time I checked, about half a year ago, the parallel training of nnet1 in Eesen was still far from properly tuned...
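For reference, the main such trick in nnet2/nnet3 is the per-component "max change": rescale each SGD step so the parameter delta cannot exceed a fixed norm. A much-simplified sketch in plain C++ (illustrative only, not Kaldi's or Eesen's actual API; the name UpdateWithMaxChange is made up):

#include <cmath>
#include <cstddef>
#include <vector>

// Scale the SGD step down whenever its L2 norm would exceed max_change,
// so one bad minibatch cannot move the parameters arbitrarily far.
void UpdateWithMaxChange(std::vector<float> &params,
                         const std::vector<float> &grad,
                         float learn_rate, float max_change) {
  double sumsq = 0.0;
  for (float g : grad)
    sumsq += static_cast<double>(learn_rate * g) * (learn_rate * g);
  const double step_norm = std::sqrt(sumsq);
  const float scale =
      step_norm > max_change ? static_cast<float>(max_change / step_norm) : 1.0f;
  for (std::size_t i = 0; i < params.size(); ++i)
    params[i] -= scale * learn_rate * grad[i];
}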

naxingyu · Jun 08 '16

I am currently adding natural gradient to nnet1 in eesen. Hopefully, this will add stability to our parallel training.
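Kaldi's natural-gradient SGD maintains an online low-rank estimate of the Fisher matrix; as a much-simplified illustration of the preconditioning idea only (a diagonal Fisher approximation in plain C++, not the actual NG-SGD algorithm and not eesen's API; the decay and epsilon values are made up):

#include <cstddef>
#include <vector>

// Keep a running estimate of the squared gradient per parameter as a
// diagonal approximation of the Fisher matrix, and divide the gradient
// by it before stepping. (RMSProp-style methods divide by the square
// root instead; the plain division below is the diagonal-Fisher form.)
void DiagonalNaturalGradientStep(std::vector<float> &params,
                                 const std::vector<float> &grad,
                                 std::vector<float> &fisher_diag,  // persists across calls
                                 float learn_rate) {
  const float decay = 0.99f, epsilon = 1e-8f;
  for (std::size_t i = 0; i < params.size(); ++i) {
    fisher_diag[i] = decay * fisher_diag[i] + (1.0f - decay) * grad[i] * grad[i];
    params[i] -= learn_rate * grad[i] / (fisher_diag[i] + epsilon);
  }
}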

gowayyed · Jun 08 '16