Missing batch statistics
Bug description
At a certain point in the data, the batch statistics somehow "disappear" or go missing. I recall there was an issue where the offending batch gets written to a temp file, but I've been using --shuffle-in-ram, so I'm not sure where to find the bad batch. Maybe it's https://github.com/marian-nmt/marian-dev/issues/480 ?
I will try to re-run without --shuffle-in-ram and see if it gets past that same batch. Meanwhile, is it possible to skip a batch and move on to the next one if its statistics are missing? (A sketch of the lookup that aborts follows the call stack below.)
[2022-07-03 20:27:36] Saving Adam parameters
[2022-07-03 20:27:37] [training] Saving training checkpoint to /home/ubuntu/stash/transliterate/cjk-transliterate-6+6-8-1024-4096-12000-0.0001-0.1/cjk-zz-r42-controllable/model.npz and /home/ubuntu/stash/transliterate/cjk-transliterate-6+6-8-1024-4096-12000-0.0001-0.1/cjk-zz-r42-controllable/model.npz.optimizer.npz
[2022-07-03 20:47:31] Saving model weights and runtime parameters to /home/ubuntu/stash/transliterate/cjk-transliterate-6+6-8-1024-4096-12000-0.0001-0.1/cjk-zz-r42-controllable/model.npz.best-ce-mean-words.npz
[2022-07-03 20:47:38] [valid] Ep. 1 : Up. 13000 : ce-mean-words : 0.225207 : new best
[2022-07-03 21:07:02] Saving model weights and runtime parameters to /home/ubuntu/stash/transliterate/cjk-transliterate-6+6-8-1024-4096-12000-0.0001-0.1/cjk-zz-r42-controllable/model.npz.best-perplexity.npz
[2022-07-03 21:07:08] [valid] Ep. 1 : Up. 13000 : perplexity : 1.25258 : new best
[2022-07-03 21:07:17] Error: Missing batch statistics
[2022-07-03 21:07:17] Error: Aborted from size_t marian::data::BatchStats::findBatchSize(const std::vector<long unsigned int>&, marian::data::BatchStats::const_iterator&) const in /home/ubuntu/marian/src/data/batch_stats.h:38
[CALL STACK]
[0x5654ccc315a7] marian::data::BatchStats:: findBatchSize (std::vector<unsigned long,std::allocator<unsigned long>> const&, std::_Rb_tree_const_iterator<std::pair<std::vector<unsigned long,std::allocator<unsigned long>> const,unsigned long>>&) const + 0x277
[0x5654ccc8800c] marian::data::BatchGenerator<marian::data::CorpusBase>:: fetchBatches () + 0x181c
[0x5654ccc889b3] marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(std::result_of&&,(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)...)::{lambda()#1}:: operator() () const + 0x33
[0x5654ccc89573] std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(std::result_of&&,(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)...)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>::_M_run()::{lambda()#1},std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>>:: _M_invoke (std::_Any_data const&) + 0x53
[0x5654ccbb63ad] std::__future_base::_State_baseV2:: _M_do_set (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 0x2d
[0x7fa69415747f] + 0x1147f
[0x5654ccbc1710] std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(std::result_of&&,(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)...)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>:: _M_run () + 0x120
[0x5654ccbb7a30] std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>:: _M_run () + 0x180
[0x5654ceebd5b4] + 0x34745b4
[0x7fa69414e609] + 0x8609
[0x7fa693f24163] clone + 0x43
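For context on where the abort comes from: with --mini-batch-fit, marian pre-runs the model to record, per sentence-length profile, the largest batch size that fits the --workspace, and fetchBatches() later looks the current batch's lengths up in that table; batch_stats.h:38 aborts when that lookup finds nothing. Below is a minimal self-contained sketch of that kind of lookup — my own simplification for illustration, not marian's actual BatchStats code; the length profiles and sizes are made up:

#include <cstddef>
#include <iostream>
#include <map>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the stats table that --mini-batch-fit builds:
// a map from (source, target) length profiles to the largest batch size
// that fit the workspace during the pre-run.
using LengthProfile = std::vector<size_t>;
using StatsMap = std::map<LengthProfile, size_t>;

size_t findBatchSize(const StatsMap& stats, const LengthProfile& lengths) {
  // First recorded profile that compares >= the requested one.
  auto it = stats.lower_bound(lengths);
  if (it == stats.end())
    // No recorded profile covers this batch: the "Missing batch statistics" case.
    throw std::runtime_error("Missing batch statistics");
  return it->second;
}

int main() {
  StatsMap stats = {{{256, 256}, 64},    // up to 256 tokens per side: 64 sentences
                    {{1024, 1024}, 8}};  // longest profile recorded in the pre-run
  std::cout << findBatchSize(stats, {200, 180}) << "\n";  // fine: 64
  findBatchSize(stats, {4800, 4700});  // throws: longer than anything recorded
}

If something like this is what happens here, a skip-and-continue option would mean catching this case in fetchBatches() and dropping the batch instead of aborting.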
How to reproduce
See next comment.
Context
- Marian version: v1.11.0 f00d0621 2022-02-08 08:39:24 -0800
Data
https://drive.google.com/file/d/1hr-RzBz-5zCMhwbi4ogzPulVB1ZLCDUI/view?usp=sharing
Command
#!/bin/bash
SRC=fr # en
TRG=xx # ja
RANDSEED=42 # 42
ELAYERS=6 # 6
DLAYERS=6
HEADS=8 # 8
DIMEMB=1024 # 1024
DIMTRA=4096 # 4096
VOCABSIZE=8000 # 32000
LR=0.0001 # 0.0001
DROPOUT=0.1 # 0.1
MODELDIR=$HOME/stash/fdi-$ELAYERS+$DLAYERS-$HEADS-$DIMEMB-$DIMTRA-$VOCABSIZE-$LR-$DROPOUT/$SRC-$TRG-r$RANDSEED/
mkdir -p $MODELDIR
DATADIR=$HOME/stash/fdi-data
TRAIN_SRC=$DATADIR/train.$SRC-$TRG.$SRC
TRAIN_TRG=$DATADIR/train.$SRC-$TRG.$TRG
VALID_SRC=$DATADIR/valid.$SRC-$TRG.$SRC
VALID_TRG=$DATADIR/valid.$SRC-$TRG.$TRG
TRAINLOG=$MODELDIR/train.log
VALIDLOG=$MODELDIR/valid.log
GPUS="0"
WORKSPACE=10185 # Assumes 11GB RAM on GPU
MARIAN=$HOME/marian/build/marian
$MARIAN --model $MODELDIR/model.npz --type transformer \
--train-sets $TRAIN_SRC $TRAIN_TRG \
--vocabs $MODELDIR/vocab.src.spm $MODELDIR/vocab.src.spm \
--dim-vocabs $VOCABSIZE $VOCABSIZE \
--valid-freq 500 --save-freq 500 --disp-freq 500 \
--valid-metrics ce-mean-words perplexity \
--valid-sets $VALID_SRC $VALID_TRG \
--quiet-translation \
--beam-size 12 --normalize=0.6 \
--valid-mini-batch 16 \
--early-stopping 5 --cost-type=ce-mean-words \
--log $TRAINLOG --valid-log $VALIDLOG \
--enc-depth $ELAYERS --dec-depth $DLAYERS \
--transformer-preprocess n --transformer-postprocess da \
--tied-embeddings-all --dim-emb $DIMEMB --transformer-dim-ffn $DIMTRA \
--transformer-dropout $DROPOUT --transformer-dropout-attention $DROPOUT \
--transformer-dropout-ffn $DROPOUT --label-smoothing $DROPOUT \
--learn-rate $LR \
--lr-warmup 8000 --lr-decay-inv-sqrt 8000 --lr-report \
--optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
--devices $GPUS --workspace $WORKSPACE --optimizer-delay 1 --sync-sgd --seed $RANDSEED \
--exponential-smoothing \
--keep-best \
--max-length 5000 --valid-max-length 5000 --max-length-crop \
--shuffle-in-ram --mini-batch-fit \
--sentencepiece-options "--character_coverage=1.0 --user_defined_symbols=BE,CA,CH,FR"
Logfile
https://gist.github.com/alvations/9da72d5458c409e8971ee3c65d550a85
Hardware
$ nvidia-smi
Mon Jul 4 22:06:50 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:06:00.0 Off | Off |
| 30% 36C P8 18W / 300W | 1MiB / 48685MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This is interesting:

- Broke batching: --max-length 5000 --valid-max-length 5000
- Broke batching: --max-length 2000 --valid-max-length 2000
- Seems to work: --max-length 1000 --valid-max-length 1000
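One plausible reading of that pattern (an assumption on my part, not confirmed against marian's sources): the --mini-batch-fit pre-run only records stats for length buckets where at least one sentence fits the --workspace, so raising --max-length lets real batches land in buckets that have no entry, while --max-length 1000 keeps everything inside the recorded range. A toy pre-run showing how a length cap and a fixed memory budget can leave such holes — the cost model is invented purely for illustration:

#include <cstddef>
#include <iostream>
#include <map>
#include <vector>

using LengthProfile = std::vector<size_t>;
using StatsMap = std::map<LengthProfile, size_t>;

// Toy stand-in for the --mini-batch-fit pre-run: for each length bucket up
// to maxLength, record the largest batch that fits a token budget. The cost
// model (2 * len tokens per sentence pair) is made up for illustration.
StatsMap precomputeStats(size_t maxLength, size_t workspaceTokens) {
  StatsMap stats;
  for (size_t len = 128; len <= maxLength; len += 128) {
    size_t fit = workspaceTokens / (2 * len);
    if (fit >= 1)
      stats[{len, len}] = fit;
    // else: the bucket stays empty, and any later batch that lands here
    // has no statistics to look up.
  }
  return stats;
}

int main() {
  // With a budget that only covers sentences up to ~2048 tokens, buckets
  // between 2048 and maxLength = 5000 never get an entry:
  auto stats = precomputeStats(5000, 4096);
  std::cout << "recorded buckets: " << stats.size() << "\n";              // 16
  std::cout << "longest recorded: " << stats.rbegin()->first[0] << "\n";  // 2048
}

Whether marian's real pre-run behaves this way on this data is exactly what the --max-length sweep above suggests but does not prove.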