
Missing batch statistics

alvations opened this issue 3 years ago · 2 comments

Bug description

At a certain point in the data, the batch statistics somehow "disappear" or go missing. I recall there was an issue where the offending batch gets written out to a temp file, but I've been using --shuffle-in-ram, so I'm not sure where to find the bad batch. Maybe it's https://github.com/marian-nmt/marian-dev/issues/480 ?

I will try to re-run without --shuffle-in-ram and see if it gets past that same batch. Meanwhile, is it possible to skip a batch and move on to the next one if its statistics are missing?
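
For what it's worth, a skip-and-continue fallback around the lookup that aborts (batch_stats.h:38 in the trace below) could look roughly like this. This is a hypothetical sketch only: the map layout, the names, and the lower_bound semantics are assumptions for illustration, not marian's actual implementation.

#include <cstddef>
#include <map>
#include <vector>

using LengthKey = std::vector<std::size_t>;          // max sentence length per input stream
using StatsMap  = std::map<LengthKey, std::size_t>;  // fitted lengths -> batch size

// Return the fitted batch size for `lengths`, or fall back to a batch of 1
// (process the offending sentences one at a time) instead of aborting when
// no precomputed entry covers them.
std::size_t findBatchSizeOrSkip(const StatsMap& stats, const LengthKey& lengths) {
  auto it = stats.lower_bound(lengths);  // first entry not ordered before the key
  if (it == stats.end())
    return 1;  // where the current code aborts with "Missing batch statistics"
  return it->second;
}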

[2022-07-03 20:27:36] Saving Adam parameters
[2022-07-03 20:27:37] [training] Saving training checkpoint to /home/ubuntu/stash/transliterate/cjk-transliterate-6+6-8-1024-4096-12000-0.0001-0.1/cjk-zz-r42-controllable/model.npz and /home/ubuntu/stash/transliterate/cjk-transliterate-6+6-8-1024-4096-12000-0.0001-0.1/cjk-zz-r42-controllable/model.npz.optimizer.npz
[2022-07-03 20:47:31] Saving model weights and runtime parameters to /home/ubuntu/stash/transliterate/cjk-transliterate-6+6-8-1024-4096-12000-0.0001-0.1/cjk-zz-r42-controllable/model.npz.best-ce-mean-words.npz
[2022-07-03 20:47:38] [valid] Ep. 1 : Up. 13000 : ce-mean-words : 0.225207 : new best
[2022-07-03 21:07:02] Saving model weights and runtime parameters to /home/ubuntu/stash/transliterate/cjk-transliterate-6+6-8-1024-4096-12000-0.0001-0.1/cjk-zz-r42-controllable/model.npz.best-perplexity.npz
[2022-07-03 21:07:08] [valid] Ep. 1 : Up. 13000 : perplexity : 1.25258 : new best
[2022-07-03 21:07:17] Error: Missing batch statistics
[2022-07-03 21:07:17] Error: Aborted from size_t marian::data::BatchStats::findBatchSize(const std::vector<long unsigned int>&, marian::data::BatchStats::const_iterator&) const in /home/ubuntu/marian/src/data/batch_stats.h:38

[CALL STACK]
[0x5654ccc315a7]    marian::data::BatchStats::  findBatchSize  (std::vector<unsigned long,std::allocator<unsigned long>> const&,  std::_Rb_tree_const_iterator<std::pair<std::vector<unsigned long,std::allocator<unsigned long>> const,unsigned long>>&) const + 0x277
[0x5654ccc8800c]    marian::data::BatchGenerator<marian::data::CorpusBase>::  fetchBatches  () + 0x181c
[0x5654ccc889b3]    marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(std::result_of&&,(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)...)::{lambda()#1}::  operator()  () const + 0x33
[0x5654ccc89573]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(std::result_of&&,(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)...)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>::_M_run()::{lambda()#1},std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>>::  _M_invoke  (std::_Any_data const&) + 0x53
[0x5654ccbb63ad]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x2d
[0x7fa69415747f]                                                       + 0x1147f
[0x5654ccbc1710]    std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(std::result_of&&,(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)...)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr>> ()>::  _M_run  () + 0x120
[0x5654ccbb7a30]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x180
[0x5654ceebd5b4]                                                       + 0x34745b4
[0x7fa69414e609]                                                       + 0x8609
[0x7fa693f24163]    clone                                              + 0x43
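
Reading the stack: the abort fires on a background prefetch thread, since fetchBatchesAsync() enqueues fetchBatches() on a ThreadPool and hands back a future of a deque of batches. A stripped-down sketch of that pattern, with Batch as a stand-in for marian::data::CorpusBatch and std::async standing in for the pool:

#include <deque>
#include <future>
#include <memory>

struct Batch {};  // stand-in for marian::data::CorpusBatch

// The real code fills a deque of batches, calling findBatchSize() per batch;
// that is where the abort in the trace above fires.
std::deque<std::shared_ptr<Batch>> fetchBatches() { return {}; }

// A worker thread runs fetchBatches(); the training loop only touches the
// result through the future, which is why the failure surfaces from a pool
// thread rather than from the main loop.
std::future<std::deque<std::shared_ptr<Batch>>> fetchBatchesAsync() {
  return std::async(std::launch::async, [] { return fetchBatches(); });
}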

How to reproduce

See next comment.

Context

  • Marian version: v1.11.0 f00d0621 2022-02-08 08:39:24 -0800

alvations commented Jul 04 '22

Data

https://drive.google.com/file/d/1hr-RzBz-5zCMhwbi4ogzPulVB1ZLCDUI/view?usp=sharing

Marian version

v1.11.0 f00d0621 2022-02-08 08:39:24 -0800

Command

#!/bin/bash

SRC=fr             # en
TRG=xx             # ja
RANDSEED=42        # 42
ELAYERS=6          # 6
DLAYERS=6
HEADS=8            # 8
DIMEMB=1024        # 1024
DIMTRA=4096        # 4096
VOCABSIZE=8000     # 32000
LR=0.0001          # 0.0001
DROPOUT=0.1        # 0.1

MODELDIR=$HOME/stash/fdi-$ELAYERS+$DLAYERS-$HEADS-$DIMEMB-$DIMTRA-$VOCABSIZE-$LR-$DROPOUT/$SRC-$TRG-r$RANDSEED/

mkdir -p $MODELDIR

DATADIR=$HOME/stash/fdi-data

TRAIN_SRC=$DATADIR/train.$SRC-$TRG.$SRC
TRAIN_TRG=$DATADIR/train.$SRC-$TRG.$TRG
VALID_SRC=$DATADIR/valid.$SRC-$TRG.$SRC
VALID_TRG=$DATADIR/valid.$SRC-$TRG.$TRG
TRAINLOG=$MODELDIR/train.log
VALIDLOG=$MODELDIR/valid.log

GPUS="0"
WORKSPACE=10185  # Assumes 11GB RAM on GPU

MARIAN=$HOME/marian/build/marian

$MARIAN --model $MODELDIR/model.npz --type transformer \
--train-sets $TRAIN_SRC $TRAIN_TRG \
--vocabs $MODELDIR/vocab.src.spm $MODELDIR/vocab.src.spm \
--dim-vocabs $VOCABSIZE $VOCABSIZE \
--valid-freq 500 --save-freq 500 --disp-freq 500 \
--valid-metrics ce-mean-words perplexity  \
--valid-sets $VALID_SRC $VALID_TRG \
--quiet-translation \
--beam-size 12 --normalize=0.6 \
--valid-mini-batch 16 \
--early-stopping 5 --cost-type=ce-mean-words \
--log $TRAINLOG --valid-log $VALIDLOG \
--enc-depth $ELAYERS --dec-depth $DLAYERS \
--transformer-preprocess n --transformer-postprocess da \
--tied-embeddings-all --dim-emb $DIMEMB --transformer-dim-ffn $DIMTRA \
--transformer-dropout $DROPOUT --transformer-dropout-attention $DROPOUT \
--transformer-dropout-ffn $DROPOUT --label-smoothing $DROPOUT \
--learn-rate $LR \
--lr-warmup 8000 --lr-decay-inv-sqrt 8000 --lr-report \
--optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
--devices $GPUS --workspace $WORKSPACE  --optimizer-delay 1 --sync-sgd --seed $RANDSEED \
--exponential-smoothing \
--keep-best \
--max-length 5000 --valid-max-length 5000 --max-length-crop \
--shuffle-in-ram --mini-batch-fit \
--sentencepiece-options "--character_coverage=1.0 --user_defined_symbols=BE,CA,CH,FR"

Logfile

https://gist.github.com/alvations/9da72d5458c409e8971ee3c65d550a85

Hardware

$ nvidia-smi 
Mon Jul  4 22:06:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:06:00.0 Off |                  Off |
| 30%   36C    P8    18W / 300W |      1MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

alvations commented Jul 04 '22

This is interesting (a possible explanation is sketched after this list):

  • Broke batching: --max-length 5000 --valid-max-length 5000

  • Broke batching: --max-length 2000 --valid-max-length 2000

  • Seems to work: --max-length 1000 --valid-max-length 1000
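
A plausible explanation (an assumption, not confirmed from the source): with --mini-batch-fit, marian precomputes, for each sentence length up to --max-length, the largest batch that fits in the --workspace memory. At --max-length 5000, a single sentence through a 6+6-layer transformer may already exceed the 10185 MiB workspace, so those lengths never get a table entry, and a long sentence later triggers the "Missing batch statistics" abort. A toy sketch (made-up cost model, not marian's code) of how such a table could end up with holes at the top:

#include <cstddef>
#include <map>

// Toy memory model: cost of one sentence grows quadratically with length
// (self-attention); the constant is invented for illustration.
std::size_t costOf(std::size_t len) { return 4096 * len * len; }

// length -> largest batch size that fits in the workspace. Lengths whose
// single-sentence cost already exceeds the workspace get no entry at all,
// so a later lookup for such a length finds nothing.
std::map<std::size_t, std::size_t> buildStats(std::size_t maxLength,
                                              std::size_t workspaceBytes) {
  std::map<std::size_t, std::size_t> stats;
  for (std::size_t len = 1; len <= maxLength; ++len) {
    std::size_t batch = workspaceBytes / costOf(len);
    if (batch >= 1)
      stats[len] = batch;
  }
  return stats;
}

If that is the mechanism, capping --max-length (as observed above) or raising --workspace should both avoid the missing entries.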

alvations commented Jul 04 '22