marian
Training stuck in validation phase.
I'm training a transformer model on a corpus of 30M sentences with the following command line parameters:
MARIAN_EXEC=~/marian-dev/build/marian
${MARIAN_EXEC} \
--devices 0 \
--type transformer \
--model ${MODEL_HOME_DIR}/model_en-it.npz \
--train-sets ${MODEL_HOME_DIR}/corpus-tr-30M.en ${MODEL_HOME_DIR}/corpus-tr-30M.it \
--vocabs ${MODEL_HOME_DIR}/vocab.en-it.spm ${MODEL_HOME_DIR}/vocab.en-it.spm \
--dim-vocabs 32000 32000 \
--sentencepiece-options '--normalization_rule_name=nmt_nfkc' \
--mini-batch-fit -w 12000 \
--sentencepiece-max-lines 1000000 \
--layer-normalization --tied-embeddings-all \
--dropout-src 0.1 --dropout-trg 0.1 \
--early-stopping 10 --max-length 90 \
--valid-freq 10000 --save-freq 5000 --disp-freq 500 \
--cost-type ce-mean-words --valid-metrics ce-mean-words bleu-detok \
--valid-sets ${MODEL_HOME_DIR}/devset.en ${MODEL_HOME_DIR}/devset.it \
--log ${MODEL_HOME_DIR}/train.log \
--valid-log ${MODEL_HOME_DIR}/valid.log \
--tempdir ${MODEL_HOME_DIR}/temp_files \
--overwrite --keep-best \
--seed 1111 --exponential-smoothing \
--normalize=0.6 --beam-size=6 --quiet-translation \
--learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
--enc-depth 6 --dec-depth 6 \
--optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
--transformer-heads 8 \
--transformer-postprocess-emb d \
--transformer-postprocess dan \
--transformer-dropout 0.1 --label-smoothing 0.1
Training stops after 10000 updates, which is when the first validation run is due (--valid-freq 10000), and the process then runs for days without any progress. With a validation frequency of 5000 I do see the output of the validation step, but training still seems to halt, with the process fully using one CPU core.
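One way to narrow this down, as a sketch rather than a confirmed diagnosis, is to validate on a small subset of the dev set and drop bleu-detok at first, to see whether the translation-based metric is the step that never finishes. The helper name make_subset, the devset-small file names, the 100-line size, and the --valid-freq 500 value are assumptions for illustration; the flags themselves are the ones already used in the command above.

```shell
#!/usr/bin/env bash
# Sketch: build a small validation subset so a validation pass is quick,
# which helps isolate the metric that stalls. Names and sizes are assumptions.

# make_subset SRC DST N: copy the first N lines of SRC to DST.
make_subset() {
  head -n "$3" "$1" > "$2"
}

# Example (not run automatically):
#   make_subset "${MODEL_HOME_DIR}/devset.en" "${MODEL_HOME_DIR}/devset-small.en" 100
#   make_subset "${MODEL_HOME_DIR}/devset.it" "${MODEL_HOME_DIR}/devset-small.it" 100
#
# Then rerun the training command with only these flags changed:
#   --valid-sets ${MODEL_HOME_DIR}/devset-small.en ${MODEL_HOME_DIR}/devset-small.it
#   --valid-metrics ce-mean-words   # first run without bleu-detok
#   --valid-freq 500                # reach the validation step sooner
```

If validation completes with ce-mean-words alone but stalls again once bleu-detok is added back, that would point at the translation step during validation rather than at training itself.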
This is the output of top:
top - 14:41:49 up 4 days, 18:03, 1 user, load average: 1.00, 1.02, 1.00
Tasks: 203 total, 1 running, 202 sleeping, 0 stopped, 0 zombie
%Cpu(s): 25.1 us, 0.2 sy, 0.0 ni, 74.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15718.1 total, 228.6 free, 8498.2 used, 6991.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 6880.6 avail Mem
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 302151 ubuntu    20   0   25.9g   7.8g 147460 S  99.7  51.0 130:38.85 marian
   1475 ubuntu    20   0   13744   9764   2764 S   0.7   0.1   7:59.81 tmux: server
 302792 root     -51   0       0      0      0 S   0.3   0.0   0:31.97 irq/42-nvidia
 302797 root      20   0       0      0      0 S   0.3   0.0   0:14.71 nv_queue
      1 root      20   0  169004  10380   5768 S   0.0   0.1   0:23.19 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.04 kthreadd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-kblo+
      9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
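The top output only shows the marian process as a whole. A per-thread view, together with the age of the log files, may help tell a busy loop apart from a validation pass that is just very slow. The following is a minimal sketch assuming Linux with GNU coreutils and procps; the function name log_age_seconds is hypothetical, and PID 302151 is simply the one from the top output.

```shell
#!/usr/bin/env bash
# Sketch: check whether a run is still writing to its logs, and look at
# per-thread CPU use. Assumes GNU stat (coreutils) and procps ps/top.

# log_age_seconds FILE: print seconds elapsed since FILE was last modified.
log_age_seconds() {
  local now mtime
  now=$(date +%s)
  mtime=$(stat -c %Y "$1")   # GNU stat: modification time as epoch seconds
  echo $(( now - mtime ))
}

# Example (not run automatically):
#   log_age_seconds "${MODEL_HOME_DIR}/train.log"
#   log_age_seconds "${MODEL_HOME_DIR}/valid.log"
#   ps -L -o tid,stat,pcpu,wchan:20,comm -p 302151   # one row per thread
#   top -H -p 302151                                 # live per-thread view
```

A log age that keeps growing while a single thread stays near 100% CPU would suggest that one thread is spinning or working without producing output.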
My configuration is:
- Marian dev 1.9.0
- CUDA 10.2
- Boost 1.71
- NVIDIA Tesla T4
I'm also attaching the train.log.
Any idea what I could be doing wrong? Thanks in advance for your help.