
Training speed of Transformer-Big

Open vkramgovind opened this issue 5 years ago • 34 comments

Hi ,

I followed the instructions specified in https://github.com/marian-nmt/marian/issues/145. Do they also apply to Transformer-Big? Is this the expected speed?

The hardware is a single NVIDIA P6000 with 24 GB of RAM. Here is my training config:

[2018-12-04 07:58:19] [config] after-batches: 0
[2018-12-04 07:58:19] [config] after-epochs: 0
[2018-12-04 07:58:19] [config] allow-unk: false
[2018-12-04 07:58:19] [config] batch-flexible-lr: false
[2018-12-04 07:58:19] [config] batch-normal-words: 1920
[2018-12-04 07:58:19] [config] beam-size: 6
[2018-12-04 07:58:19] [config] best-deep: false
[2018-12-04 07:58:19] [config] clip-gemm: 0
[2018-12-04 07:58:19] [config] clip-norm: 5
[2018-12-04 07:58:19] [config] cost-type: ce-mean-words
[2018-12-04 07:58:19] [config] cpu-threads: 0
[2018-12-04 07:58:19] [config] data-weighting-type: sentence
[2018-12-04 07:58:19] [config] dec-cell: gru
[2018-12-04 07:58:19] [config] dec-cell-base-depth: 2
[2018-12-04 07:58:19] [config] dec-cell-high-depth: 1
[2018-12-04 07:58:19] [config] dec-depth: 6
[2018-12-04 07:58:19] [config] devices:
[2018-12-04 07:58:19] [config] - 0
[2018-12-04 07:58:19] [config] dim-emb: 1024
[2018-12-04 07:58:19] [config] dim-rnn: 1024
[2018-12-04 07:58:19] [config] dim-vocabs:
[2018-12-04 07:58:19] [config] - 0
[2018-12-04 07:58:19] [config] - 0
[2018-12-04 07:58:19] [config] disp-freq: 500
[2018-12-04 07:58:19] [config] disp-label-counts: false
[2018-12-04 07:58:19] [config] dropout-rnn: 0
[2018-12-04 07:58:19] [config] dropout-src: 0
[2018-12-04 07:58:19] [config] dropout-trg: 0
[2018-12-04 07:58:19] [config] early-stopping: 10
[2018-12-04 07:58:19] [config] embedding-fix-src: false
[2018-12-04 07:58:19] [config] embedding-fix-trg: false
[2018-12-04 07:58:19] [config] embedding-normalization: false
[2018-12-04 07:58:19] [config] enc-cell: gru
[2018-12-04 07:58:19] [config] enc-cell-depth: 1
[2018-12-04 07:58:19] [config] enc-depth: 6
[2018-12-04 07:58:19] [config] enc-type: bidirectional
[2018-12-04 07:58:19] [config] exponential-smoothing: 0.0001
[2018-12-04 07:58:19] [config] grad-dropping-momentum: 0
[2018-12-04 07:58:19] [config] grad-dropping-rate: 0
[2018-12-04 07:58:19] [config] grad-dropping-warmup: 100
[2018-12-04 07:58:19] [config] guided-alignment-cost: ce
[2018-12-04 07:58:19] [config] guided-alignment-weight: 1
[2018-12-04 07:58:19] [config] ignore-model-config: false
[2018-12-04 07:58:19] [config] interpolate-env-vars: false
[2018-12-04 07:58:19] [config] keep-best: true
[2018-12-04 07:58:19] [config] label-smoothing: 0.1
[2018-12-04 07:58:19] [config] layer-normalization: false
[2018-12-04 07:58:19] [config] learn-rate: 0.0002
[2018-12-04 07:58:19] [config] log: model/train.log
[2018-12-04 07:58:19] [config] log-level: info
[2018-12-04 07:58:19] [config] lr-decay: 0
[2018-12-04 07:58:19] [config] lr-decay-freq: 50000
[2018-12-04 07:58:19] [config] lr-decay-inv-sqrt: 8000
[2018-12-04 07:58:19] [config] lr-decay-repeat-warmup: false
[2018-12-04 07:58:19] [config] lr-decay-reset-optimizer: false
[2018-12-04 07:58:19] [config] lr-decay-start:
[2018-12-04 07:58:19] [config] - 10
[2018-12-04 07:58:19] [config] - 1
[2018-12-04 07:58:19] [config] lr-decay-strategy: epoch+stalled
[2018-12-04 07:58:19] [config] lr-report: true
[2018-12-04 07:58:19] [config] lr-warmup: 8000
[2018-12-04 07:58:19] [config] lr-warmup-at-reload: false
[2018-12-04 07:58:19] [config] lr-warmup-cycle: false
[2018-12-04 07:58:19] [config] lr-warmup-start-rate: 0
[2018-12-04 07:58:19] [config] max-length: 100
[2018-12-04 07:58:19] [config] max-length-crop: false
[2018-12-04 07:58:19] [config] max-length-factor: 3
[2018-12-04 07:58:19] [config] maxi-batch: 1000
[2018-12-04 07:58:19] [config] maxi-batch-sort: trg
[2018-12-04 07:58:19] [config] mini-batch: 1000
[2018-12-04 07:58:19] [config] mini-batch-fit: true
[2018-12-04 07:58:19] [config] mini-batch-fit-step: 2
[2018-12-04 07:58:19] [config] mini-batch-words: 0
[2018-12-04 07:58:19] [config] model: model/model.npz
[2018-12-04 07:58:19] [config] multi-node: false
[2018-12-04 07:58:19] [config] multi-node-overlap: true
[2018-12-04 07:58:19] [config] n-best: false
[2018-12-04 07:58:19] [config] no-reload: false
[2018-12-04 07:58:19] [config] no-restore-corpus: false
[2018-12-04 07:58:19] [config] no-shuffle: false
[2018-12-04 07:58:19] [config] normalize: 0.6
[2018-12-04 07:58:19] [config] optimizer: adam
[2018-12-04 07:58:19] [config] optimizer-delay: 1
[2018-12-04 07:58:19] [config] optimizer-params:
[2018-12-04 07:58:19] [config] - 0.9
[2018-12-04 07:58:19] [config] - 0.98
[2018-12-04 07:58:19] [config] - 1e-09
[2018-12-04 07:58:19] [config] overwrite: true
[2018-12-04 07:58:19] [config] quiet: false
[2018-12-04 07:58:19] [config] quiet-translation: true
[2018-12-04 07:58:19] [config] relative-paths: false
[2018-12-04 07:58:19] [config] right-left: false
[2018-12-04 07:58:19] [config] save-freq: 5000
[2018-12-04 07:58:19] [config] seed: 1111
[2018-12-04 07:58:19] [config] skip: false
[2018-12-04 07:58:19] [config] sqlite: ""
[2018-12-04 07:58:19] [config] sqlite-drop: false
[2018-12-04 07:58:19] [config] sync-sgd: true
[2018-12-04 07:58:19] [config] tempdir: /tmp
[2018-12-04 07:58:19] [config] tied-embeddings: false
[2018-12-04 07:58:19] [config] tied-embeddings-all: true
[2018-12-04 07:58:19] [config] tied-embeddings-src: false
[2018-12-04 07:58:19] [config] train-sets:
[2018-12-04 07:58:19] [config] - data/corpus.bpe.en
[2018-12-04 07:58:19] [config] - data/corpus.bpe.de
[2018-12-04 07:58:19] [config] transformer-aan-activation: swish
[2018-12-04 07:58:19] [config] transformer-aan-depth: 2
[2018-12-04 07:58:19] [config] transformer-aan-nogate: false
[2018-12-04 07:58:19] [config] transformer-decoder-autoreg: average-attention
[2018-12-04 07:58:19] [config] transformer-dim-aan: 2048
[2018-12-04 07:58:19] [config] transformer-dim-ffn: 4096
[2018-12-04 07:58:19] [config] transformer-dropout: 0.1
[2018-12-04 07:58:19] [config] transformer-dropout-attention: 0.1
[2018-12-04 07:58:19] [config] transformer-dropout-ffn: 0.1
[2018-12-04 07:58:19] [config] transformer-ffn-activation: relu
[2018-12-04 07:58:19] [config] transformer-ffn-depth: 2
[2018-12-04 07:58:19] [config] transformer-heads: 16
[2018-12-04 07:58:19] [config] transformer-no-projection: false
[2018-12-04 07:58:19] [config] transformer-postprocess: dan
[2018-12-04 07:58:19] [config] transformer-postprocess-emb: d
[2018-12-04 07:58:19] [config] transformer-preprocess: d
[2018-12-04 07:58:19] [config] transformer-tied-layers:
[2018-12-04 07:58:19] [config] []
[2018-12-04 07:58:19] [config] type: transformer
[2018-12-04 07:58:19] [config] valid-freq: 5000
[2018-12-04 07:58:19] [config] valid-log: model/valid.log
[2018-12-04 07:58:19] [config] valid-max-length: 1000
[2018-12-04 07:58:19] [config] valid-metrics:
[2018-12-04 07:58:19] [config] - ce-mean-words
[2018-12-04 07:58:19] [config] - perplexity
[2018-12-04 07:58:19] [config] - translation
[2018-12-04 07:58:19] [config] valid-mini-batch: 64
[2018-12-04 07:58:19] [config] valid-script-path: ./scripts/validate.sh
[2018-12-04 07:58:19] [config] valid-sets:
[2018-12-04 07:58:19] [config] - data/valid.bpe.en
[2018-12-04 07:58:19] [config] - data/valid.bpe.de
[2018-12-04 07:58:19] [config] valid-translation-output: data/valid.bpe.en.output
[2018-12-04 07:58:19] [config] vocabs:
[2018-12-04 07:58:19] [config] - model/vocab.ende.yml
[2018-12-04 07:58:19] [config] - model/vocab.ende.yml
[2018-12-04 07:58:19] [config] word-penalty: 0
[2018-12-04 07:58:19] [config] workspace: 18000
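
For reference, the dump above corresponds roughly to a command line like the following. This is only a sketch reconstructed from the config values, not the exact command that was run, and the ./build/marian path is an assumption about where the binary lives:

./build/marian \
    --model model/model.npz --devices 0 --workspace 18000 \
    --type transformer --enc-depth 6 --dec-depth 6 \
    --dim-emb 1024 --transformer-dim-ffn 4096 --transformer-heads 16 \
    --transformer-decoder-autoreg average-attention \
    --transformer-dropout 0.1 --label-smoothing 0.1 --tied-embeddings-all \
    --learn-rate 0.0002 --lr-warmup 8000 --lr-decay-inv-sqrt 8000 \
    --optimizer-params 0.9 0.98 1e-09 --sync-sgd \
    --mini-batch-fit --maxi-batch 1000 \
    --train-sets data/corpus.bpe.en data/corpus.bpe.de \
    --vocabs model/vocab.ende.yml model/vocab.ende.yml
    # ...plus the validation options from the dump (valid-sets, valid-metrics, valid-script-path)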

And the current speed is:

[2018-12-04 21:47:47] Ep. 2 : Up. 45500 : Sen. 3225664 : Cost 4.50 : Time 699.73s : 3739.52 words/s : L.r. 8.3863e-05
[2018-12-04 21:56:30] Ep. 2 : Up. 46000 : Sen. 3311244 : Cost 4.45 : Time 523.15s : 4969.66 words/s : L.r. 8.3406e-05
[2018-12-04 22:05:18] Ep. 2 : Up. 46500 : Sen. 3396338 : Cost 4.48 : Time 528.35s : 5000.56 words/s : L.r. 8.2956e-05
[2018-12-04 22:14:06] Ep. 2 : Up. 47000 : Sen. 3486593 : Cost 4.40 : Time 527.75s : 4921.26 words/s : L.r. 8.2514e-05
[2018-12-04 22:22:48] Ep. 2 : Up. 47500 : Sen. 3569920 : Cost 4.47 : Time 521.72s : 4928.60 words/s : L.r. 8.2078e-05
[2018-12-04 22:31:34] Ep. 2 : Up. 48000 : Sen. 3654450 : Cost 4.44 : Time 525.71s : 4958.85 words/s : L.r. 8.1650e-05
[2018-12-04 22:40:16] Ep. 2 : Up. 48500 : Sen. 3743201 : Cost 4.38 : Time 522.82s : 4953.88 words/s : L.r. 8.1228e-05
[2018-12-04 22:49:02] Ep. 2 : Up. 49000 : Sen. 3826733 : Cost 4.47 : Time 525.84s : 4992.46 words/s : L.r. 8.0812e-05
[2018-12-04 22:57:44] Ep. 2 : Up. 49500 : Sen. 3909099 : Cost 4.50 : Time 522.03s : 4964.20 words/s : L.r. 8.0403e-05
[2018-12-04 23:06:24] Ep. 2 : Up. 50000 : Sen. 3992501 : Cost 4.46 : Time 520.13s : 4950.51 words/s : L.r. 8.0000e-05
[2018-12-04 23:06:24] Saving model weights and runtime parameters to model/model.npz.orig.npz
[2018-12-04 23:06:29] Saving model weights and runtime parameters to model/model.npz
[2018-12-04 23:06:34] Saving Adam parameters to model/model.npz.optimizer.npz
[2018-12-04 23:06:51] Saving model weights and runtime parameters to model/model.npz.best-ce-mean-words.npz
[2018-12-04 23:06:56] [valid] Ep. 2 : Up. 50000 : ce-mean-words : 3.87816 : new best
[2018-12-04 23:07:03] Saving model weights and runtime parameters to model/model.npz.best-perplexity.npz
[2018-12-04 23:07:07] [valid] Ep. 2 : Up. 50000 : perplexity : 48.3355 : new best
[2018-12-04 23:09:10] Saving model weights and runtime parameters to model/model.npz.best-translation.npz
[2018-12-04 23:09:15] [valid] Ep. 2 : Up. 50000 : translation : 2.12 : new best
[2018-12-04 23:17:57] Ep. 2 : Up. 50500 : Sen. 4075294 : Cost 4.47 : Time 692.35s : 3770.54 words/s : L.r. 7.9603e-05
[2018-12-04 23:26:38] Ep. 2 : Up. 51000 : Sen. 4162162 : Cost 4.40 : Time 521.21s : 4921.39 words/s : L.r. 7.9212e-05
[2018-12-04 23:35:19] Ep. 2 : Up. 51500 : Sen. 4247536 : Cost 4.42 : Time 520.58s : 4929.12 words/s : L.r. 7.8826e-05
[2018-12-04 23:44:04] Ep. 2 : Up. 52000 : Sen. 4330899 : Cost 4.42 : Time 525.49s : 4922.56 words/s : L.r. 7.8446e-05
[2018-12-04 23:52:42] Ep. 2 : Up. 52500 : Sen. 4416347 : Cost 4.41 : Time 518.34s : 4899.49 words/s : L.r. 7.8072e-05
[2018-12-05 00:01:19] Ep. 2 : Up. 53000 : Sen. 4496972 : Cost 4.48 : Time 516.69s : 4917.49 words/s : L.r. 7.7703e-05
[2018-12-05 00:04:53] Seen 4532796 samples
[2018-12-05 00:04:53] Starting epoch 3
[2018-12-05 00:04:53] [data] Shuffling files
[2018-12-05 00:05:18] [data] Done
[2018-12-05 00:11:30] Ep. 3 : Up. 53500 : Sen. 48331 : Cost 4.43 : Time 610.47s : 4284.36 words/s : L.r. 7.7339e-05
[2018-12-05 00:20:10] Ep. 3 : Up. 54000 : Sen. 135459 : Cost 4.34 : Time 520.11s : 4936.42 words/s : L.r. 7.6980e-05
[2018-12-05 00:28:52] Ep. 3 : Up. 54500 : Sen. 217532 : Cost 4.38 : Time 522.46s : 4927.19 words/s : L.r. 7.6626e-05
[2018-12-05 00:37:36] Ep. 3 : Up. 55000 : Sen. 302874 : Cost 4.39 : Time 523.87s : 4947.11 words/s : L.r. 7.6277e-05
[2018-12-05 00:37:36] Saving model weights and runtime parameters to model/model.npz.orig.npz
[2018-12-05 00:37:41] Saving model weights and runtime parameters to model/model.npz
[2018-12-05 00:37:46] Saving Adam parameters to model/model.npz.optimizer.npz
[2018-12-05 00:38:04] Saving model weights and runtime parameters to model/model.npz.best-ce-mean-words.npz
[2018-12-05 00:38:09] [valid] Ep. 3 : Up. 55000 : ce-mean-words : 3.87194 : new best
[2018-12-05 00:38:15] Saving model weights and runtime parameters to model/model.npz.best-perplexity.npz
[2018-12-05 00:38:21] [valid] Ep. 3 : Up. 55000 : perplexity : 48.0354 : new best
[2018-12-05 00:40:30] Saving model weights and runtime parameters to model/model.npz.best-translation.npz
[2018-12-05 00:40:35] [valid] Ep. 3 : Up. 55000 : translation : 2.19 : new best
[2018-12-05 00:49:23] Ep. 3 : Up. 55500 : Sen. 389983 : Cost 4.37 : Time 707.09s : 3701.06 words/s : L.r. 7.5933e-05
[2018-12-05 00:58:12] Ep. 3 : Up. 56000 : Sen. 476315 : Cost 4.37 : Time 528.98s : 5016.82 words/s : L.r. 7.5593e-05
[2018-12-05 01:06:56] Ep. 3 : Up. 56500 : Sen. 559593 : Cost 4.42 : Time 524.28s : 5001.68 words/s : L.r. 7.5258e-05
[2018-12-05 01:15:40] Ep. 3 : Up. 57000 : Sen. 645933 : Cost 4.38 : Time 523.63s : 4954.64 words/s : L.r. 7.4927e-05
[2018-12-05 01:24:22] Ep. 3 : Up. 57500 : Sen. 730572 : Cost 4.39 : Time 522.36s : 4960.07 words/s : L.r. 7.4600e-05
[2018-12-05 01:33:06] Ep. 3 : Up. 58000 : Sen. 815061 : Cost 4.35 : Time 523.58s : 4951.24 words/s : L.r. 7.4278e-05

vkramgovind avatar Dec 05 '18 06:12 vkramgovind

Looks alright to me. If nvidia-smi is telling you that you have some memory left, you could try to increase --workspace.
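
For example (a sketch only; 20000 is just an illustrative value, leave some headroom below whatever free memory nvidia-smi reports, and the config path is a placeholder):

nvidia-smi                                        # check how much GPU memory is actually in use while training
./build/marian -c config.yml --workspace 20000    # placeholder paths; 18000 was used in the run above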

emjotde avatar Dec 05 '18 06:12 emjotde

Sorry, the close was by mistake.

emjotde avatar Dec 05 '18 06:12 emjotde

Thanks for the quick reply. Will close the issue then.

vkramgovind avatar Dec 05 '18 06:12 vkramgovind

Your BLEU seems to be stuck though. I would expect much better scores at that point. I usually train transformer-big models on multiple GPUs, which results in a larger batch size and more stable training. Maybe try --sync-sgd --optimizer-delay 4 with the marian-dev repo. The main repo has a small bug with --sync-sgd which I just fixed in marian-dev today.
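
On a single GPU that would look roughly like this (the config path is a placeholder; --optimizer-delay 4 accumulates gradients over 4 batches before each update, emulating a 4x larger batch):

./build/marian -c config.yml --devices 0 --sync-sgd --optimizer-delay 4    # marian-dev build, placeholder config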

emjotde avatar Dec 05 '18 07:12 emjotde

Hi ,

I tried the suggested configuration. BLEU still seems to be stuck around 2.2 after the 3rd epoch.

Transformer-Base seems to work well on a single GPU.

Any other pointers for Transformer-Big, or does it need a multi-GPU setup?

Thanks.

vkramgovind avatar Dec 06 '18 06:12 vkramgovind

Can you post the log?

emjotde avatar Dec 06 '18 06:12 emjotde

Hi,

I have attached train.log and valid.log. Please let me know if there is anything else you need.

vkramgovind avatar Dec 06 '18 07:12 vkramgovind

You are training with average-attention on purpose? And the data, is that from our example?

emjotde avatar Dec 06 '18 07:12 emjotde

The data is from the example. average-attention was a little faster at inference than the default for Transformer-Base, so I went straight to average-attention for Transformer-Big.

vkramgovind avatar Dec 06 '18 07:12 vkramgovind

OK. I will play around with that and let you know.

emjotde avatar Dec 06 '18 07:12 emjotde

Sorry to hijack this post :) Marcin, the paper and the website are not in line on transformer training speeds: one says about 42K w/s, the other 60K w/s, both for 6 GPUs.

I am getting about 45K on 6 GTX 1080 Tis for the base transformer with onmt-py. Is this comparable? What is the speed for big on 6 GPUs? I am as low as 15K w/s.

Cheers.

vince62s avatar Dec 14 '18 20:12 vince62s

The faster numbers should be correct. If anything they are likely faster now.

For a transformer-big I am getting about 29,000 words/s on 4 GPUs and 47,000 on 8 GPUs. 6 GPUs is a really bad number to use, as NCCL (Nvidia's communication library) works much worse for GPU counts that are not powers of 2.
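
A quick way to check the effect on a 6-GPU box is to compare against a run restricted to a power-of-2 subset of the cards. The command below is only a placeholder sketch; the environment variable works for any CUDA program:

CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/marian -c config.yml --devices 0 1 2 3    # expose only 4 of the 6 GPUs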

emjotde avatar Dec 14 '18 20:12 emjotde

That's on Titan XPs which I believe are comparable to 1080Ti?

emjotde avatar Dec 14 '18 20:12 emjotde

Hijacking since I'm working on GPU server procurement in Edinburgh: is a 10-GPU server reasonable if we tell people to use 8 for an NCCL job and 2 for unrelated things? 10 GPUs in 4U is about the optimal density for our thermal setup.

kpu avatar Dec 14 '18 20:12 kpu

Sorry, wrong numbers: 18,000 on 4 Titan XPs, 30,500 on 8 Titan XPs. No NVLink.

emjotde avatar Dec 14 '18 20:12 emjotde

@kpu Can you go for NVLink? I would happily sacrifice two GPUs in a 10-GPU setup if they have NVLink instead. This gives you near-linear scaling. Numbers below are words/s for Transformer-Big, float32.

GPUs                   1        4               8
Volta (NVLink)         14,400   52,500 (3.7x)   104,000 (7.3x)
Titan Xp (no NVLink)   5,800    17,600 (3.0x)   30,500 (5.3x)

emjotde avatar Dec 14 '18 20:12 emjotde

@emjotde We're academics here... RTX 2080 Tis.

kpu avatar Dec 14 '18 21:12 kpu

I just posted results for Voltas as I do not have other GPUs with NVLink. Aren't the new RTX Tis NVLink capable? Or maybe the Titan RTX?

emjotde avatar Dec 14 '18 21:12 emjotde

@vince62s A transformer base is currently doing between 90,000 and 95,000 wps on 8 Titan XPs for us.

emjotde avatar Dec 14 '18 21:12 emjotde

Do you fit approx. 4096 tokens per GPU for base and 2048 tokens for big? Anyway, you're about 25% faster I would say, which makes sense. I'll have to work on fp16 anyway.

vince62s avatar Dec 14 '18 21:12 vince62s

@kpu For some reason there are issues with the RTX 2080 Ti on some i7 CPUs; they are not getting full speeds.

vince62s avatar Dec 14 '18 21:12 vince62s

This is not fp16, it's full 32-bit.

Honestly, I do not know how many words that is per batch. The auto-fitting relieves me from having to know that :)

emjotde avatar Dec 14 '18 21:12 emjotde

Oh yeah, I know it's fp32, but with fp16 it's supposed to be 3x faster according to FB, right?

vince62s avatar Dec 14 '18 21:12 vince62s

On Voltas, though; they have extra processing units (tensor cores) for that.

emjotde avatar Dec 14 '18 21:12 emjotde

In fact I am at 21K on a DE-EN big model on 6 GPUs. I don't see any issue with 6; it seems almost linear vs. 4 GPUs. I didn't know CUDA supports NVLink (it did not for SLI, I think). The RTX 2080 Ti supports NVLink.

vince62s avatar Dec 16 '18 18:12 vince62s

In case you're interested, I was mistaken before because I used ParaCrawl, which is in fact largely not clean at all (German segments on the English side and vice versa), leading to a src/tgt vocab of 41K even though BPE was set to 32K. I'll see what I can do to clean it.

vince62s avatar Dec 16 '18 18:12 vince62s

Maybe you are not using NCCL? My hand-written communicator also scales well to 6 GPUs, but is otherwise slower. And NVLink makes a huge difference.

emjotde avatar Dec 16 '18 18:12 emjotde

I am using the NCCL2 backend. I'll check this NVLink thing; not sure if PyTorch already takes advantage of it. Is NVLink just pairing 2 GPUs, or can it link as many GPUs as we want?

vince62s avatar Dec 16 '18 18:12 vince62s

I think NCCL uses this automatically; I am not doing anything for that. I believe I saw bridges for up to 4 GPUs? Not sure.
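
One way to see what is actually wired together on a given box is the standard nvidia-smi topology matrix; NV1/NV2/... entries indicate NVLink connections between GPU pairs:

nvidia-smi topo -m    # NV# = NVLink, PIX/PHB/SYS = PCIe/system paths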

emjotde avatar Dec 16 '18 19:12 emjotde

AFAIK, NVLink works for all GPUs inside a single box. Not sure if cross-box setups are possible and/or common.

Can you try to set this environment variable and run it again? It will write out lots of configuration information during NCCL setup and may tell us whether it uses NVLink or not:

NCCL_DEBUG=INFO

You should see something like this, but hopefully it will say something about NVLink instead of shared memory:

pworker0137:111:125 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via direct shared memory
pworker0137:111:123 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
pworker0137:111:126 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via direct shared memory
pworker0137:111:124 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
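
The variable only needs to be set for the training process, e.g. exported for the session or prepended to a single run (the command itself is a placeholder for your usual launch line):

export NCCL_DEBUG=INFO                            # for the whole shell session
NCCL_DEBUG=INFO <your usual training command>     # or just for a single run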

frankseide avatar Dec 17 '18 17:12 frankseide