marian-dev
What's the effect of decoder --mini-batch size?
Hi. I thought the --mini-batch size for marian-decoder controls how many sentences are translated at once. I expected to be able to trade GPU memory for time, but I only see a (predictable) increase in memory usage and no change in translation time. I left the beam size at its default and I am not asking for n-best lists. Am I missing something? Thanks, Ondrej.
That's what it does. Can you post your full command or config? There may be some interactions with other settings. Depending on the model type it should be 10x faster or more. I can translate a WMT test set in 5 seconds on 4 GPUs with a mini-batch size of 64.
GeForce GTX 1070, 8 GB RAM. The runner command is:
marian-decoder --models model.iter150000.npz model.iter200000.npz model.iter250000.npz --vocabs train.src.gz.yml train.tgt.gz.yml --maxi-batch 500 --mini-batch 100
[2018-02-18 00:24:19] [config] allow-unk: false
[2018-02-18 00:24:19] [config] beam-size: 12
[2018-02-18 00:24:19] [config] best-deep: false
[2018-02-18 00:24:19] [config] dec-cell: lstm
[2018-02-18 00:24:19] [config] dec-cell-base-depth: 4
[2018-02-18 00:24:19] [config] dec-cell-high-depth: 2
[2018-02-18 00:24:19] [config] dec-depth: 4
[2018-02-18 00:24:19] [config] devices:
[2018-02-18 00:24:19] [config] - 0
[2018-02-18 00:24:19] [config] dim-emb: 512
[2018-02-18 00:24:19] [config] dim-rnn: 1024
[2018-02-18 00:24:19] [config] dim-vocabs:
[2018-02-18 00:24:19] [config] - 80196
[2018-02-18 00:24:19] [config] - 90158
[2018-02-18 00:24:19] [config] enc-cell: lstm
[2018-02-18 00:24:19] [config] enc-cell-depth: 2
[2018-02-18 00:24:19] [config] enc-depth: 4
[2018-02-18 00:24:19] [config] enc-type: alternating
[2018-02-18 00:24:19] [config] ignore-model-config: false
[2018-02-18 00:24:19] [config] input:
[2018-02-18 00:24:19] [config] - stdin
[2018-02-18 00:24:19] [config] layer-normalization: true
[2018-02-18 00:24:19] [config] log-level: info
[2018-02-18 00:24:19] [config] max-length: 1000
[2018-02-18 00:24:19] [config] max-length-crop: false
[2018-02-18 00:24:19] [config] maxi-batch: 500
[2018-02-18 00:24:19] [config] maxi-batch-sort: none
[2018-02-18 00:24:19] [config] mini-batch: 100
[2018-02-18 00:24:19] [config] models:
[2018-02-18 00:24:19] [config] - model.iter150000.npz
[2018-02-18 00:24:19] [config] - model.iter200000.npz
[2018-02-18 00:24:19] [config] - model.iter250000.npz
[2018-02-18 00:24:19] [config] n-best: false
[2018-02-18 00:24:19] [config] normalize: 0
[2018-02-18 00:24:19] [config] port: 8080
[2018-02-18 00:24:19] [config] quiet: false
[2018-02-18 00:24:19] [config] quiet-translation: false
[2018-02-18 00:24:19] [config] relative-paths: false
[2018-02-18 00:24:19] [config] right-left: false
[2018-02-18 00:24:19] [config] seed: 0
[2018-02-18 00:24:19] [config] skip: true
[2018-02-18 00:24:19] [config] tied-embeddings: true
[2018-02-18 00:24:19] [config] tied-embeddings-all: false
[2018-02-18 00:24:19] [config] tied-embeddings-src: false
[2018-02-18 00:24:19] [config] transformer-dim-ffn: 2048
[2018-02-18 00:24:19] [config] transformer-heads: 8
[2018-02-18 00:24:19] [config] transformer-postprocess: dan
[2018-02-18 00:24:19] [config] transformer-postprocess-emb: d
[2018-02-18 00:24:19] [config] transformer-preprocess: ""
[2018-02-18 00:24:19] [config] type: s2s
[2018-02-18 00:24:19] [config] version: v1.0.0+ac05ddd
[2018-02-18 00:24:19] [config] vocabs:
[2018-02-18 00:24:19] [config] - train.src.gz.yml
[2018-02-18 00:24:19] [config] - train.tgt.gz.yml
[2018-02-18 00:24:19] [config] workspace: 512
[2018-02-18 00:24:19] [config] Model created with Marian v1.0.0+ac05ddd
[2018-02-18 00:24:19] [data] Loading vocabulary from train.src.gz.yml
[2018-02-18 00:24:19] [data] Setting vocabulary size for input 0 to 80196
[2018-02-18 00:24:19] [data] Loading vocabulary from train.tgt.gz.yml
[2018-02-18 00:24:21] [memory] Extending reserved space to 512 MB (device 0)
[2018-02-18 00:24:21] Loading scorer of type s2s as feature F0
[2018-02-18 00:24:21] Loading scorer of type s2s as feature F1
[2018-02-18 00:24:21] Loading scorer of type s2s as feature F2
[2018-02-18 00:24:21] Loading model from model.iter150000.npz
[2018-02-18 00:24:26] Loading model from model.iter200000.npz
[2018-02-18 00:24:32] Loading model from model.iter250000.npz
[2018-02-18 00:24:38] [memory] Reserving 2967 MB, device 0
...and then the produced translations; they seem to come out in order.
Here is what I observe (mini-batch size, total time in seconds, GPU memory used, recorded 120 s after marian-decoder startup):
10 374 3638MiB
20 380 4150MiB
30 347 4152MiB
40 338 4666MiB
50 363 4666MiB
60 358 5180MiB
80 362 5692MiB
90 358 6202MiB
100 362 6202MiB
110 366 6714MiB
120 362 7226MiB
130 378 7738MiB
140 371 7736MiB
As you can see, the time is roughly the same across settings and only grows slightly.
All the models are the last three checkpoints from a single training run with these flags:
--type s2s --enc-depth 4 --enc-type alternating --enc-cell lstm --enc-cell-depth 2 --dec-depth 4 --dec-cell lstm --dec-cell-base-depth 4 --dec-cell-high-depth 2 --tied-embeddings --skip --exponential-smoothing --sync-sgd --mini-batch-fit --layer-normalization --dropout-rnn 0.2 --dropout-src 0.1 --dropout-trg 0.1 --early-stopping 5 --save-freq 50000 --disp-freq 5000 --valid-freq 50000
Maybe the default workspace is too small?
Exactly. Set it as high as possible; I assume these are pretty large models and you chose a large mini-batch size. So, based on the usage, try -w 5000, or higher if it still fits. Memory is reallocated in 512 MB steps and that can take a while, so you are probably losing a lot of time to constant reallocation.
Alternatively, try --mini-batch 32; that's not much slower.
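For example, a minimal sketch of the suggested re-run (model and vocabulary file names taken from the command above; input.src is a placeholder for the actual input file):

# Larger workspace, moderate mini-batch; lower -w if it no longer fits on the 8 GB card.
cat input.src | marian-decoder \
    --models model.iter150000.npz model.iter200000.npz model.iter250000.npz \
    --vocabs train.src.gz.yml train.tgt.gz.yml \
    --mini-batch 32 --maxi-batch 500 \
    -w 5000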
The workspace pre-allocation is not very clear to me. Asking for 5000 doesn't fit.
The model was trained on four 1080 Ti cards with 11 GB RAM each. I tried a few workspace sizes for training and observed that "the model needs" about 5 GB, because I can set the training workspace to 6172 MB (roughly 5 GB model + 6 GB workspace = 11 GB).
Depending on the optimizer, the model alone may need about the same amount of memory or less (no gradients, no moments), so, conservatively, a workspace of 8 - 5 = 3 GB should fit. I now see that -w 4500 also fits.
Do I get it correctly that:
- the model needs what it needs (parameters, plus gradients and moments in training),
- the workspace should ideally be set to eat up all the remaining GPU RAM (couldn't Marian just have a look and use everything available after allocating the model?),
- the batch is then processed within the pre-allocated workspace (but it can possibly go beyond it if there is extra free RAM)?
Trying various batch sizes with -w 4500 now.
-w 5000 was just an educated guess; if 4500 (still close to 5000) works, that's fine.
- Yes. The workspace is only used for forward and backward expansion.
- It depends. If you use --mini-batch-fit, then yes, it will use all available workspace. Making Marian figure this out by itself is very non-trivial, as parameters get allocated dynamically. For instance, the optimizer memory gets allocated for the first time after the first forward/backward expansion. It cannot be done earlier because the parameters are created in that step and the optimizer needs to know how many parameters there are. Since graphs are dynamic, you do not really know how many parameters there are ... you see how this goes on :) I agree this is not ideal. I do not yet have a good idea how to change that. I guess the statistics step would need to involve a whole training step to check that everything fits, which is a bit involved. The reason we do not have this right now is not that we are lazy, it's just really hard :)
- Batch size during translation is set by hand; during training it is determined automatically with --mini-batch-fit. Doing it automatically for translation would probably be possible too, but again a lot less trivial. With --mini-batch-fit memory is essentially guaranteed not to be expanded; if it is, that's a bug. There is no such guarantee for translation.
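To make the two modes concrete, a minimal sketch (the flags are the Marian options discussed in this thread; corpus, vocabulary, and model file names are placeholders):

# Training: --mini-batch-fit chooses the batch size automatically so that it fits
# into the workspace given with -w; the value of --mini-batch itself is then ignored.
marian --type s2s \
    --train-sets corpus.src corpus.tgt --vocabs vocab.src.yml vocab.tgt.yml \
    --mini-batch-fit -w 6000

# Translation: the batch size is whatever is set by hand, and memory use is not
# guaranteed to stay within the pre-allocated workspace.
marian-decoder --models model.npz --vocabs vocab.src.yml vocab.tgt.yml \
    --mini-batch 32 -w 4500 < input.src > output.tgt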
Hi,
thanks for the explanation.
Here are the times for translating 2000 segments with a workspace of 4500 MB (mini-batch size, total time in seconds, GPU RAM used):
10 381 7734MiB
20 398 7734MiB
30 344 7736MiB
32 337 7736MiB
40 342 7738MiB
50 364 7740MiB
60 362 7738MiB
70 362 7740MiB
80 365 7738MiB
90 352 7738MiB
100 352 7738MiB
110 348 7738MiB
120 345 7738MiB
130 360 7738MiB
140 356 7736MiB
I use the RAM reasonably tightly now, but I still don't see the speedup I'd expect. Any idea?
Thanks, O.
What's the speed-up vs. --mini-batch 1? There is not much speed-up to expect going from 10 up to 100, especially not with a complicated set-up like yours, where the computations for each model in the ensemble are executed in series.
Also check powers of 2 or sums of them: 32, 64, 96, 128. There is a religious belief that this might optimize better. Compare 30 to 32 to 40 in your chart.
--mini-batch 1 took 1014 seconds, so about 3x more. I'm evaluating the religion now.
There would be more of an expected speed-up for your setup if I ever get around to implementing proper auto-batching. Then all three models would be executed simultaneously instead of one after another. That would help. How does a single model behave?
There is some truth in this religion
Also, make sure --maxi-batch is a multiple of mini-batch, and add --maxi-batch-sort src
Oh right, forgot about --maxi-batch-sort src. THAT should make a difference.
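A hedged sketch of the decoder call from this thread with that change applied:

# --maxi-batch-sort src sorts the buffered sentences by source length
# before batches are cut, so each batch contains sentences of similar length.
cat input.src | marian-decoder \
    --models model.iter150000.npz model.iter200000.npz model.iter250000.npz \
    --vocabs train.src.gz.yml train.tgt.gz.yml \
    --mini-batch 32 --maxi-batch 100 \
    --maxi-batch-sort src -w 4500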
@hieuhoang --maxi-batch in Marian is already a multiplicative factor of --mini-batch.
Ah, there's a difference between Marian's and Amun's definitions of maxi-batch then. Marian's is better imo.
@emjotde are you using --maxi-batch to buffer input during training too? Otherwise I see no point in not sorting maxi-batches.
I don't have all the numbers yet, but --maxi-batch-sort src is the key. I thought I saw somewhere that the default was src, but all my previous runs indeed had none.
So in my case, a mini-batch of 32 with --maxi-batch-sort src is the best trade-off. Going higher with the batch size does not save much wallclock time (but the available memory is not going to be of any use anyway).
Some of the points are powers of 2, some are not. Too little data to confirm the religion. (But thanks a lot, Hieu!)
So, the takeaway here is, we should make --maxi-batch-sort src the default for translation. It is trg by default for training. Should we go further and make batched translation the default, i.e. --mini-batch 32 --maxi-batch 100 --maxi-batch-sort src?
I am also wondering about the definition of maxi-batch. Currently it's a multiplier of mini-batch. That's sort of fine for translation. I am starting to hate that for training though, especially since the recommended way to train is using --mini-batch-fit, which ignores --mini-batch, and the multiplier becomes a bit weird as it still refers to the ignored --mini-batch. So you have to set two parameters to define how much is being preloaded.
@emjotde I am confused about the relationship between --mini-batch-fit and --maxi-batch. If I use --mini-batch-fit, what is the use of --maxi-batch?
With --mini-batch-fit these two parameters still determine how many sentences are loaded for preparing batches: it's mini-batch x maxi-batch.
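For instance, with made-up numbers just to illustrate the arithmetic (training file names are placeholders):

# --mini-batch-fit picks the actual batch size automatically, but
# --mini-batch x --maxi-batch still controls how many sentences are read ahead,
# sorted, and then cut into batches: here 1000 x 100 = 100,000 sentences.
marian --train-sets corpus.src corpus.tgt --vocabs vocab.src.yml vocab.tgt.yml \
    --mini-batch-fit -w 6000 --mini-batch 1000 --maxi-batch 100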
Thanks for your answer! But if I use --mini-batch-fit, is --mini-batch useless? If so, why use this setting in the script? (--mini-batch-fit -w $WORKSPACE --mini-batch 1000 --maxi-batch 1000)
> So, the takeaway here is, we should make --maxi-batch-sort src the default for translation. It is trg by default for training. Should we go further and make batched translation the default, i.e. --mini-batch 32 --maxi-batch 100 --maxi-batch-sort src?
Yes, you should make --maxi-batch-sort src the default for translation. Currently it is none, which is just terrible. And batch translation should be the default too.
Will change at least the sorting.