marian-dev
What's the effect of decoder --mini-batch size?
Hi. I thought the --mini-batch size for marian-decoder controls how many sentences are translated at once. I expected to be able to trade GPU memory for time, but I only see a (predictable) increase in memory usage and no change in translation time. I left the beam size at its default and I am not asking for n-best lists. Am I missing something? Thanks, Ondrej.
That's what it does. Can you post your full command or config? There may be some interactions with other settings. Depending on the model type it should be 10x faster or more. I can translate a WMT test set in 5 seconds on 4 GPUs with a mini-batch size of 64.
GeForce GTX 1070, 8 GB RAM. The runner command is:
marian-decoder --models model.iter150000.npz model.iter200000.npz model.iter250000.npz --vocabs train.src.gz.yml train.tgt.gz.yml --maxi-batch 500 --mini-batch 100
[2018-02-18 00:24:19] [config] allow-unk: false
[2018-02-18 00:24:19] [config] beam-size: 12
[2018-02-18 00:24:19] [config] best-deep: false
[2018-02-18 00:24:19] [config] dec-cell: lstm
[2018-02-18 00:24:19] [config] dec-cell-base-depth: 4
[2018-02-18 00:24:19] [config] dec-cell-high-depth: 2
[2018-02-18 00:24:19] [config] dec-depth: 4
[2018-02-18 00:24:19] [config] devices:
[2018-02-18 00:24:19] [config] - 0
[2018-02-18 00:24:19] [config] dim-emb: 512
[2018-02-18 00:24:19] [config] dim-rnn: 1024
[2018-02-18 00:24:19] [config] dim-vocabs:
[2018-02-18 00:24:19] [config] - 80196
[2018-02-18 00:24:19] [config] - 90158
[2018-02-18 00:24:19] [config] enc-cell: lstm
[2018-02-18 00:24:19] [config] enc-cell-depth: 2
[2018-02-18 00:24:19] [config] enc-depth: 4
[2018-02-18 00:24:19] [config] enc-type: alternating
[2018-02-18 00:24:19] [config] ignore-model-config: false
[2018-02-18 00:24:19] [config] input:
[2018-02-18 00:24:19] [config] - stdin
[2018-02-18 00:24:19] [config] layer-normalization: true
[2018-02-18 00:24:19] [config] log-level: info
[2018-02-18 00:24:19] [config] max-length: 1000
[2018-02-18 00:24:19] [config] max-length-crop: false
[2018-02-18 00:24:19] [config] maxi-batch: 500
[2018-02-18 00:24:19] [config] maxi-batch-sort: none
[2018-02-18 00:24:19] [config] mini-batch: 100
[2018-02-18 00:24:19] [config] models:
[2018-02-18 00:24:19] [config] - model.iter150000.npz
[2018-02-18 00:24:19] [config] - model.iter200000.npz
[2018-02-18 00:24:19] [config] - model.iter250000.npz
[2018-02-18 00:24:19] [config] n-best: false
[2018-02-18 00:24:19] [config] normalize: 0
[2018-02-18 00:24:19] [config] port: 8080
[2018-02-18 00:24:19] [config] quiet: false
[2018-02-18 00:24:19] [config] quiet-translation: false
[2018-02-18 00:24:19] [config] relative-paths: false
[2018-02-18 00:24:19] [config] right-left: false
[2018-02-18 00:24:19] [config] seed: 0
[2018-02-18 00:24:19] [config] skip: true
[2018-02-18 00:24:19] [config] tied-embeddings: true
[2018-02-18 00:24:19] [config] tied-embeddings-all: false
[2018-02-18 00:24:19] [config] tied-embeddings-src: false
[2018-02-18 00:24:19] [config] transformer-dim-ffn: 2048
[2018-02-18 00:24:19] [config] transformer-heads: 8
[2018-02-18 00:24:19] [config] transformer-postprocess: dan
[2018-02-18 00:24:19] [config] transformer-postprocess-emb: d
[2018-02-18 00:24:19] [config] transformer-preprocess: ""
[2018-02-18 00:24:19] [config] type: s2s
[2018-02-18 00:24:19] [config] version: v1.0.0+ac05ddd
[2018-02-18 00:24:19] [config] vocabs:
[2018-02-18 00:24:19] [config] - train.src.gz.yml
[2018-02-18 00:24:19] [config] - train.tgt.gz.yml
[2018-02-18 00:24:19] [config] workspace: 512
[2018-02-18 00:24:19] [config] Model created with Marian v1.0.0+ac05ddd
[2018-02-18 00:24:19] [data] Loading vocabulary from train.src.gz.yml
[2018-02-18 00:24:19] [data] Setting vocabulary size for input 0 to 80196
[2018-02-18 00:24:19] [data] Loading vocabulary from train.tgt.gz.yml
[2018-02-18 00:24:21] [memory] Extending reserved space to 512 MB (device 0)
[2018-02-18 00:24:21] Loading scorer of type s2s as feature F0
[2018-02-18 00:24:21] Loading scorer of type s2s as feature F1
[2018-02-18 00:24:21] Loading scorer of type s2s as feature F2
[2018-02-18 00:24:21] Loading model from model.iter150000.npz
[2018-02-18 00:24:26] Loading model from model.iter200000.npz
[2018-02-18 00:24:32] Loading model from model.iter250000.npz
[2018-02-18 00:24:38] [memory] Reserving 2967 MB, device 0
...and then the produced translations; they seem to come out in order.
Here is what I observe (mini-batch size, total time in seconds, GPU memory used, recorded 120 s after marian-decoder startup):
10 374 3638MiB
20 380 4150MiB
30 347 4152MiB
40 338 4666MiB
50 363 4666MiB
60 358 5180MiB
80 362 5692MiB
90 358 6202MiB
100 362 6202MiB
110 366 6714MiB
120 362 7226MiB
130 378 7738MiB
140 371 7736MiB
As you can see, the time is roughly the same across settings and only grows slightly.
All the models are the last three checkpoints from a single training run with these flags:
--type s2s --enc-depth 4 --enc-type alternating --enc-cell lstm --enc-cell-depth 2 --dec-depth 4 --dec-cell lstm --dec-cell-base-depth 4 --dec-cell-high-depth 2 --tied-embeddings --skip --exponential-smoothing --sync-sgd --mini-batch-fit --layer-normalization --dropout-rnn 0.2 --dropout-src 0.1 --dropout-trg 0.1 --early-stopping 5 --save-freq 50000 --disp-freq 5000 --valid-freq 50000
Maybe the default workspace is too small?
Exactly. Set it as high as possible; I assume these are pretty large models and you chose a large mini-batch size. So, based on the usage, try -w 5000, or higher if it still fits. Memory is reallocated in 512 MB steps and that can take a while, so you are probably losing a lot of time to constant reallocation.
Alternatively, try --mini-batch 32; that's not much slower.
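For example, a minimal sketch of the suggested re-run (model and vocabulary file names taken from the command above; input.src is a placeholder for the actual input file):

# Larger workspace, moderate mini-batch; lower -w if it no longer fits on the 8 GB card.
cat input.src | marian-decoder \
    --models model.iter150000.npz model.iter200000.npz model.iter250000.npz \
    --vocabs train.src.gz.yml train.tgt.gz.yml \
    --mini-batch 32 --maxi-batch 500 \
    -w 5000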
The workspace pre-allocation is not very clear to me. Asking for 5000 doesn't fit.
The model was trained on four 1080 Ti cards with 11 GB RAM each. I tried a few workspace sizes for training and observed that "the model needs" about 5 GB, because I can set the training workspace to 6172 MB (roughly 5 GB model + 6 GB workspace = 11 GB).
Depending on the optimizer, the model alone may need about the same amount of memory or less (no gradients, no moments), so, conservatively, a workspace of 8 - 5 = 3 GB should fit. I now see that -w 4500 also fits.
Do I get it correctly that:
- the model needs what it needs (parameters, plus gradients and moments in training),
- the workspace should ideally be set to eat up all the remaining GPU RAM (couldn't Marian just have a look and use everything available after allocating the model?),
- the batch is then processed within the pre-allocated workspace (but it can possibly go beyond it if there is extra free RAM)?
Trying various batch sizes with -w 4500 now.
-w 5000 was just an educated guess; if 4500 (still close to 5000) works, that's fine.
- Yes. The workspace is only used for forward and backward expansion.
- It depends. If you use --mini-batch-fit, then yes, it will use all available workspace. Making Marian figure this out by itself is very non-trivial, as parameters get allocated dynamically. For instance, the optimizer memory gets allocated for the first time after the first forward/backward expansion. It cannot be done earlier because the parameters are created in that step and the optimizer needs to know how many parameters there are. Since graphs are dynamic, you do not really know how many parameters there are ... you see how this goes on :) I agree this is not ideal. I do not yet have a good idea how to change that. I guess the statistics step would need to involve a whole training step to check that everything fits, which is a bit involved. The reason we do not have this right now is not that we are lazy, it's just really hard :)
- Batch size during translation is set by hand; during training it is determined automatically with --mini-batch-fit. Doing it automatically for translation would probably be possible too, but again a lot less trivial. With --mini-batch-fit memory is essentially guaranteed not to be expanded; if it is, that's a bug. There is no such guarantee for translation.
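To make the two modes concrete, a minimal sketch (the flags are the Marian options discussed in this thread; corpus, vocabulary, and model file names are placeholders):

# Training: --mini-batch-fit chooses the batch size automatically so that it fits
# into the workspace given with -w; the value of --mini-batch itself is then ignored.
marian --type s2s \
    --train-sets corpus.src corpus.tgt --vocabs vocab.src.yml vocab.tgt.yml \
    --mini-batch-fit -w 6000

# Translation: the batch size is whatever is set by hand, and memory use is not
# guaranteed to stay within the pre-allocated workspace.
marian-decoder --models model.npz --vocabs vocab.src.yml vocab.tgt.yml \
    --mini-batch 32 -w 4500 < input.src > output.tgt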
Hi,
thanks for the explanation.
Here are the times for translating 2000 segments with a workspace of 4500 MB (mini-batch size, total time in seconds, GPU RAM used):
10 381 7734MiB
20 398 7734MiB
30 344 7736MiB
32 337 7736MiB
40 342 7738MiB
50 364 7740MiB
60 362 7738MiB
70 362 7740MiB
80 365 7738MiB
90 352 7738MiB
100 352 7738MiB
110 348 7738MiB
120 345 7738MiB
130 360 7738MiB
140 356 7736MiB
I use the RAM reasonably tightly now, but I still don't see the speedup I'd expect. Any idea?
Thanks, O.
What's the speed-up vs. --mini-batch 1? There is not much speed-up to expect going from 10 up to 100, especially not with a complicated set-up like yours, where the computations for each model in the ensemble are executed in series.
Also check powers of 2 or sums of them: 32, 64, 96, 128. There is a religious belief that this might optimize better. Compare 30 to 32 to 40 in your chart.
--mini-batch 1 took 1014 seconds, so about 3x more. I'm evaluating the religion now.
There would be more of an expected speed-up for your setup if I ever get around to implementing proper auto-batching. Then all three models would be executed simultaneously instead of one after another. That would help. How does a single model behave?
There is some truth in this religion
Also, make sure --maxi-batch is a multiple of mini-batch, and add --maxi-batch-sort src
Oh right, forgot about --maxi-batch-sort src. THAT should make a difference.
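A hedged sketch of the decoder call from this thread with that change applied:

# --maxi-batch-sort src sorts the buffered sentences by source length
# before batches are cut, so each batch contains sentences of similar length.
cat input.src | marian-decoder \
    --models model.iter150000.npz model.iter200000.npz model.iter250000.npz \
    --vocabs train.src.gz.yml train.tgt.gz.yml \
    --mini-batch 32 --maxi-batch 100 \
    --maxi-batch-sort src -w 4500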
@hieuhoang --maxi-batch in Marian is already a multiplicative factor of --mini-batch.
Ah, there's a difference between Marian's and Amun's definitions of maxi-batch then. Marian's is better imo.
@emjotde are you using --maxi-batch to buffer input during training too? Otherwise I see no point in not sorting maxi-batches.
I don't have all the numbers yet, but --maxi-batch-sort src is the key. I thought I saw somewhere that the default was src, but all my previous runs indeed had none.
So in my case, a mini-batch of 32 with --maxi-batch-sort src is the best trade-off. Going higher with the batch size does not save much wallclock time (but the available memory is not going to be of any use anyway).
Some of the points are powers of 2, some are not. Too little data to confirm the religion. (But thanks a lot, Hieu!)
So, the takeaway here is, we should make --maxi-batch-sort src the default for translation. It is trg by default for training. Should we go further and make batched translation the default, i.e. --mini-batch 32 --maxi-batch 100 --maxi-batch-sort src?
I am also wondering about the definition of maxi-batch. Currently it's a multiplier of mini-batch. That's sort of fine for translation. I am starting to hate that for training though, especially since the recommended way to train is using --mini-batch-fit, which ignores --mini-batch, and the multiplier becomes a bit weird as it still refers to the ignored --mini-batch. So you have to set two parameters to define how much is being preloaded.
@emjotde I am confused about the relationship between --mini-batch-fit and --maxi-batch. If I use --mini-batch-fit, what is the use of --maxi-batch?
With --mini-batch-fit these two parameters still determine how many sentences are loaded for preparing batches: it's mini-batch x maxi-batch.
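For instance, with made-up numbers just to illustrate the arithmetic (training file names are placeholders):

# --mini-batch-fit picks the actual batch size automatically, but
# --mini-batch x --maxi-batch still controls how many sentences are read ahead,
# sorted, and then cut into batches: here 1000 x 100 = 100,000 sentences.
marian --train-sets corpus.src corpus.tgt --vocabs vocab.src.yml vocab.tgt.yml \
    --mini-batch-fit -w 6000 --mini-batch 1000 --maxi-batch 100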
Thanks for your answer! But if I use --mini-batch-fit, is --mini-batch useless? If so, why use this setting in the script? (--mini-batch-fit -w $WORKSPACE --mini-batch 1000 --maxi-batch 1000)
> So, the takeaway here is, we should make --maxi-batch-sort src the default for translation. It is trg by default for training. Should we go further and make batched translation the default, i.e. --mini-batch 32 --maxi-batch 100 --maxi-batch-sort src?
Yes, you should make --maxi-batch-sort src the default for translation. Currently it is none, which is just terrible. And batch translation should be the default too.
Will change at least the sorting.